Hybrid MDP based integrated hierarchical Q-learning

SCIENCE CHINA Information Sciences

. RESEARCH PAPERS .

November 2011 Vol. 54 No. 11: 2279–2294 doi: 10.1007/s11432-011-4332-6

CHEN ChunLin1, DONG DaoYi2,3*, LI Han-Xiong4 & TARN Tzyh-Jong5

1 Department of Control and System Engineering, and State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China;
2 Institute of Cyber-Systems and Control, State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, China;
3 School of Engineering and Information Technology, University of New South Wales at the Australian Defence Force Academy, Canberra, ACT 2600, Australia;
4 Department of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Hong Kong 999077, China;
5 Department of Electrical and Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA

* Corresponding author (email: [email protected])

Received April 16, 2009; accepted April 11, 2010; published online August 11, 2011

Abstract As a widely used reinforcement learning method, Q-learning is bedeviled by the curse of dimensionality: the computational complexity grows dramatically with the size of the state-action space. To combat this difficulty, an integrated hierarchical Q-learning framework is proposed based on the hybrid Markov decision process (MDP) using temporal abstraction instead of the simple MDP. The learning process is naturally organized into multiple levels of learning, e.g., the quantitative (lower) level and the qualitative (upper) level, which are modeled as an MDP and a semi-MDP (SMDP), respectively. This hierarchical control architecture constitutes a hybrid MDP as the model of hierarchical Q-learning, which bridges the two levels of learning. The proposed hierarchical Q-learning can scale up very well and speed up learning with the upper level learning process. Hence this approach is an effective integral learning and control scheme for complex problems. Several experiments are carried out using a puzzle problem in a gridworld environment and a navigation control problem for a mobile robot. The experimental results demonstrate the effectiveness and efficiency of the proposed approach.

Keywords reinforcement learning, hierarchical Q-learning, hybrid MDP, temporal abstraction

Citation Chen C L, Dong D Y, Li H X, et al. Hybrid MDP based integrated hierarchical Q-learning. Sci China Inf Sci, 2011, 54: 2279–2294, doi: 10.1007/s11432-011-4332-6

1 Introduction

Reinforcement learning (RL) [1, 2] addresses the learning issue of how an autonomous agent that perceives and acts in its environment can learn to choose optimal policies to achieve its goals and maximize the accumulation of a scalar value called reward in the learning process. RL is an active area of machine learning and has been extensively applied to problems ranging from operations research to robotics [3–9]. Compared with other learning methods, RL is a learning technique based on trial and error. That is, RL involves approximating solutions to stochastic optimal control problems under the condition of incomplete knowledge of the system, where the feedback for the closed loop is an imprecise value called reward or penalty. The task of the agent is to learn from this approximating process, through direct delayed reward, to choose sequences of actions that produce the greatest cumulative reward [1]. Due to these intrinsic characteristics, an RL system can learn online the mapping from states to actions without prior knowledge of its environment, which makes the system adaptable to initially unknown environments.

However, for most existing RL methods, such as the TD algorithm proposed by Sutton [10] and the Q-learning of Watkins [11], many issues require further attention and some of them are longstanding challenges. One problem is that the learning process of RL is usually slow and is bedeviled by the curse of dimensionality. The number of parameters increases quickly with the size of the state-action space, which dramatically slows down the learning process. Another problem is that RL learns through numerical computation and is hard pressed to mimic human-level intelligence. It is difficult to implement functions such as information organization and task decomposition, which limits the scalability of an RL system. To overcome these difficulties, many researchers have made great efforts to improve RL algorithms. For example, fuzzy logic, grey theory and quantum computation have been adopted for the generalization and speedup of RL [12–18]. Various factors have been studied to explore performance improvements for RL [19, 20]. RL methods with a hierarchical architecture have also been studied [21–25]. The hierarchical control architecture using temporal abstraction has been proven to be an effective approach to combat the curse of dimensionality and to reveal human-like task decomposition abilities. However, most of the existing methods are based on modules with a simple switch, or they emphasize only the detailed theoretical study of the upper level learning. These approaches neglect the coordination of the lower level and the upper level learning processes from the point of view of applications.

Therefore this paper presents an integrated hierarchical Q-learning (HQL) framework based on the model of a hybrid MDP using temporal abstraction. The qualitative abstraction of an MDP naturally turns into an SMDP. Based on qualitative algebra, a hybrid MDP may be defined by combining the MDP and the SMDP, and it constitutes the model of HQL. This approach not only allows the HQL system to accomplish the learning process in an integral mode, but also provides a mechanism to speed up the learning of the lower level via the upper level learning, which operates on a low-dimensional state-action space. Hence in the execution of the proposed HQL algorithm, the multiple levels of the hierarchical learning structure are coordinated as an integrated whole and the learning performance is greatly improved.

This paper is organized as follows. Section 2 introduces standard Q-learning and reviews related hierarchical RL methods with an emphasis on temporal abstraction. In section 3, a concept of hybrid MDP is defined on the combination of MDP and SMDP, and a hierarchical Q-learning algorithm is proposed based on the model of hybrid MDP. In section 4, two examples, a puzzle problem and a robot navigation problem, are demonstrated to verify the performance of the proposed algorithm. Conclusions and future work are given in section 5.

2 Q-learning and hierarchical reinforcement learning

2.1 Q-learning

Q-learning is a widely used RL algorithm in many areas and learns an evaluation function over state-action pairs. It can be employed even when the agent has no prior knowledge of how its actions affect its environment, which makes it more flexible and adaptable than other RL algorithms. Its standard framework is based on discrete-time finite Markov decision processes (MDPs) [1].

Definition 1 (MDP). A Markov decision process (MDP) is composed of the following five factors: $\{S, A_{(i)}, p_{ij}(a), r_{(i,a)}, V\}$, $i, j \in S$, $a \in A_{(i)}$, where $S$ is the state space; $A_{(i)}$ is the action space for state $i$; $p_{ij}(a)$ is the probability for state transition; $r_{(i,a)}$ is the reward function for choosing action $a$ at state $i$, $r: \Gamma \to (-\infty, +\infty)$, where $\Gamma = \{(i,a) \mid i \in S, a \in A_{(i)}\}$; $V$ is a criterion function or objective function.

Based on the definition of MDP, RL algorithms learn through trial and error, and assume that the state $S$ and action $A_{(i)}$ can be divided into discrete values. At a certain step $t$, the agent observes the state of the environment (inside and outside of the agent) $s_t$, and then chooses an action $a_t$. After executing the action, the agent receives a reward $r_{t+1}$, which reflects how good that action is (in a short-term sense), and the state of the environment changes into $s_{t+1}$. Then the agent chooses the next action $a_{t+1}$ according to related knowledge. The goal of reinforcement learning is to learn a mapping from states to actions. That is, the agent is to learn a policy $\pi: S \times \bigcup_{i \in S} A_{(i)} \to [0,1]$, so that the expected sum of discounted rewards of each state will be maximized:

$$V^{\pi}_{(s)} = E\{r_{t+1} + \gamma r_{t+2} + \cdots \mid s_t = s, \pi\} = E[r_{t+1} + \gamma V^{\pi}_{(s_{t+1})} \mid s_t = s, \pi] = \sum_{a \in A(s)} \pi(s,a) \Big[ r^a_s + \gamma \sum_{s'} p^a_{ss'} V^{\pi}_{(s')} \Big], \tag{1}$$

where $\gamma \in [0,1]$ is a discount factor, $\pi(s,a)$ is the probability of selecting action $a$ at state $s$ under policy $\pi$, $p^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ is the probability for state transition, and $r^a_s = E\{r_{t+1} \mid s_t = s, a_t = a\}$ is the expected one-step reward. We have the optimal state-value function:

$$V^{*}_{(s)} = \max_{a \in A(s)} \Big[ r^a_s + \gamma \sum_{s'} p^a_{ss'} V^{*}_{(s')} \Big], \tag{2}$$

$$\pi^{*} = \arg\max_{\pi} V^{\pi}_{(s)}, \quad \forall s \in S. \tag{3}$$

As for Q-learning, which learns the value function over state-action pairs, similarly we have the following equations:

$$Q^{\pi}_{(s,a)} = E\{r_{t+1} + \gamma r_{t+2} + \cdots \mid (s_t, a_t) = (s,a), \pi\} = r^a_s + \gamma \sum_{s'} p^a_{ss'} V^{\pi}_{(s')} = r^a_s + \gamma \sum_{s'} p^a_{ss'} \sum_{a'} \pi(s', a') Q^{\pi}_{(s',a')}, \tag{4}$$

$$Q^{*}_{(s,a)} = \max_{\pi} Q^{\pi}_{(s,a)} = r^a_s + \gamma \sum_{s'} p^a_{ss'} \max_{a'} Q^{*}_{(s',a')}, \tag{5}$$

where $Q^{\pi}_{(s,a)}$ stands for the value of taking action $a$ in state $s$ under policy $\pi$. The recursive definition of the Q-function provides the basis for the Q-learning algorithm. Let $\alpha_k \in [0,1)$ be the learning rate. Then the one-step update rule of Q-learning is

$$Q_{(s_t,a_t)} \leftarrow (1 - \alpha_k) Q_{(s_t,a_t)} + \alpha_k \big( r_{t+1} + \gamma \max_{a'} Q_{(s_{t+1},a')} \big). \tag{6}$$

Therefore the Q-function is a predictive function that estimates the expected return from the current state-action pair. Given accurate values of the Q-function (called Q values), an optimal decision policy simply selects the action with the highest Q value in each state. The Q-function is learned through incremental dynamic programming, which maintains an estimate of the Q values that is updated using the immediate reward and the estimated rewards from subsequent states. A typical Q-learning algorithm is shown as Algorithm 1 [1, 11].

Algorithm 1. One-step Q-learning algorithm
1. Initialize Q(s, a) arbitrarily.
2. Repeat for each episode until the learning process ends:
  2.1 Initialize s and set k ← 0.
  2.2 Repeat for each step of the episode until s is terminal:
    2.2 (i) Choose a for s using a policy derived from Q (e.g., ε-greedy).
    2.2 (ii) Take action a, observe s', r.
        Q(s, a) ← (1 − α_k)Q(s, a) + α_k[r + γ max_{a'} Q(s', a')],
        s ← s', k ← k + 1.

Using the one-step Q-learning algorithm, the estimated Q values of the agent converge to the actual Q-function if the system can be modeled as a deterministic MDP, the reward function r is bounded, and actions are chosen so that every state-action pair is visited infinitely often. For a nondeterministic MDP model, convergence is guaranteed when extra conditions on the learning rates are satisfied [12]. Besides Q-learning, there are many other effective standard RL algorithms, such as TD, SARSA and Q(λ)-learning. For more details please refer to [1].
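To make Algorithm 1 and update rule (6) concrete, below is a minimal tabular Q-learning sketch in Python. The environment interface (env.reset() returning a state, env.step(a) returning the next state, the reward and a terminal flag, plus a finite list env.actions) and the default parameter values are illustrative assumptions rather than anything specified in the paper.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=5000, alpha=0.01, gamma=0.9, epsilon=0.01):
    """Tabular one-step Q-learning, following Algorithm 1 and update rule (6)."""
    Q = defaultdict(float)                      # Q[(s, a)], arbitrarily initialized to 0

    def choose_action(s):
        # epsilon-greedy policy derived from Q
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = choose_action(s)
            s_next, r, done = env.step(a)
            # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma * max_a' Q(s',a')]
            target = r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```

Plugging in a small gridworld as env gives the kind of flat learner that serves as the baseline in section 4.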

Figure 1 Sub-agent architecture with a switch for RL.

2.2 Hierarchical reinforcement learning

When standard Q-learning algorithms are applied to complex problems with a large state-action space, there are many limitations. For example, the curse of dimensionality occurs with the expansion of the state-action space, the environments have to be modeled as MDPs, and the tasks are mainly confined to reactive ones. Hence standard Q-learning cannot scale up very well. To combat these difficulties, different approaches have already been proposed. For example, some work has addressed RL problems with infinite state-action spaces using function approximation [26], gradient-based approaches [27] and kernel-based indirect methods [28]. For practical engineering applications, RL systems with hierarchical settings have been proven effective using spatial or temporal abstraction. The existing RL methods with a hierarchical architecture can mainly be classified into two categories: modular methods and hierarchical reinforcement learning methods.

Modular methods are the most straightforward approach for complex tasks, dividing the problem into modular parts. A "divide-and-conquer" policy is applied with sub-agents, and the sub-agents resolve conflicting desires through negotiation to determine whether to switch between sub-agents. One such approach is known as W-learning. Hallerdal et al. [22] applied W-learning to behavior selection on a mobile robot. There are also other switch methods, such as the simple switch method and the hierarchical Q-learning method [23], which use another agent to learn the value function Q(state, sub-agent). Figure 1 shows the schematic sub-agent architecture for RL with a simple switch. The system is generally composed of several agents and only one of them can be active at a time. The agent selection is carried out using a switch, whose decision can also be learned by another agent.

Another approach is hierarchical reinforcement learning, where "hierarchical" mainly emphasizes temporal abstraction. Although temporal abstraction and qualitative modeling approaches have been explored in AI for a long time, research in the RL area did not begin until recently. Barto and Mahadevan [24] review several related approaches to temporal abstraction and hierarchical control. There are mainly three approaches to hierarchical reinforcement learning: 1) the options formalism [29], 2) the hierarchies of abstract machines (HAMs) approach [30], and 3) the MAXQ value function decomposition framework [31]. Most of these approaches rely on the theory of the semi-Markov decision process (SMDP) to provide a formal basis and use a simple type of abstraction called a "macro". In fact, hierarchical approaches in RL generalize the macro idea to closed-loop policies, which are sometimes called temporally-extended actions, options, skills or behaviors. For example, in the options formalism, Sutton et al. [29] extended the usual notion of action to options, which are closed-loop policies for taking action over a period of time. The options formalism shows that options enable temporally abstract knowledge and action to be included in the RL framework in a natural and general way. Theocharous [32] investigated in depth a hierarchical partially observable MDP (HPOMDP) model to scale learning and planning to large scale partially observable environments, in which the key ideas are also spatial and temporal abstraction.
Most of the methods introduced above aim at representing knowledge flexibly at multiple levels of temporal abstraction to speed up planning and learning on large scale problems [24, 29]. However, these approaches are based on the framework of SMDP and emphasize the decomposition of tasks and the upper level learning process, and they lack a mechanism to guarantee the coordination of the different components and learning levels. Hence, following the idea of temporal abstraction, we formulate the abstraction from the point of view of quantitative-qualitative operation and present a hierarchical RL framework within the context of hybrid MDPs. Then a hierarchical Q-learning algorithm is proposed as an integrated learning algorithm for complex problems. It can coordinate multiple levels of temporal abstraction and speed up the learning process as well. Because the upper level is usually a qualitative abstraction of the lower level, in the following sections the lower level and the upper level are also called the quantitative level and the qualitative level, respectively, and the two pairs of terms are used interchangeably.

Figure 2 Temporal abstraction for a two-level hierarchy.

3 Hierarchical Q-learning

The idea of hierarchical Q-learning for the hybrid control of robot navigation was initiated in [33]. In this section, the hierarchical Q-learning algorithm is addressed in a theoretical and formal form. First, temporal abstraction is carried out on the model of an MDP, where the upper level is regarded as the abstraction of the lower level and constitutes an SMDP. To combine the lower and upper levels in a uniform framework, the hybrid MDP defined below is adopted as the model of hierarchical RL. Then an integrated hierarchical Q-learning algorithm is presented.

3.1 Temporal abstraction

Temporal abstraction is an effective technique to make a complex system tractable with a hierarchical structure. Besides temporal abstraction, spatial abstraction is another abstraction method that has been used in qualitative spatial reasoning. For sequential decision problems like most RL problems, spatial abstraction can naturally be converted into temporal abstraction. Here we only focus on temporal abstraction, and the mechanism of a two-level hierarchy (there may be three levels, but the medium level is optional) using temporal abstraction is shown in Figure 2, where each bead represents a distinct state. The upper level is the temporal abstraction of the lower level, which means that time flows quickly on the upper level and slowly on the lower level. The time scale of the (optional) medium level lies between them. In qualitative algebra, quantitative-qualitative issues are regarded as different angles of view on the respective levels [34, 35]. From this point of view, we look upon the lower level and its temporal abstraction, i.e., the upper level, as the quantitative and qualitative representations, respectively. In the hierarchical RL framework, the upper level (qualitative) representation may be defined by abstracting the lower level (quantitative) representation. Compared with standard RL based on an MDP, the hierarchical RL framework is based on an SMDP. As shown in Figure 3, an SMDP may be constructed by qualitative abstraction from an MDP, where $qs_i$ is defined as the qualitative abstraction of a group of local states in a certain region and $qa_i$ is the qualitative action corresponding to the counterpart of the sequential actions in this region. Then we have the following definition and theorem.

Definition 2 (SMDP). A semi-Markov decision process (SMDP) is composed of the following six factors: $\{S, A_{(i)}, p_{ij}(a), T_{(\cdot \mid i,a,j)}, r_{(u,i,a,j,\tau)}, V\}$, $i, j \in S$, $a \in A_{(i)}$, where $S$ is the state space; $A_{(i)}$ is the action space for state $i$; $p_{ij}(a)$ is the probability for the state transition from $i$ to $j$ with action $a$; the time of the transition from state $i$ to $j$ is a non-negative stochastic variable $T_{(\cdot \mid i,a,j)}$; if the transition time $T_{(\cdot \mid i,a,j)} = \tau$, the reward during the transition process is $r_{(u,i,a,j,\tau)}$ in the time interval $[0, u]$ ($u \leq \tau$); if the transition completes, $u = \tau$.

Figure 3 Sketch of qualitative abstraction from MDP to SMDP.

Theorem 1. For an MDP $\{S, A_{(i)}, p_{ij}(a), r_{(i,a)}, V\}$, $i, j \in S$, $a \in A_{(i)}$, if $QS = \{qs_i\}$, $QS \neq \emptyset$ is the qualitative universe of $S$, and $QA = \bigcup QA_{qs_i}$, $QA \neq \emptyset$ is the qualitative universe of $A = \bigcup A_{(i)}$, then $\{QS, QA_{(i)}, p_{ij}(qa), T_{(\cdot \mid i,qa,j)}, r_{(u,i,qa,j,\tau)}, V\}$, $i, j \in QS$, $qa \in QA_{(i)}$, is an SMDP based on the qualitative abstraction of the MDP $\{S, A_{(i)}, p_{ij}(a), r_{(i,a)}, V\}$, where $V$ is a criterion function or objective function.

Proof (Skipped). According to Definition 2, the conclusion of Theorem 1 is straightforward.

Then the policy in an SMDP is defined as $Q\pi = \{q\pi_i\}: QS \times \bigcup_{i \in QS} QA_{(i)} \to [0,1]$. Under the policy $Q\pi$, the qualitative state value function (with state transition time $\tau$) is $V^{q\pi}_{(qs)} = E\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots\}$. Under the policy $Q\pi$, if $u = \tau$, let $r_{(i,qa)} = r_{(u,i,qa,j,\tau)}|_{u=\tau} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{\tau-1} r_{t+\tau}$ and $p_{ij}(qa) = P(qs' \mid qs, qa) = \gamma^{\tau} P(qs', \tau \mid qs, qa)$, where $i = qs$, $j = qs'$; then the SMDP $\{QS, QA_{(i)}, p_{ij}(qa), T_{(\cdot \mid i,qa,j)}, r_{(u,i,qa,j,\tau)}, V\}$ is equivalent to the MDP $\{QS, QA_{(i)}, p_{ij}(qa), r_{(i,qa)}, V\}$, $i, j \in QS$, $qa \in QA_{(i)}$.

It is clear that the SMDP of the qualitative environment model retains the characteristics of an MDP. Under certain conditions, techniques such as dynamic programming and reinforcement learning can also be adopted to solve the related problems, which dramatically reduces the difficulty of solving these problems by using the model of MDP instead of SMDP. The qualitative (upper) level learning is also modeled as an MDP and most of the standard RL methods are applicable without intrinsic modification. The qualitative state value function is

$$V^{q\pi}_{(qs)} = E\{r_{(qs,qa)} + \gamma^{\tau} r_{(qs',qa')} + \cdots\}, \tag{7}$$

$$V^{*}_{(qs)} = \max_{qa \in QA(qs)} \Big[ r_{(qs,qa)} + \sum_{qs'} P(qs' \mid qs, qa) V^{*}_{(qs')} \Big]. \tag{8}$$

The qualitative state-action value function is

$$Q^{q\pi}_{(qs,qa)} = E\{r_{(qs,qa)} + \gamma^{\tau} r_{(qs',qa')} + \cdots\}, \tag{9}$$

$$Q^{*}_{(qs,qa)} = r_{(qs,qa)} + \sum_{qs'} P(qs' \mid qs, qa) \max_{qa' \in QA(qs')} Q^{*}_{(qs',qa')}. \tag{10}$$

For the updating of the qualitative state-action value function, the corresponding one-step update rule of Q-learning is

$$Q(qs, qa) \leftarrow (1 - \alpha_k) Q(qs, qa) + \alpha_k \big( r_{(qs,qa)} + \gamma^{\tau} \max_{qa'} Q(qs', qa') \big). \tag{11}$$
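As a small illustration of update rule (11), the sketch below performs one upper level (qualitative) update once the qualitative action has completed, i.e., when the transition time τ and the reward accumulated over the transition are known. The function signature and variable names are assumptions introduced for illustration.

```python
def qualitative_q_update(Q, qs, qa, reward, qs_next, tau,
                         qualitative_actions, alpha=0.01, gamma=0.9):
    """One-step update of the upper-level (SMDP) Q value, cf. eq. (11).

    `reward` is the reward accumulated while qualitative action qa was
    being executed, and `tau` is the number of lower-level time steps taken
    by the transition qs -> qs_next, so future value is discounted by gamma**tau.
    """
    best_next = max(Q.get((qs_next, qa2), 0.0) for qa2 in qualitative_actions)
    target = reward + (gamma ** tau) * best_next
    Q[(qs, qa)] = (1 - alpha) * Q.get((qs, qa), 0.0) + alpha * target
    return Q
```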

3.2 Hierarchical Q-learning

Learning on an upper level abstract space can speed up computation remarkably due to the dramatic reduction of the state-action space. But qualitative abstraction of the original problem always loses some information, which leads to new difficulties in obtaining the right solutions. That is to say, we can only get a suboptimal policy and imprecise control sequences by learning and planning in a qualitative space. Hence it is necessary to optimize the learning process by taking most advantage of the precise lower level information. The combination of qualitative reasoning and quantitative computing is a good candidate approach. Based on qualitative algebra, we present the definition of the hybrid MDP to formulate this hierarchical architecture.

Definition 3 (hybrid MDP). Suppose the SMDP $\{QS, QA_{(i'')}, p_{i''j''}(qa), T_{(\cdot \mid i'',qa,j'')}, r_{(u,i'',qa,j'',\tau)}, V\}$, $i'', j'' \in QS$, $qa \in QA_{(i'')}$, is the qualitative abstraction of the MDP $\{S, A_{(i')}, p_{i'j'}(a), r_{(i',a)}, V\}$, $i', j' \in S$, $a \in A_{(i')}$. Let

1) $HS = QS \cup S$, with landmark states $l \in HS$ linking $QS$ and $S$;

2) $HA = QA \cup A$;

3) $P = \{p_{ij}\}$, where
$$p_{ij} = \begin{cases} p_{i'j'}(a), & \text{if } i = i',\ j = j', \\ p_{i''j''}(qa), & \text{if } i = i'',\ j = j'', \\ 0, & \text{otherwise}; \end{cases}$$

4)
$$R = \begin{cases} r_{(u,i'',qa,j'',\tau)}, & \text{if } i \in QS, \\ r_{(i',a)}, & \text{if } i \in S; \end{cases}$$

5) $H\pi = \pi \cup Q\pi$, $\pi = \{\pi_i\}$, $Q\pi = \{q\pi_j\}$, where $\pi_i$ is for quantitative state $s_i \in S$ and $q\pi_j$ is for qualitative state $qs_j \in QS$.

The five factors $\{HS, HA, P, R, V\}$ are defined as a hybrid MDP on the state universe $HS$.

Remark 1. According to Definition 3, compared with the definition of MDP, the hybrid MDP consists of two kinds of descriptions, at the lower level and the upper level respectively, whose history is composed of two sequences of states and decisions: 1) on the lower level, $h_n = (s_0, a_0, s_1, \ldots, s_{n-1}, a_{n-1}, s_n)$, $n \geq 0$, with policy $\pi = (\pi_0, \pi_1, \ldots)$; and 2) on the upper level, $qh_n = (qs_0, qa_0, qs_1, \ldots, qs_{m-1}, qa_{m-1}, qs_m)$, $m \geq 0$, with policy $Q\pi = (q\pi_0, q\pi_1, \ldots)$. Hence the learning agent in a hybrid MDP learns two kinds of experience and computes at these two levels. At a certain step, the agent observes the state of the environment and represents it with $s_t$ on the lower level and with $qs_t$ on the upper level, where $qs_t$ is usually the notion of a region that includes $s_t$. Then the agent chooses an action $a_t$ and a qualitative action $qa_t$. After executing the action, on the lower level model the agent reaches the next state $s_{t+1}$, receives a reward $r_{t+1}$ for $a_t$ and memorizes this experience. On the upper level, if the agent reaches the next qualitative state $qs_{t+1}$, it also receives a reward $r'_{t+1}$ for the qualitative action $qa_t$ and memorizes this experience. If the agent is still in qualitative state $qs_t$, the qualitative action is not completed and the agent continues to use the action until there is a transition between qualitative states, where the landmark marks the transition between them.

Definition 3 gives a hybrid formulation for complex problems with abstraction, where the landmark state is the key to connecting the different representation levels. According to the above discussion, Algorithm 2 shows a hierarchical Q-learning (HQL) algorithm based on the model of the hybrid MDP $\{HS, HA, P, R, V\}$.

Algorithm 2. HQL algorithm
1. Initialize Q(hs, ha) arbitrarily.
2. Repeat for each episode until the learning process ends:
  2.1 Initialize hs, k ← 0.
  2.2 Repeat for each step of the episode until hs is terminal:
    2.2.1 Choose ha for hs using a policy derived from Q (e.g., ε-greedy).
    2.2.2 Take action ha, observe R, hs'.
      2.2.2 (i) Q_k(hs, ha) ← (1 − α_k)Q_{k−1}(hs, ha) + α_k[R + γ max_{ha'} Q_{k−1}(hs', ha')].
      2.2.2 (ii) If hs ∈ QS, update the Q value for the corresponding landmark state l and action a:
            Q(l, a) = (1 − η)Q(l, a) + ηQ(hs, ha), a ∈ A, η ∈ [0, 1), with η updated as η ← η^m, where m > 1 is a scalar constant.
      2.2.2 (iii) hs ← hs', k ← k + 1.
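The sketch below shows the core of one learning step of Algorithm 2: the Q update on the hybrid state-action space followed by the propagation of experience to the landmark state when the current hybrid state is qualitative. The data structures, the mapping landmark_of from a qualitative state to its landmark state, and the reading that all lower level actions of the landmark are updated (the paper writes "a ∈ A") are assumptions made for illustration.

```python
def hql_step(Q, hs, ha, R, hs_next, hybrid_actions, QS, landmark_of,
             lower_actions, eta_state, alpha=0.01, gamma=0.9, m=1.2):
    """One step of the HQL update (Algorithm 2, steps 2.2.2 (i)-(iii)).

    Q maps (hybrid_state, hybrid_action) -> value, QS is the set of
    qualitative states, landmark_of[qs] gives the landmark state l of qs,
    and eta_state carries the decaying step eta between calls.
    """
    # 2.2.2 (i): ordinary Q-learning update on the hybrid state-action space
    best_next = max(Q.get((hs_next, ha2), 0.0) for ha2 in hybrid_actions)
    Q[(hs, ha)] = (1 - alpha) * Q.get((hs, ha), 0.0) + alpha * (R + gamma * best_next)

    # 2.2.2 (ii): if hs is a qualitative state, transfer the learned value
    # to its landmark state with the decaying step eta (eta <- eta**m)
    if hs in QS:
        l = landmark_of[hs]
        eta = eta_state["eta"]
        for a in lower_actions:
            Q[(l, a)] = (1 - eta) * Q.get((l, a), 0.0) + eta * Q[(hs, ha)]
        eta_state["eta"] = eta ** m

    # 2.2.2 (iii) corresponds to the caller advancing hs <- hs_next, k <- k + 1
    return Q
```

With eta_state = {"eta": 0.1} and m = 1.2 (the settings used in section 4), the landmark updates become progressively weaker as learning proceeds.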

Figure 4 Diagram of a HQL system.

The general setting for the HQL algorithm is shown in Figure 4. An agent exists in an environment described by some set of possible states $HS$. Each time it performs an action $a_t$ in a state $s_t$, the agent receives a reward $R$ that indicates the immediate value of this state-action decision. If the state $s_t$ corresponds to a qualitative state $qs_t$, the agent performs the counterpart steps in the qualitative state-action space and then transfers the experience learned on the upper level to the lower level through the updating of the landmark state. The agent's task is to learn a hybrid control policy $H\pi = \{\pi, q\pi\}$, where $\pi: S \to A$ is for quantitative states $s_i \in S$ and $q\pi: QS \to QA$ is for qualitative states $qs_j \in QS$, thus maximizing the expected sum of rewards.

Theorem 2 (convergence of HQL). Consider a hierarchical Q-learning agent in a nondeterministic hybrid MDP with bounded rewards $(\forall hs, ha)\ |R(hs, ha)| < \infty$. The agent uses a discount factor $\gamma$ such that $0 \leq \gamma < 1$. If

$$\lim_{T \to \infty} \sum_{k=1}^{T} \alpha_k = \infty, \qquad \lim_{T \to \infty} \sum_{k=1}^{T} \alpha_k^2 < \infty, \tag{12}$$

then for all $hs$ and $ha$, $Q_k(hs, ha)$ converges to the optimal state-action value function $Q^*$ as $k \to \infty$, with probability 1.

Proof (Sketch). Bertsekas and Tsitsiklis [12] have proven that for standard Q-learning, using a policy derived from $Q$ (e.g., ε-greedy), if 1) $\lim_{T \to \infty} \sum_{k=1}^{T} \alpha_k = \infty$ and 2) $\lim_{T \to \infty} \sum_{k=1}^{T} \alpha_k^2 < \infty$, then $Q_k \to Q^*$ as $k \to \infty$, with probability 1. As for the proposed HQL algorithm, it is based on the hybrid MDP $\{HS, HA, P, R, V\}$ and is essentially composed of three parts:

1) Qualitative Q-learning on the upper level.

2) Quantitative Q-learning on the lower level.

3) The speeding up of the quantitative learning process via the qualitative learning process.

As the qualitative and quantitative learning both have the Markov property and every state-action pair can be visited infinitely often, their convergence is guaranteed in the same way as for standard Q-learning. As long as the third part does not affect the convergence of the algorithm, Theorem 2 is proven. In fact, for Algorithm 2, the learning step used to update quantitative Q values from qualitative Q values is $\eta^{m^n}$ (for the $n$th update). If $\eta \in [0, 1)$ and $m > 1$, the learning step $\eta^{m^n} \to 0$ quickly as $n$ increases. Hence the third part does not affect the convergence of the algorithm.

Different from Algorithm 1, Algorithm 2 is based on the model of hybrid MDP in that HQL learns on the hybrid state-action space, i.e., the qualitative and quantitative state-action spaces, and keeps a close connection between these two spaces. It can be applied to any RL problem with a hierarchical model or processing structure. Algorithm 2 also presents a new way to design hierarchical RL systems, which differs from the existing hierarchical RL methods in at least two main aspects: 1) the model of Algorithm 2 emphasizes the abstraction of the lower level structure instead of the decomposition of the complex problem, which means that this approach works from the bottom up; by abstraction, we obtain the upper level and the lower level representations, which are combined as a whole for the learning system; 2) in the learning process, the landmarks are only used to connect the upper level and the lower level, which is the key to speeding up the lower level learning; they are not used as subgoals to navigate the agent toward them. Hence the learned policy in an HQL system is an optimal one when the algorithm converges. Although the performance of HQL may be inferior to that of other hierarchical RL methods with well-specified subtasks, HQL is more practical in applications and easy to implement. It is a compromise between optimistic expectations and practicality.

As we know, almost all basic RL algorithms are "flat" methods that treat the state-action space as a huge flat search space, where the paths from the start state to the goal state are very long and future rewards are difficult to propagate backward along these paths [31]. Hence the learning of most RL algorithms is very slow and inefficient, especially for problems with a high-dimensional state-action space. That is, current RL methods do not scale well to complex problems. It is necessary to explore robust RL methodologies with new learning architectures to support efficient algorithms. The most natural idea is "divide-and-conquer", which can be implemented using a hierarchical architecture by decomposing the problem into sub-problems, just like most of the hierarchical methods introduced in section 2.2. But there are also disadvantages or challenges for most of the existing hierarchical RL methods, and two of them are notable. One challenge is how to make the decomposition and specify subtasks; dynamic decomposition or abstraction is still an open question in this research area. The other is that the learned policy may be suboptimal, which usually occurs when the sub-goal is not properly chosen or is not on the path of an optimal policy. Hence proper decomposition and sub-goal selection are very important, and it is usually difficult to find the optimal policy.
As for the proposed HQL method, an alternative solution is presented to design the hierarchical RL system, which emphasizes the abstraction of the lower level structure instead of the decomposition of the complex problem. After abstraction, the upper level is combined with the lower level representation and works as an integral model for the learning system based on the hybrid MDP. In the learning process, the landmarks are used to connect the upper level and the lower level instead of serving as sub-goals, which leads to an optimal policy when the algorithm converges. These characteristics make HQL a practical and efficient approach for complex learning problems. Hence the merit of HQL is twofold. First, the upper level and the lower level learning processes are organized as an integrated whole; the lower level learning is sped up by the upper level learning, which is carried out on a low-dimensional state-action space. Second, the upper level representation is the qualitative abstraction of the lower level and reveals many more causal relations about the problem, supporting human-like qualitative reasoning. For example, when an HQL system is applied to robot navigation, on the upper level the qualitative states and actions may be represented as "room A", "room B", "corridor", "move from room A to B through the corridor", etc. By contrast, the representation on the lower level, which may be a metric map or a group of kinematic equations, cannot reveal the causal relations without further analysis by the designers or users.
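Before moving on to the experiments, a quick numerical check of the decaying landmark-update step η^{m^n} used in the proof of Theorem 2, with η = 0.1 and m = 1.2 as in the experimental settings of section 4:

```python
# Weight of the nth landmark update, eta**(m**n), with eta = 0.1 and m = 1.2
eta, m = 0.1, 1.2
weights = [eta ** (m ** n) for n in range(11)]
print(["%.2e" % w for w in weights])
# The weights drop below 1e-6 after roughly ten landmark updates, so the
# propagation from the upper level cannot disturb convergence asymptotically.
```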

Figure 5 A puzzle problem with cell-to-cell actions.

4 Experiments and results

To demonstrate the effectiveness of the proposed HQL method, simulations on a puzzle problem and a robot navigation problem are performed. The puzzle problem is a classical platform for comparing and assessing various Q-learning algorithms in computer science. Here the HQL algorithm is compared with standard Q-learning and option Q-learning [29] (one of the typical RL methods with hierarchical settings, as introduced in subsection 2.2) on the puzzle problem. The factors that affect the performance of the HQL algorithm, such as the learning rate and the exploration policy, are also studied thoroughly. Then the HQL algorithm is applied to the navigation control of a mobile robot.

4.1 Example 1: A puzzle problem

4.1.1 Experimental setting

The puzzle problem is a 13 × 13 (0–12) gridworld example (Figure 5) and each cell corresponds to a distinct state. From any state the agent can perform one of four primary actions: up, down, left, right; actions that would lead into a blocked cell are not executed. The task of the learning agent is to find an optimal policy that lets the agent move from an initial state S to an appointed goal state G with minimized cost (or maximized rewards). As shown in Figure 5, the initial state S and the goal state G are cell(11, 1) and cell(6, 6), respectively. Before learning, the agent has no knowledge about the environment at all. The agent moves around and learns through trial. Every move is punished with a reward of −1 until the agent finds the goal state; it then receives a reward of 100 and the episode ends. Hence on the lower level, the state set and action set are

State set: S = {cell(i, j)}, i, j = 0, 1, ..., 12;
Action set: A = {up, down, left, right}.

We may apply temporal abstraction to the lower level model to abstract the qualitative information. The temporal abstraction may be accomplished by searching for potential landmarks and decomposing the whole gridworld into a few regions, as shown in Figure 6, which are represented as topological nodes. The label "#" in Figure 6 indicates potential landmarks that can represent special states. Then the state set and action set on the upper level are

Qualitative state set: QS = {Sn}, n = 0, 1, ..., 9;
Qualitative action set: QA = {arcs | (S0, S1), (S1, S8), ..., (S5, S9)},

where {Sn}, n = 0, 1, ..., 9, corresponds to the topological nodes and the action set is made up of all the arcs that connect these nodes.
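As a concrete encoding of this setting, the sketch below defines the 13 × 13 grid, the primitive action set and the reward scheme (−1 per move, +100 at the goal). The coordinate convention for the four actions and the (empty) set of blocked cells are assumptions: the exact wall layout of Figure 5 is not given in the text.

```python
SIZE = 13                                    # 13 x 13 gridworld, cells (0..12, 0..12)
START, GOAL = (11, 1), (6, 6)                # initial state S and goal state G
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
BLOCKED = set()                              # blocked cells of Figure 5; layout not given here

# Landmarks chosen from special environment features (corners, entrances), cf. Figure 6
LANDMARKS = {(1, 1), (1, 11), (11, 1), (11, 11), (3, 3), (3, 9),
             (9, 3), (9, 9), (2, 6), (8, 6)}

def step(state, action):
    """Apply a primitive action; moves into blocked cells or off the grid are not executed."""
    di, dj = ACTIONS[action]
    nxt = (state[0] + di, state[1] + dj)
    if not (0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE) or nxt in BLOCKED:
        nxt = state                          # the action is not executed
    if nxt == GOAL:
        return nxt, 100, True                # reaching G ends the episode
    return nxt, -1, False                    # every other move is punished by -1
```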

Figure 6 Topological information abstraction using landmarks.

In HQL, the relationships between the two levels can be maintained by those landmarks. As shown in Figure 6, the landmarks may be chosen as l = {cell(i, j)} ⊂ S, where (i, j) ∈ {(1, 1), (1, 11), (11, 1), (11, 11), (3, 3), (3, 9), (9, 3), (9, 9), (2, 6), (8, 6)}, and all the landmarks correspond to the related elements in QS.

Abstraction is the key to hierarchical RL systems: it constitutes the upper level learning and provides the mechanism to speed up the lower level learning. For the proposed HQL method, landmarks and sub-goals can both be adopted for the temporal abstraction of the lower level learning model. Generally speaking, sub-goals are usually selected using task decomposition, while landmarks are mostly selected through spatial or temporal abstraction. Since the HQL method lays emphasis on abstraction instead of decomposition, we use the notion of landmark instead of sub-goal, although sometimes the landmarks are also sub-goals. Landmarks are the "natural joints" that break a discrete set of many lower level states (or a continuous set of states) up into qualitatively distinct regions [33, 34]. A landmark value is a symbolic name for a particular state, which serves as a precise boundary for a qualitative region. The qualitative properties of a region depend primarily on its ordinal relations with the landmarks. As shown in Figure 6, the potential landmarks are selected according to special environment features. For example, cell(1, 1), cell(1, 11), cell(11, 1) and cell(11, 11) correspond to corners; cell(2, 6) and cell(8, 6) correspond to entrances. For the proposed HQL approach, the methods of abstraction and landmark selection may differ according to the practical problem. For example, when the HQL algorithm is applied to mobile robot navigation, the landmarks can be selected by searching for distinct environment features, and the abstraction can be accomplished by abstracting the topological information from local environment information (e.g., a metric map), where abstraction and landmark selection can be carried out online while the robot learns and navigates.

4.1.2 Results and analysis

To evaluate the performance of the proposed HQL algorithm, we carry out three groups of experiments. In the first experiment, the performance of HQL is compared with option Q-learning and standard Q-learning under the same parameter settings. In the second experiment, the performance of HQL is studied with different learning rates ranging from 0.01 to 0.10. The tradeoff between exploration and exploitation is also discussed, based on experiments with different parameters for the ε-greedy policy. For all the experiments, the learning step is initialized as η = 0.1 and the scalar constant m = 1.2 for the HQL algorithm.

1) Performance of HQL compared with option Q-learning and standard Q-learning. The performance comparison between HQL, option Q-learning (option QL) and the standard Q-learning (QL) algorithm is shown in Figure 7. The experimental settings for all of them are as follows: exploration policy ε-greedy with ε = 0.01, discount factor γ = 0.9 and learning rate α = 0.01. It is clear that, with the speedup from upper level learning, HQL learns faster than option Q-learning and standard Q-learning under the same experimental settings. From Figure 7, we can see that the HQL algorithm converges to the optimal policy (32 steps from S to G) after about 3400 episodes with about 100 average steps per episode, whereas the option Q-learning algorithm needs about 3600 episodes with about 200 average steps per episode before converging, and the Q-learning algorithm needs about 4300 episodes with about 300 average steps per episode. Although HQL and option QL are both RL methods with hierarchical settings, the HQL algorithm shows better performance because it maintains closer relations between the different learning levels and takes most advantage of the higher level information.

Figure 7 Performance of HQL compared with option QL and QL.

2) Performance of HQL with different learning rates. To evaluate the performance of HQL with different learning rates, we change the learning rate α from 0.01 to 0.20 and the results are shown in Figures 8–10. Figure 8 shows that when the learning rate α is set at 0.05, both the HQL and Q-learning algorithms learn faster than with α = 0.01. But as the learning rate α gets larger, the Q-learning algorithm starts to perform worse. For example, the Q-learning algorithm cannot converge to the optimal policy with the learning rate α = 0.20, while the HQL algorithm still learns steadily and fast. The results show that HQL is more robust than Q-learning across different learning rates, and its parameters are easier to tune so that the learning system converges.

Figures 9 and 10 show more details about the performance of HQL with learning rates ranging from 0.02 to 0.20 and from 0.01 to 0.10, respectively. As shown in Figure 9, when the learning rate α changes from 0.02 to 0.20, the number of episodes needed before converging to the optimal policy changes from about 1800 to 200, which means that with a larger learning rate the learning is sped up and converges quickly. Standard Q-learning, by contrast, tends to diverge with a large learning rate. This performance improvement of HQL is due to the upper level learning process, which works mainly in two aspects: 1) the updating of landmark states takes most advantage of the upper level experience and speeds up the lower level learning; 2) the updating of landmark states also serves as a direction for the lower level learning and prevents it from deviating too much from the right decisions. So the HQL system learns faster than standard Q-learning and is not prone to diverging even when the learning rate becomes larger. This result is encouraging, but we should still be careful about the choice of learning rate, because if the upper level learning is not as good as we hope or the information is incomplete, the performance of HQL may be affected. Nevertheless, the HQL algorithm shows a much more robust performance over a relatively large range of learning rates. Figure 10 shows how the performance of HQL changes as the learning rate α varies from 0.01 to 0.10, which again confirms that a larger learning rate speeds up the learning process and does not affect the convergence of the HQL algorithm.

Figure 8 Performance of HQL and QL with different learning rates.
Figure 9 Performance of HQL with different learning rates.
Figure 10 HQL performance with learning rates from 0.01 to 0.10.

3) Exploration strategy. The exploration strategy is an important factor in the performance of RL algorithms [1, 36, 37]. Exploration is the strategy by which the agent selects a non-optimal action in the current situation to obtain more knowledge about the problem. This strategy allows the learning agent to escape locally optimal policies and to search for the globally optimal one, but excessive exploration will decrease the performance of a learning algorithm. Keeping the balance between exploitation and exploration is a key issue for RL algorithms. In our experiments, we choose the ε-greedy strategy (ε ∈ [0, 1)), where a larger ε corresponds to a larger probability of exploration. The performance of the HQL algorithm with ε ranging from 0.01 to 0.10 is studied in the third group of experiments. The other parameters are set as follows: discount factor γ = 0.9 and learning rate α = 0.01, where a small learning rate is chosen so that the learning process is steady and we can focus on how the performance changes with different ε. Figure 11 shows that there is little difference between the performance of HQL with ε = 0.01 and ε = 0.10. With larger ε, the learning is a little slower and there is still slight oscillation when the learning process is already near the optimal policy, because the agent still explores after the learning converges. The results indicate that HQL algorithms are not as sensitive as standard Q-learning algorithms to the amount of exploration. HQL is safer, avoids getting stuck in a locally optimal policy, and comparatively does not need much exploration.
Hence its performance will be slightly better with little exploration.
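For reference, a minimal sketch of the ε-greedy selection used in these experiments, together with the slow decay of ε toward 0 that, as discussed with Figure 12 below, removes the residual oscillation after convergence; the decay rate is an illustrative assumption.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Select a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def decay_epsilon(epsilon, rate=0.999, minimum=0.0):
    """Slowly decay epsilon toward 0 after each episode (rate is illustrative)."""
    return max(minimum, epsilon * rate)
```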

Figure 11 Performance comparison of HQL with ε = 0.01 and 0.10.
Figure 12 Performance of HQL when ε changes from 0.01 to 0.10.

Figure 13 Robot navigation in a large scale environment.

Figure 12 shows the performance of HQL with ε changing from 0.01 to 0.10. As ε gets larger, the performance of HQL does not change much in terms of learning speed and solution quality, except that there is some oscillation after the algorithm converges, which shows that there is still much exploration after finding the optimal policy, because ε does not decay along with the learning process. Our intent here is to show the relation between the exploration strategy and the performance of HQL. In applications, ε will be set to decay slowly to 0 and there will be no oscillation when the learning converges to the optimal policy.

4.2 Example 2: A robot navigation task

To test the feasibility of HQL in a more practical and complicated case, a simulation environment of 600 × 400 (grid representation) is built for the navigation control of a mobile robot. As shown in Figure 13, in the typical indoor environment, the topological information (landmarks) may be acquired online using environment features, which can be interpreted from the sensor information. The parameter setting for the learning algorithms is as follows: exploration policy ε-greedy with ε = 0.1, discount factor γ = 0.9, learning rate η = 0.01, and all Q values initialized at 0.

Figure 13 shows the experimental results of navigation in a large area with some obstacles. The "Goal" is set in a different room far from the initial state of the robot. The robot has to go through two doors and a narrow corridor before it reaches the goal state. The results indicate that, through learning, the robot navigates to the global goal without being trapped in a local minimum. The navigation control using hierarchical Q-learning makes the mobile robot capable of navigating in a large unknown environment and of adapting to dynamic environments as well. The hierarchical Q-learning algorithm also speeds up the learning process due to its intrinsic structure, which takes most advantage of the different levels.

5 Conclusions

A hierarchical architecture that allows learning, planning and representation of knowledge at multiple levels of temporal abstraction is key to the solution of complex problems, which is a longstanding challenge for AI and cybernetics. This paper aims at speeding up learning and provides an alternative practical approach to accomplishing such a hierarchical setting. We propose an integrated hierarchical Q-learning algorithm based on a hybrid MDP via temporal abstraction. Two examples are given to demonstrate the performance of the proposed approach. The results show that hierarchical Q-learning is superior to option Q-learning and standard Q-learning. Hierarchical Q-learning learns faster on large state-action spaces. Moreover, it is robust over a large range of learning rates and its performance is not sensitive to the exploration strategy. All these characteristics are due to the hierarchical architecture and the mechanism of coordination between the upper level and the lower level learning. In this paper, we emphasize the basic theory of the proposed hierarchical Q-learning and its performance on several typical tasks. An extension to other RL methods and complex applications is an area for future work.

Acknowledgements This work was supported by the National Natural Science Foundation of China (Grant Nos. 60805029, 60703083), the National Creative Research Groups Science Foundation of China (Grant No. 60721062), the Fundamental Research Foundations for the Central Universities (Grant No. 2010QNA5014), the research grants from City University of Hong Kong (Grant Nos. 7008057, 9360131) and in part supported by the Australian Research Council (Grant No. DP1095540).

References
1 Sutton R, Barto A G. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998. 133–156
2 Feng Z Y, Liang L T, Tan L, et al. Q-learning based heterogenous network self-optimization for reconfigurable network with CPC assistance. Sci China Ser F-Inf Sci, 2009, 52: 2360–2368
3 He P, Jagannathan S. Reinforcement learning-based output feedback control of nonlinear systems with input constraints. IEEE Trans Syst Man Cybern Part B-Cybern, 2005, 35: 150–154
4 Kondo T, Ito K. A reinforcement learning with evolutionary state recruitment strategy for autonomous mobile robots control. Robot Auton Syst, 2004, 46: 111–124
5 Morimoto J, Doya K. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robot Auton Syst, 2001, 36: 37–51
6 Chen C, Dong D. Grey system based reactive navigation of mobile robots using reinforcement learning. Int J Innov Comp Inf Control, 2010, 6: 789–800
7 Cheng D Z. Advances in automation and control research in China. Sci China Ser F-Inf Sci, 2009, 52: 1954–1963
8 Yung N H C, Ye C. An intelligent mobile vehicle navigator based on fuzzy logic and reinforcement learning. IEEE Trans Syst Man Cybern Part B-Cybern, 1999, 29: 314–321
9 Montesanto A, Tascini G, Puliti P, et al. Navigation with memory in a partially observable environment. Robot Auton Syst, 2006, 54: 84–94
10 Sutton R. Learning to predict by the methods of temporal difference. Mach Learn, 1988, 3: 9–44
11 Watkins J C H, Dayan P. Q-learning. Mach Learn, 1992, 8: 279–292
12 Bertsekas D P, Tsitsiklis J N. Neuro-dynamic Programming. Belmont: Athena Scientific, 1996. 36–51
13 Chen C, Dong D, Chen Z. Grey reinforcement learning for incomplete information processing. Lect Notes Comput Sci, 2006, 3959: 399–407
14 Dong D, Chen C, Li H, et al. Quantum reinforcement learning. IEEE Trans Syst Man Cybern Part B-Cybern, 2008, 38: 1207–1220
15 Dong D, Chen C, Tarn T J, et al. Incoherent control of quantum systems with wavefunction controllable subspaces via quantum reinforcement learning. IEEE Trans Syst Man Cybern Part B-Cybern, 2008, 38: 957–962


16 Chen C, Dong D, Chen Z. Quantum computation for action selection using reinforcement learning. Int J Quantum Inf, 2006, 4: 1071–1083
17 Dong D, Chen C, Chen Z, et al. Quantum mechanics helps in learning for more intelligent robots. Chin Phys Lett, 2006, 23: 1691–1694
18 Dong D, Chen C, Zhang C, et al. Quantum robot: structure, algorithms and applications. Robotica, 2006, 24: 513–521
19 Jing P, Ronald J W. Increment multi-step Q-learning. Mach Learn, 1996, 22: 283–291
20 Mahadevan S. Average reward reinforcement learning: Foundations, algorithms and empirical results. Mach Learn, 1996, 22: 159–195
21 Althaus P, Christensen H I. Smooth task switching through behavior competition. Robot Auton Syst, 2003, 44: 241–249
22 Hallerdal M, Hallamy J. Behavior selection on a mobile robot using W-learning. In: Hallam B, Floreano D, Hallam J, et al., eds. Proceedings of the Seventh International Conference on the Simulation of Adaptive Behavior on from animals to animates, Edinburgh, UK, 2002. 93–102
23 Wiering M, Schmidhuber J. HQ-Learning. Adapt Behav, 1997, 6: 219–246
24 Barto A G, Mahadevan S. Recent advances in hierarchical reinforcement learning. Discret Event Dyn Syst-Theory Appl, 2003, 13: 41–77
25 Chen C, Chen Z. Reinforcement learning for mobile robot: From reaction to deliberation. J Syst Eng Electron, 2005, 16: 611–617
26 Tsitsiklis J N, VanRoy B. An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Control, 1997, 42: 674–690
27 Sutton R S, McAllester D, Singh S, et al. Policy gradient methods for reinforcement learning with function approximation. Adv Neural Inf Process Syst, 2000, 12: 1057–1063
28 Ormoneit D, Sen S. Kernel-based reinforcement learning. Mach Learn, 2002, 49: 161–178
29 Sutton R, Precup D, Singh S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif Intell, 1999, 112: 181–211
30 Parr P, Russell S. Reinforcement learning with hierarchies of machines. Adv Neural Inf Process Syst, 1998, 10: 1043–1049
31 Dietterich T G. Hierarchical reinforcement learning with the Maxq value function decomposition. J Artif Intell Res, 2000, 13: 227–303
32 Theocharous G. Hierarchical learning and planning in partially observable Markov decision processes. Dissertation for Doctoral Degree. East Lansing: Michigan State University, USA, 2002. 30–72
33 Chen C, Li H, Dong D. Hybrid control for autonomous mobile robot navigation—a hierarchical Q-learning algorithm. IEEE Robot Autom Mag, 2008, 15: 37–47
34 Kuipers B. Qualitative Reasoning: Modeling and Simulation with Incomplete Knowledge. Cambridge: MIT Press, 1994. 1–27
35 Berleant D, Kuipers B. Qualitative and quantitative simulation: Bridging the gap. Artif Intell, 1997, 95: 215–255
36 Guo M Z, Liu Y, Malec J. A new Q-learning algorithm based on the metropolis criterion. IEEE Trans Syst Man Cybern Part B-Cybern, 2004, 34: 2140–2143
37 Dong D, Chen C, Chu J, et al. Robust quantum-inspired reinforcement learning for robot navigation. IEEE-ASME Trans Mechatron, 2011, in press
