RIGA TECHNICAL UNIVERSITY Faculty of Computer Science and Information Technology Institute of Information Technology

Jurijs Čižovs, student of the Management Information Technology doctoral programme

DEVELOPMENT AND STUDY OF A CONTROLLED MARKOV DECISION MODEL OF A DYNAMIC SYSTEM BY MEANS OF DATA MINING TECHNIQUES

Ph.D. Thesis Summary

Scientific supervisor: Dr.habil.sc.comp., Professor A. BORISOVS

Riga 2012

UDK 519.857(043.2) Či 958 d

Čižovs J. Development and Study of a Controlled Markov Decision Model of a Dynamic System by Means of Data Mining Techniques. Ph.D. Thesis Summary. Riga: RTU, 2012. 38 p.

Printed according to the decision of the RTU Information Technology Institute Council meeting, January 10, 2012, protocol No. 12-02.

This work has been partly supported by the European Social Fund within the National Programme "Support for the carrying out of doctoral study programmes and post-doctoral research", project "Support for the development of doctoral studies at Riga Technical University".

ISBN 978-9934-10-272-1

THE PH.D. THESIS HAS BEEN NOMINATED AT RIGA TECHNICAL UNIVERSITY FOR OBTAINING THE DOCTOR'S DEGREE IN ENGINEERING SCIENCE

The defence of the Ph.D. Thesis for obtaining the doctor's degree in Engineering Science will take place on 12 March, 2012 at Riga Technical University, Department of Computer Science and Information Technology, 1/3 Meža Street, Room 202.

OFFICIAL REVIEWERS
Professor, Dr.math. Kārlis Šadurskis, Riga Technical University, Latvia
Professor, Dr.habil.sc.ing. Jevgeņijs Kopitovs, Transport and Telecommunication Institute, Latvia
Professor, Dr.rer.nat.habil. Juri Tolujew, Otto-von-Guericke University Magdeburg, Germany

DECLARATION

I, Jurijs Čižovs, declare that I have written this Thesis, which is submitted for review at Riga Technical University for obtaining a doctor's degree in Engineering Science.

Jurijs Čižovs

……………………………….. signature

Date: 3 February, 2012

The Ph.D. Thesis is written in Latvian. It contains an introduction, 6 chapters, a conclusion, a list of references, 4 appendices, 70 figures, 15 tables and 51 formulae. The list of references contains 83 entries. There are 137 pages in total.


GENERAL DESCRIPTION OF THE THESIS

Introduction

With the development of electronic management of production, trade, finance, etc., new means of structured data storage have appeared which reflect the economic activities of an enterprise. The analysis of the enterprise's past activity data, aimed at developing relevant management decisions, is one of the mechanisms that determine efficient enterprise management. Since the enterprise activity is observed over time, the data form multidimensional time series. Thus, there arises the problem of decision making under uncertainty with data that are multidimensional time series. The uncertainty stems from the fact that it is technically impossible to capture all the internal and external factors affecting the observed parameters.

Topicality of the problem

The mathematical framework of the Markov Decision Process (MDP) has been used successfully to find optimal management strategies in discrete stochastic processes developing over time. There are a number of modifications and enhancements aimed at solving tasks with continuous parameters, partially observable environments, etc. However, the issues related to building an MDP model from data represented as time series remain open for research. The complexity of model building is due to the requirements that the MDP framework imposes on the structure of the data under study. For the observed parameters, realizations of a certain type of time series must be extracted from the relational data and converted into the MDP structure. Extending the framework to work with time series makes it possible to take advantage of the standard MDP framework to make decisions on economic problems in online mode.

Goal of the research

The goal of the doctoral thesis is to develop a decision making framework, based on the Markov Decision Process, for dynamic systems in which the data are represented as time series. To achieve the stated goal, the following tasks have to be solved:
1. To review the current state of MDP framework application to problems expressed in terms of multidimensional time series and to explore the existing approaches.
2. To develop a method based on data mining to build time series and to process and transform them into structures that satisfy the MDP requirements.
3. To develop software for the new method which is based on an agent-oriented architecture and constitutes an intelligent system for decision making and support.
4. To consider the possibility of improving the method by approximating the decision space by means of Artificial Neural Networks.
5. To state the Dynamic Pricing Policy problem as a dynamic programming task (in the context of Markov Decision Processes) in order to obtain an object for practical experiments.
6. To test the obtained intelligent decision making agent system on the Dynamic Pricing Policy problem in order to assess the effectiveness of the developed method on real-world problems.

Object of the research

The object of the research is the advanced decision making method based on the Markov Decision Process. The scope of the framework application is oriented towards dynamic programming tasks that contain data expressed as time series.

Research hypotheses

The study puts forward the following hypotheses:
1. a task in which the data are represented as time series can be viewed as a dynamic programming task and expressed in terms of the Markov Decision Process;
2. the state space and decision space of the Markov Decision Process can be approximated with the help of the artificial neural network approach.

Methods of the research

The advanced decision making method based on the Markov Decision Process is under central consideration in the Ph.D. Thesis. The maximum-likelihood technique, a statistical method for estimating an unknown parameter, is used to construct the probabilistic model within the framework. Data mining techniques, including tools for data normalization, clustering and classification, are employed. The methods of computational intelligence, Reinforcement Learning and Artificial Neural Networks, are used. An agent-oriented architecture is used for the software systems under development.

Scientific novelty

The decision making method based on the Markov Decision Process is of scientific interest. The main characteristic that distinguishes it from a standard Markov Decision Process is the possibility to use it in tasks with multidimensional time series. The approach of approximating the decision space by means of Artificial Neural Networks has been demonstrated on the Dynamic Pricing Policy task (which is characterized by multidimensional time series). Besides, a new approach to building the agent-based system architecture is provided. This approach allows one to avoid the conflict between the definition of the purpose and the agent environment in case the agent does not directly interact with the object of the task to be solved.

Practical use of the Thesis and approbation

The developed decision making method based on the Markov Decision Process is designed for tasks in which the state of the system is described by parameters that change over time rather than by static parameters. The practical application of the intelligent agent system based on the Markov Decision Process was demonstrated on the Dynamic Pricing Policy task. The testing data are actual sales records from the real manufacturing and trade management system 1C:Enterprise v7; they cover a two-year period of sales of manufactured food products. The intelligent agent system based on the Markov Decision Process was implemented within the 1C:Enterprise v7 framework, which allowed direct access to the sales data. The work includes a series of experiments on several subsystems (Artificial Neural Networks, Markov Decision Process) with toy problems. Besides, a series of experiments on the Dynamic Pricing Policy task was carried out in order to numerically evaluate the effectiveness of the improved MDP framework. Individual stages of the work and its results were presented at the following scientific conferences:
1. Chizhov J. An Agent-based Approach to the Dynamic Price Problem, 5th International KES Symposium on Agent and Multi-agent Systems, Agent-Based Optimization KES-AMSTA/ABO'2011, 29 June – 1 July, 2011, Manchester, United Kingdom. Indexed in: SpringerLink, Scopus, ACM DL, DBLP, Io-Port.


2. Chizhov J., Kuleshova G., Borisov A. Manufacturer – Wholesaler System Study Based on Markov Decision Process, 9th International Conference on Application of Fuzzy Systems and Soft Computing, ICAFS 2010, 26 – 27 August, 2010, Prague, Czech Republic.
3. Chizhov J., Kuleshova G., Borisov A. Time Series Clustering Approach for Decision Support, 16th International Multi-conference on Advanced Computer Systems ACS-AISBIS 2009, 14 – 16 October, 2009, Miedzyzdroje, Poland. Indexed in: Scopus, Web of Science.

4. Chizhov J., Zmanovska T., Borisov A. Temporal Data Mining for Identifying Customer Behaviour Patterns, Data Mining in Marketing DMM'2009, 9th Industrial Conference, ICDM 2009, 22 – 24 July, 2009, Leipzig, Germany. Indexed in: DBLP, Io-port.net.

5. Chizhov J., Borisov A. Applying Q-Learning to Non-Markovian Environments, First International Conference on Agents and Artificial Intelligence (ICAART 2009), 19 – 21 January, 2009, Porto, Portugal. Indexed in: Engineering Village2, ISI Web of Knowledge, SCOPUS, DBLP, Io-port.net.

6. Chizhov J., Zmanovska T., Borisov A. Ambiguous States Determining in Non-Markovian Environments, RTU 49th International Scientific Conference, Subsection "Information Technology and Management Science", 13 October, 2008, Riga, Latvia. Indexed in: EBSCO.

7. Chizhov J. Particulars of Neural Networks Applying in Reinforcement Learning, 14th International Conference on Soft Computing "MENDEL 2008", 18 – 20 June, 2008, Brno University of Technology, Brno, Czech Republic. Indexed in: ISI Web of Knowledge, INSPEC.

8. Chizhov J. Reinforcement Learning with Function Approximation: Survey and Practice Experience, International Conference on Modeling of Business, Industrial and Transport Systems "MBITS'08", 7 – 10 May, 2008, Transport and Telecommunication Institute, Riga, Latvia. Indexed in: ISI Web of Knowledge.

9. Chizhov J. Software Agent Developing: a Practical Experience, RTU 48th International Scientific Conference, Subsection "Information Technology and Management Science", 12 October, 2007, Riga Technical University, Riga.
10. Chizhov J., Borisov A. Increasing the Effectiveness of Reinforcement Learning by Modifying the Procedure of Q-table Values Update, Fourth International Conference on Soft Computing, Computing with Words and Perceptions in System Analysis, Decision and Control "ICSCCW – 2007", 27 – 28 August, 2007, Antalya, Turkey.
11. Chizhov J. Agent Control in World with Non-Markovian Property, EWSCS'07: Estonian Winter School in Computer Science, 4 – 9 March, 2007, Palmse, Estonia.

Publications

Fragments of the Ph.D. Thesis, as well as its results, are published in 11 scientific articles. Most of the publications are indexed by international digital libraries (Springer, ISI WEB, SCOPUS, DBLP, Io-port.net). The list of publications is included in the complete list of literature provided at the end of the author's abstract.

Main results of the Ph.D. Thesis

The decision making method based on the MDP, ensuring MDP model building in tasks that contain data presented as multidimensional time series, was developed as part of the doctoral work. The method was tested in a series of experiments.

As a result, numerical estimates were obtained which allow us to conclude that the method is able to build an MDP model that adequately reflects the learning sample. The following tasks were solved and results obtained:
1. The review of mathematical methods based on Markov Decision Processes allows one to conclude that MDP and Reinforcement Learning can be considered effective methods for modelling dynamic systems; some key problems of their use in tasks that contain data as multidimensional time series were defined.
2. The analysis of several computational intelligence techniques (ANN, RL, agent-based systems, methods of Data Mining, etc.) allowed the main features of the developed method to be described (pipeline organization of the data transformation, timely updates of the MDP model, and so on).
3. A special agent-based architecture was developed to avoid an incorrect description of the interaction of the intelligent system with the environment.
4. The approximation of the state space and decision space of the MDP with the use of an Artificial Neural Network was implemented. The efficiency of the approach was demonstrated on a toy problem.
5. An intermediate structure (the behaviour profile of a studied value) for storing and processing the identified patterns of the investigated time series was formulated. The behaviour profiles of the studied values are used to create the MDP model.
6. An approach using different criteria for clustering time series (Euclidean distance and shape-based similarity), according to the semantic load of each studied variable, was suggested.
7. For the purpose of testing the method, the Dynamic Pricing Policy problem was formulated within the MDP framework, and the software for carrying out the experiments was developed.
8. A series of experiments to quantify the effectiveness of MDP model building was carried out. The assessment was based on comparing the resulting model with the training sample, as well as on applying the model to data outside the training set.

Structure and contents of the Thesis

The Ph.D. Thesis consists of an introduction, 6 chapters, a conclusion, a list of references and 4 appendices. It comprises 137 pages and includes 70 figures and 15 tables. There are 83 sources in the list of references. The structure of the Ph.D. Thesis is the following:
INTRODUCTION – the terminology used in the research is introduced, and the subject of the research, the objective of the work and the tasks are formulated.
CHAPTER 1: THE MULTISTEP PROCESS OF DECISION MAKING IN DYNAMIC SYSTEMS – the analysis of the multistep decision making method based on Markov Decision Processes is provided. The key advantages, as well as the disadvantages that motivate the development of an improved MDP apparatus, are described.
CHAPTER 2: THE REVIEW OF COMPUTATIONAL INTELLIGENCE METHODS APPLIED TO DYNAMIC SYSTEMS – the chapter analyses a variety of methods from the field of computational intelligence with regard to their use in the MDP apparatus being developed.
CHAPTER 3: THE DEVELOPMENT OF THE DATA MINING BASED SYSTEM FOR BUILDING THE DYNAMIC SYSTEM'S MDP MODEL – the central chapter is devoted to the development of an MDP-based method able to generate a model from data represented by multidimensional time series.
CHAPTER 4: THE APPLICATION OF MDP MODEL BUILDING USING METHODS OF DATA MINING TO THE PROBLEM OF DYNAMIC PRICING POLICY – in this chapter the Dynamic Pricing Policy task is formulated in the context of the Markov Decision Process for carrying out the experiments.
CHAPTER 5: THE PERFORMANCE OF EXPERIMENTS REGARDING MODEL BUILDING IN THE PROBLEM OF DYNAMIC PRICING POLICY – the experiments aimed at obtaining numerical results on the efficiency of the new method are described, along with the developed software.
CHAPTER 6: THE ANALYSIS OF THE RESULTS AND CONCLUSIONS – the final chapter is devoted to the analysis of the findings. Directions for further research are defined.
APPENDIX – includes the structures of intermediate data, fragments of the MDP model in XML format, and algorithms used in the research.


SUMMARY OF THESIS CHAPTERS

Chapter 1 (Multistep process of decision making in dynamic systems)

The first chapter deals with methods of solving dynamic tasks whose solution principle consists in the consecutive performance of operations aimed at achieving the result. Dynamic Programming, as a fundamental apparatus, underlies the Markov Decision Process framework, which in its turn gives rise to its own family of methods. However, there are certain difficulties in using such methods in modern economic applications. The main problem is the development of the model, which serves as the environment in which multistep decision making methods operate. To solve the problem of model development, the possibility of applying a regularity mining procedure to dynamic tasks is investigated.

Optimal decision making in modern real management problems cannot be considered from the viewpoint of gaining a short-term or incidental profit. Efficient management in economic applications thus implies achieving the maximum total value of the observed parameter (for example, profit) within a limited or unlimited number of phases. Thereby, there exists a class of tasks whose solution is achieved not at once but gradually, step by step. In other words, decision making is considered not as a single act, but as a process consisting of many phases [76].

Dynamic Programming (DP) is a mathematical apparatus enabling the optimal planning of multistep controlled processes and of processes that depend on time [78]. A plan is a finite sequence of decisions made. The fundamental property of DP is that the decisions under development are not isolated from each other [40]; they are coordinated with each other in order to achieve the goal state.

The Markov Decision Process (MDP) [17, 36, 54] extends the Markov chain by introducing the concept of a controlling influence. The probability of transition from one state to another is defined as a probability conditioned on the chosen influence, or decision. Figure 1 shows an example of a transition graph for a Markov Decision Process in the classical toy problem of garbage collection by a robot. Each black dot is an action available in the corresponding state. The action taken determines the possible further transitions.

[Figure 1 shows the robot's states connected through the actions Wait, Charge and Clean, with each edge labelled by a transition probability p and a reinforcement r. For the given model it is necessary to find a garbage collection policy that gains the maximum reward and does not discharge the robot's power supply. Legend: p – transition probability, r – reinforcement value.]

Figure 1. Example of a Markov Decision Process transition graph

In fact, the matrices of transition probabilities P and of rewards (or reinforcements) R describe the "physics" of the process, from which an appropriate garbage collection policy is calculated. Thus, even minor changes of individual matrix values may lead to fundamentally different resulting policies. In this chapter the formal definition of the MDP and the structure of its tuple are given by (1):

$\langle S, A, P, R \rangle,$  (1)

where:
S is the finite set of discrete states, S = {s1, s2, ..., s|S|}. Each state st reflects the current value of a vector of observable parameters, st = (x1, x2, ..., x|s|). Thus, the state is all the information available about the dynamic system at a certain moment of time;
A is the finite set of controlling impacts (actions), A = {a1, a2, ..., a|A|}. Usually an action ai immediately changes one or several parameters xi of the state st;
P is the state transition function, which determines the probability that action a, taken in state s at the time moment t, will transfer the system to state s' at the time moment t+1. It is a mapping of the kind P : S × A × S → [0, 1];
R is the reward function, which determines the expected reward obtained immediately after the transition to state s' from state s as a result of action a. It is a mapping of the kind R : S × A × S → ℝ. In fact, it determines the goal state which is to be achieved.

The solution of an MDP is the optimal action policy π* that defines the appropriate action ai for each state st; the policy is thus a mapping π : S → A. The key advantages of MDPs are their convergence towards the globally optimal policy and the simple structure of the model. Explaining the obtained action policy π* (i.e., the actual solution of the task) is uncomplicated in comparison with ANN solutions. The policy can also be expressed using different methods of knowledge representation, for example decision trees or decision tables [67]. A significant disadvantage of the MDP is the absence of mechanisms for automatic model building. The solution, a policy maximizing the expected discounted sum of rewards, can be obtained if the transition matrix P and the reward function R, which form the base of the model, are known; their construction is rather difficult in real tasks.

On the basis of the study and analysis of MDP and RL models, their main characteristics in the problems of studying stochastic processes and building an appropriate model are presented in Table 1. The standard approach to working with non-Markov systems is to increase the memory so as to keep the prehistory of transitions. Based on this principle, the approach of defining states using time series for creating the Markov model is considered.

Table 1
Advantages and disadvantages of models based on MDP

Markov Decision Process
Advantages:
- global convergence;
- building the policy taking into account the delay of rewards;
- simple methods of calculating the policy.
Disadvantages:
- prior knowledge of the system model is needed;
- the complexity of implementing the method in non-Markov systems.

Reinforcement Learning
Advantages:
- global convergence;
- the system model is not needed (unsupervised learning case);
- the possibility to work in systems not possessing the Markov property;
- building the policy taking into account the delay of rewards.
Disadvantages:
- developing the model through exploration is not allowed in a number of practical tasks;
- the exploitation-exploration trade-off exists;
- the complexity of application in non-Markov systems.
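As an illustration of the "simple methods of calculating the policy" listed in Table 1, below is a minimal value-iteration sketch built directly on the tuple (1). It is an illustrative reading of the definitions above rather than the thesis software: the state and action counts, the rewards and the discount factor are assumptions, and the reward is taken to depend on the state only, matching the Bellman equations used later in this summary.

```python
# Minimal sketch: <S, A, P, R> as NumPy arrays and value iteration for pi*.
# All sizes and values are illustrative assumptions.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9

# P[s, a, s'] - transition probabilities (each row sums to 1); R[s] - reward
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = np.array([0.0, 0.0, -0.04, 1.0])      # e.g. state 3 is the goal state

V = np.zeros(n_states)
while True:
    # Bellman backup: Q(s, a) = R(s) + gamma * sum_s' P(s'|s, a) * V(s')
    Q = R[:, None] + gamma * (P @ V)      # shape (n_states, n_actions)
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:    # convergence threshold (assumed)
        break
    V = V_new

policy = Q.argmax(axis=1)                 # pi*(s) = argmax_a Q(s, a)
print(V, policy)
```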

One of the disadvantages of the MDP is also the complexity of the policy search in so-called non-Markov systems, i.e., dynamic systems which do not satisfy the Markov property. The development of a process in a non-Markov system depends not only on the current state, but also on the sequence of states that took place in the past. Solving non-Markov problems is possible through a description of the process that provides a memory mechanism. It is also implied that the statistical properties in the future depend on the character of the process evolution in the past. Such an approach complicates the solution and practically deprives it of applicability in real tasks.

The review of the mathematical apparatus based on Markov processes allows the conclusion that the Markov Decision Process and Reinforcement Learning can be considered effective methods for modelling dynamic systems. The research into the contemporary state of MDP and RL methods revealed the key problems of their application in real tasks of economics, management, etc. The analysis of the detected problems made it possible to formulate an approach, based on data mining methods, to improving the MDP framework, which enables its use in modern tasks described in the setting of a non-Markov dynamic system.

Chapter 2 (The review of Computational Intelligence methods applied to dynamic systems)

This chapter substantiates the necessity of applying Computational Intelligence (CI) methods in order to develop an effective decision making system. Different architectures of agent systems are analyzed in order to develop an agent architecture specific to the dynamic system [5]. The agent-based approach, in its turn, makes it possible to represent the software system and its interaction with the real-world task being solved in a way that is natural for a human. Experiments on model dynamic systems involving Artificial Neural Networks, with the aim of approximating the decision spaces, are performed as well.

There are several formal definitions of Computational Intelligence. In [40], CI is defined as a set of computational models and tools providing intelligent adaptation: immediate perception of primary sensor data, their processing involving parallelization and transfer of the task, and the creation of a safe and timely responsive system with a high level of resiliency. Usually the immediate processing of "raw" data by intelligent software instruments is impossible. This is determined by the stringent requirements of the algorithms on the data structure. Thus, for instance, MDP models work with a fixed data structure determining the state. A mediator of some kind is therefore needed between the physical data carrier and any intelligent method [5]. This, in its turn, ensures the pipelining of the task. An example of a system providing immediate interaction of intelligent tools with the task is represented in Figure 2.

[Figure 2 depicts the pipeline: raw data from the database pass through data preprocessing into structured data for the intelligence tools; the resulting decision reaches the expert through the interface, and the chosen action is applied back to the database.]

Figure 2. The way of immediate interaction

The Ph.D. Thesis also considers an indirect impact of the intelligent system on the physical source of the task (in this case, a database), which is typical for tasks in which erroneous decisions incur high expenses of any kind. Many methods of Computational Intelligence are involved in the developed approach to solving dynamic systems. The methods used in this research, and their position within the family of Computational Intelligence methods [10], are represented in Figure 3, based on the classification given in [40].

[Figure 3 lists, under Computational Intelligence: Granular Computing; Neuro-computing with supervised learning (ANN), unsupervised learning and Reinforcement Learning (RL, LCS, ...); Evolutionary Computing; and Artificial Life.]

Figure 3. A fragment of the Computational Intelligence family tree

In this work we take as a basis the definition of an agent suggested in [28]: "An autonomous agent is a system situated within and a part of an environment that senses that environment and acts on it, over time, in pursuit of its own agenda and so as to effect what it senses in the future." There are plenty of agent types that meet this definition either partly or fully. Depending on the properties that agents possess, several classes of agents are distinguished [11]. The most typical among them are: programmable agents (reactive agents, reflexive agents [58]), learning agents and planning agents [37]. The properties that an agent of some class can possess [28] are given in Table 2.

Table 2
The properties of software agents

Property – Description
reactive (sensing and acting) – responds in a timely fashion to changes in the environment
autonomous – exercises control over its own actions
goal-oriented – does not simply act in response to the environment
temporally continuous – is a continuously running process
communicative – communicates with other agents, perhaps including people
learning – changes its behaviour based on its previous experience
mobile – able to transport itself from one machine to another
flexible – actions are not scripted
character – believable "personality" and emotional state
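To make the properties in Table 2 concrete, here is a minimal, hypothetical agent skeleton (not taken from the thesis) showing the reactive, autonomous and temporally continuous properties as a sense-decide-act loop; the environment interface is an assumption.

```python
# Hypothetical sketch: a sense-decide-act loop exhibiting the "reactive",
# "autonomous" and "temporally continuous" properties of Table 2.
class Environment:
    def sense(self):               # what the agent's sensors return
        return "idle"
    def act(self, action):         # the agent's effect on the environment
        print("performed:", action)

class Agent:
    def __init__(self, env):
        self.env = env
        self.experience = []       # kept so a "learning" agent could adapt

    def run(self, steps=3):        # temporally continuous: a running process
        for _ in range(steps):
            state = self.env.sense()        # reactive: sensing
            action = self.decide(state)     # autonomous: its own control
            self.env.act(action)            # reactive: acting
            self.experience.append((state, action))

    def decide(self, state):       # decisions are not scripted -> "flexible"
        return "wait" if state == "idle" else "work"

Agent(Environment()).run()
```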

One of the disadvantages of Reinforcement Learning is the exponential growth of the problem space with each new dimension [58, 62]. The most common methods of dealing with this problem (known as "the curse of dimensionality") are then considered. Two main approaches to working with a large number of states are examined in the chapter: the approximation of the value function and gradient policy methods. One of the methods belonging to nonlinear approximation is the framework of Artificial Neural Networks (ANN). With the aim of analysing ANNs as function approximators, a multilayer perceptron trained by the error back propagation method is implemented in this thesis. The existing commercial and freely distributed software implementations of Artificial Neural Networks are reviewed as well. The plan of experiments includes the following tasks:
1. to produce the author's own implementation of an ANN and to study the efficiency of the network on the example of approximating a one-dimensional stochastic process;
2. to compare the obtained approximation results with the results of existing ANN packages;
3. to implement the approximation of the state space in Reinforcement Learning, using a toy problem for demonstration purposes.

Within the framework of the first experiment, the Artificial Neural Network demonstrates good learning results. Using three hidden layers with 70 neurons in each ensures a mean-square error of ems = 0.0013 (see Figure 4). Such a level of error ensures sufficient precision for modelling a one-dimensional stochastic process consisting of 30 observations.

Figure 4. Function approximated by an ANN with three hidden layers

The comparison of the results with two common ANN packages (NeuroSolutions 6.0 and Multiple Back-Propagation v2.2.2) was performed within the framework of the second experiment. For the simplest network architecture, NeuroSolutions 6.0 approximates with a mean-square error of ems = 0.00943. The Multiple Back-Propagation package ensures convergence with an error of ems = 0.0012. The comparison allows us to conclude that the precision achieved matches the precision of the third-party packages. This allows us to use our own implementation of a neural network in the subsequent experiments.
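The first experiment can be sketched as follows. This is not the author's ANN implementation; scikit-learn's MLPRegressor stands in for it, the 3 × 70 architecture mirrors the one reported above, and the toy data and the remaining hyperparameters are assumptions.

```python
# Sketch: approximating a one-dimensional stochastic process (30 observations)
# with a multilayer perceptron trained by error back propagation.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(30, dtype=float)
y = np.cumsum(rng.normal(size=30))               # toy stochastic process
y = (y - y.min()) / (y.max() - y.min())          # normalize to [0, 1]

X = (t / t.max()).reshape(-1, 1)
net = MLPRegressor(hidden_layer_sizes=(70, 70, 70),   # 3 hidden layers x 70
                   activation="logistic", solver="adam",
                   max_iter=20000, tol=1e-7, random_state=0)
net.fit(X, y)

e_ms = np.mean((net.predict(X) - y) ** 2)        # mean-square error
print(f"e_ms = {e_ms:.5f}")
```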

Figure 5. The 3-step algorithm of approximated model building: (1) rough RL with a tabular Q-function; (2) approximation of the Q-function by an ANN; (3) final learning with RL + ANN

The idea of approximating the Q-function in Reinforcement Learning using an Artificial Neural Network is realized in the third experiment. The key problems of implementing the method are presented; to overcome them, a new ANN learning approach (embedded in RL) is suggested in the Ph.D. Thesis (see Figure 5). To test the approach we use the mountain car toy problem [9]. In this problem the gravitational force exceeds the engine power: it is impossible to climb the hill directly from a state of rest. The only solution is to develop a strategy of rolling from one slope to the other in order to accumulate supplementary inertia. The problem demonstrates the necessity of repeatedly moving away from the peak in order to reach it later. The available actions are: engine idle (0), acceleration forward (+1) and acceleration backward (-1). The optimal policy for the toy problem is obtained using discrete RL with a tabular Q-function. The surface of the Q-function in the state space is represented in Figure 6. The action dimension is omitted; the value of the optimal action at each point of the space is shown.

[Figure 6 plots the surface of Q* over the state space (position from -1.2 to 0.6, speed from -0.07 to 0.07). Settings: Q-axis scaling factor 0.1; space size 70 × 80; ≈ 8,000 episodes; probability of a random action ε = 0.1; discount-rate parameter 0.99; learning rate 0.3; trace-decay parameter λ = 0.92.]

Figure 6. Example of the optimal policy Q* = max_a Q(s, a)

A similar surface must be obtained as the result of the three-step algorithm. The first step is the development of the first approximation (a rough policy). It was established experimentally that a grid of 20 × 20 cells is a sufficient discretization of the space for this purpose. The second step consists in transferring the intermediate Q-function into the ANN. Now the precision of the function depends on the "capacity" of the network. A series of experiments showed that 6 hidden layers with 110 neurons each are sufficient to reproduce the surface of the Q-function [9]. When the changes of the network coefficients approach zero (Δeij → 0), we can move on to the third step: training the network in the mode of interaction with the environment through the RL framework. The experiments demonstrate that after about ten iterations of learning the network "forgets" the previously learned examples unless they are supplied for training continuously together with the currently trained part of the space of training examples. Taking the prior steps into account, the rough policy is a matrix of reference points. Such a matrix supports the "memory" of the neural network, preventing it from forgetting its reaction to states rarely encountered during learning in the environment [9]. As a result of learning, the Q-function surface shown in Figure 7 is obtained in the third step. The obtained policy looks substantially smoother than its tabular analogue. On the one hand, this reduces the precision; on the other hand, it makes it possible to work in environments with continuous parameters. The experiments with the toy problem showed that the algorithm avoids the problem of an absent initial training set and permits functioning in continuous environments [9]. However, the learning time increases.
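A hedged sketch of steps 2 and 3 of the algorithm in Figure 5 is given below: the rough tabular Q-function is distilled into a network, and during further learning the table's cells are replayed as the matrix of reference points so that the network does not "forget" rarely visited states. The 20 × 20 grid and the 6 × 110 architecture follow the text; everything else (names, data, MLPRegressor standing in for the author's network) is an assumption.

```python
# Sketch of Figure 5, steps 2-3: fit an ANN to a tabular Q-function, then keep
# the table's cells as reference points replayed during every further update.
import numpy as np
from sklearn.neural_network import MLPRegressor

n_pos, n_vel, n_actions = 20, 20, 3                # rough 20 x 20 discretization
rng = np.random.default_rng(0)
Q_table = rng.uniform(-1.0, 0.0, (n_pos, n_vel, n_actions))   # from step 1

# Step 2: transfer the intermediate Q-function into the network.
grid = np.array([(p / n_pos, v / n_vel, a)
                 for p in range(n_pos)
                 for v in range(n_vel)
                 for a in range(n_actions)])
targets = Q_table.reshape(-1)                      # same (p, v, a) ordering
net = MLPRegressor(hidden_layer_sizes=(110,) * 6,  # 6 hidden layers x 110
                   max_iter=2000, random_state=0)
net.fit(grid, targets)

# Step 3: each online RL update is trained together with the reference points,
# which act as the network's "memory" and prevent catastrophic forgetting.
def online_update(x_new, q_new):
    X = np.vstack([grid, x_new])
    y = np.concatenate([targets, q_new])
    net.partial_fit(X, y)

online_update(np.array([[0.5, 0.5, 1]]), np.array([-0.2]))
```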


Figure 7. Q-function for the action "throttle", obtained in the third step

It can be concluded that ANN learning by the error back propagation method provides a powerful tool for approximating functions given in tabular form [34]. The experiments performed on the toy problem demonstrate that this property can be successfully used for approximating the Q-function in the Reinforcement Learning algorithm. The consideration of the concepts of Computational Intelligence and the corresponding methods described in this chapter allows us to draw the following conclusions:
1. the analysis of several concepts of Computational Intelligence, as well as of its individual methods (ANN, RL, the agent approach, methods of data mining and others), allowed us to describe the main features of the architecture under development (conveyor organization of the data transformation flow, timely response to events, etc.), as well as the applied methods of intelligent computing;
2. the review of expressing Markov Decision Processes through software agents demonstrates a number of suitable methods and a wide range of tasks being solved; at the same time, problems linked to the pipelining of the problem require the development of an agent system architecture of a special form;
3. following the principle of method synergy, Artificial Neural Networks are considered in this work as an approach to approximating the state space of the Markov Decision Process.

Chapter 3 (The development of the Data Mining based system for building the dynamic system's MDP model)

The third chapter develops and describes the mathematical basis of an intelligent system for a decision making task whose data are expressed using multidimensional time series structures. The decision is based on finding hidden regularities between so-called families of time series. A series of Data Mining technologies is used for that purpose. The concepts of classes and profiles of the dynamic behaviour of observed variables are introduced for storing and operatively processing the mined data. The behaviour profiles are interpreted as single-step Markov transitions, and these Markov transitions are used to create the model of the process under investigation. Within the framework of this chapter, it is necessary to perform the following (in order to solve the set task):


1. to review the current situation in the field of applying the method to tasks expressed through multidimensional time series, and to study the possible existing approaches;
2. to develop data mining based methods for building the time series and for processing and converting them into data structures (states of the system) compatible with the Markov Decision Process.

The research and development of the method for building a Markov model from data expressed through time series provides the solution of the following problems:
- cleaning the raw data of the observed process, construction of the time series;
- formalization of the time series as the environment needed for building the state space;
- composition of a transition graph reflecting the generalized behaviour of the dynamics of the observed variable values;
- development of optimal policies and their use.

Thereby, the approach is based on the generalization of individual observations, the building of the state space and the further development of the model. In general, the functioning of a system based on the methods being developed has to contain the following steps:
- to "acquire" the existing observations regarding the development of the process under investigation:
  - to identify the regularities of transitions into one or another state under the appropriate influence;
  - to build the state transition graph representing the general model of the process being researched;
- to build, on the basis of the model, the particular realization of the Markov Decision Process for the sought solution at the current parameters;
- to develop the action policy π for the particular realization of the MDP;
- to invite the expert to perform the action according to the developed policy π at the given parameters;
- to accomplish the permanent renewal of the environment, the transition graph and the policy upon acquiring new data concerning the actual progress of the process.

The data regarding the development of the MDP model building method are provided in the chapter. The method of transforming data expressed through time series into the structure of the Markov Decision Process is suggested and described. For that purpose, the concept of the behaviour profile of observed variable values is introduced. The criteria for comparing time series are reviewed.

Description of the approach. One of the directions of using Markov Decision Process models is research into the dependence of the dynamics of one variable on the others. The final objective is the application of the obtained model for timely decision making concerning the performed activity, with the aim of achieving the desired indicators of the dynamic system. The main problem, as with most mathematical models, is building the model. As in [33], the whole set of observations is considered a source of behaviour patterns (profiles); but, unlike [33], in this work the model is based not on a particular realization of a time series but on many realizations corresponding to different combinations of the parameter values of the time series. Clustering the realizations of time series allows us to consider new operations with transition models: the minimum needed supplement of the transition model using fragments of other analogous models. This operation makes it possible to extend the existing transition model in such a way that, with a certain probability, the model will allow the system to move into a state that was not stipulated during the learning phase (model building).
In general terms, the building of a dynamic system model presupposes the investigation of the particular regularities that took place in the development of the processes under observation, their generalization and their expression in some structure [4, 7]. The method of dynamic system model building being developed follows the described approach. The mathematical framework of the Markov Decision Process is considered as the model. The state transition graph is used as a graphical expression of the model. The method of model building (see Figure 8) is divided into the following main stages [6]:
1) to process the raw data and to build the time series T;
2) to identify the general regularities of the development of the processes and to build the behaviour profiles П of the observed variables;
3) to find the general transitions in the dynamic behaviour profiles and to build the model of the processes' evolution (the transition graph).

[Figure 8 depicts the pipeline: the database supplies observations; the observations become time series; the time series become behaviour profiles; the profiles are generalized into a transition graph with states s1–s4 and transition probabilities p1–p6.]

Figure 8. The main stages of building the Markov transition models

The key moment of transforming time series into the MDP structure is the interpretation of the dynamic behaviour profile with respect to forming the state matrix S [6]. In the classical treatment of the Markov Decision Process, the concept of state S is connected with the theory of software intelligent agents [62]: all the information available to an agent, obtained by the agent's sensors from the environment at a certain moment of time, is called a state. For a task whose data are represented by time series, a special approach to determining the states of the dynamic system is offered [6]. The key difference is that not static variable values but the corresponding time series are considered; in other words, the evolution dynamics of the variables being researched is considered. The following interpretation of the dynamic behaviour profile serves this aim. Let the behaviour profile, obtained as a result of time series clustering, consist of two centroids, v1 and v2 (see Figure 9).

[Figure 9 shows the centroids v1 and v2 over time t, split at the event moment ta. Legend: a – action; s0 – initial state, t = [1; ta); s1 – transition state, t = [ta; tmax].]

Figure 9. Sample behaviour profile of sales volumes and sale prices

Thus, we identify a structure in the profile consisting of three elements: a) the dynamics of the variables v1 and v2 before the event ei; b) the event ei, happening at the time moment ta; c) the dynamics of the variables v1 and v2 after the event ei. The area of the observed variables' evolution before the event is interpreted as the initial state s0 ∈ S (see Figure 9, area t = [1; ta)); the event ei as the action a ∈ A; and the area of the observed variables' evolution after the event as the transition state s1 ∈ S (the area t = [ta; tmax]). The dynamic behaviour profile П can then be considered a single deterministic Markov transition (see Figure 10) [6]. Taking into account that the transition is built on actual observations, the value of the transition probability p(s1|s0, a0) is, for the time being, equal to one. The states of the set S are marked with white circles, and the action a0, as a result of which the deterministic transition happens, with a grey circle. After assigning the corresponding state identifier to each fragment of each profile, it becomes possible to work not with the time series but with the corresponding states, which is a necessary condition for working with the Markov Decision Process.
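A minimal sketch of this interpretation is given below: a behaviour profile is split at the event time ta into the initial state s0 and the transition state s1, giving the single deterministic Markov transition of Figure 10. The types and names are illustrative, not from the thesis software.

```python
# Sketch: convert a behaviour profile (centroids v1, v2 and an event at ta)
# into one deterministic Markov transition s0 --a--> s1.
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    s0: tuple       # dynamics of v1, v2 on t in [1; ta)
    action: str     # the event e_i at the moment ta (e.g. a price change)
    s1: tuple       # dynamics of v1, v2 on t in [ta; tmax]

def profile_to_transition(v1, v2, ta, action):
    s0 = (tuple(v1[:ta]), tuple(v2[:ta]))
    s1 = (tuple(v1[ta:]), tuple(v2[ta:]))
    return Transition(s0, action, s1)

tr = profile_to_transition([3, 5, 7], [4.75, 4.65, 4.85], ta=2, action="price_up")
print(tr.s0, "--", tr.action, "->", tr.s1)
```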


Figure 10. Interpretation of the sales profile as the deterministic transition from s0 to s1 under action a0

Description of the goal state and the problem. Defining the goal state at the stage of building the set of states is not essential. However, as soon as the model is built, the initial and the goal states must be defined in order to build an action policy. Accordingly, we describe how the goal state of the Markov Decision Process is interpreted in a problem expressed through time series. The goal state is an appropriate proportion of the observed variables vi meeting the requirements of an expert. For example, consider the observed variables in a car sales problem: v1, the volume of sales, and v2, the sale price. Then, for some fixed parameters of the space Ψ, a state can be considered a goal state if the time series of sale volumes v1 and of sale prices v2 meet the given values.

The action matrix A and the reward matrix R. When creating the model of the researched system, the action matrix may be obtained in several ways, for example:
1) the set of actions A contains only the actions present in the set of profiles П;
2) the set of actions A contains all the allowed values of change of the controlled variable vi in some range, irrespective of the presence of the specific value in the profiles П.

Analogously to the goal states, without going into the method of building the reward matrix R, let us consider its definition for a problem expressed through time series. The reward matrix R is commensurate with the number of states and, for each state, keeps the value of the reward gained by the system (or the agent) upon reaching the state si. For example, let the reward be 1 if the goal state sgoal is achieved, and -0.04 in the opposite case. The matrix R can then be represented by the function:

$R(s_i) = \begin{cases} 1.0, & s_i \in S_{goal} \\ -0.04, & s_i \notin S_{goal} \end{cases}$  (2)
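Expression (2) reads directly as a small function; keeping the goal states in a set is an illustrative assumption.

```python
# Direct reading of (2): reward 1.0 in a goal state, -0.04 otherwise.
def reward(s_i, goal_states):
    return 1.0 if s_i in goal_states else -0.04

print(reward("s3", {"s3"}), reward("s1", {"s3"}))   # 1.0 -0.04
```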

In this way, to provide the system with the reward matrix R, it is enough to determine the set of goal states Sgoal.

The transition probabilities graph P. The central element of the Markov Decision Process is the transition graph, in which it is necessary to find the optimal action policy π*. Building the transition graph is the most complicated part of expressing the researched task in terms of the MDP. In many economic tasks the transition model cannot be expressed analytically so as to be correct for all states, but it can be obtained in the form of a table based on the generalization of actual transitions. The tabular representation makes it possible to describe the resulting state for each state-action combination. Thus, the transition graph is built for the whole space Ψ (based on the definition of the state). The transition graph is built in the process of generalizing the atomic transitions (the dynamic behaviour profiles П). The generalization is the calculation of the probability of the transition from state s to state s' when performing action a, and it is based on the number of actual observations of this transition (3). Since the environment is fully observable (a training set is available), it is reasonable to use a statistical approach to estimating the unknown parameter for the calculation of the transition matrix P. In this research we consider one of the simplest approaches, the maximum likelihood estimation. The ratio of the number of factual observations of each transition to the total number of transitions from the considered state is expressed as follows:

$P(s' \mid s, a) = \frac{N(s, a, s')}{\sum_{s''} N(s, a, s'')}, \qquad \sum_{s'} P(s' \mid s, a) = 1,$  (3)

where s' – goal state, s’’ – any state of set S, into which the transition is possible (it means that there exist appropriate factual observations) from the state s performing the action a. N(s,a,s’) - number of factual transitions from the state s into state s', performing the action a;  N (s, a, s' ' ) – the total number of transitions to any possible states from the state s, performing s ''

The development of the clustering procedure. The clustering procedures are the central generalizing mechanisms applied to the actual observations in the process of MDP model building. The precision of the prediction and the "adequacy" of the future model with respect to the learning data depend on the correct choice of the clustering criterion in relation to the clustered data and on the accepted parameter values of the clustering. In this research, agglomerative hierarchical clustering of time series is applied, which requires forming a symmetric distance matrix. Since the objects of clustering are time series, it is first of all necessary to define a metric that allows us to compare quantitatively the similarity of one time series with another. To calculate the distance, we use two simple criteria: the Euclidean distance and a shape-based criterion. The latter allows us to compare not the absolute values but the shapes of the curves of the corresponding time series (4).

$S(a, b) = \sum_{i=1}^{N-1} \left| \Delta_{a,i} - \Delta_{b,i} \right|, \qquad \Delta_{a,i} = a_i - a_{i-1}, \quad \Delta_{b,i} = b_i - b_{i-1}.$  (4)

The Euclidean distance makes it possible to group time series that are close to each other by distance, while the shape-based criterion groups them by the outline (profile) of the curves (see Figure 11).
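A minimal sketch of criterion (4) and of the agglomerative clustering built on it is given below; SciPy's hierarchical clustering stands in for the thesis implementation, and the linkage method, the threshold and the toy series are assumptions.

```python
# Shape-based distance (4) on first differences, fed into agglomerative
# hierarchical clustering via a symmetric distance matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def shape_distance(a, b):
    # S(a, b) = sum_i |delta_a,i - delta_b,i|
    return float(np.abs(np.diff(a) - np.diff(b)).sum())

series = [np.array([1, 2, 3, 4.]), np.array([5, 6, 7, 8.]), np.array([4, 3, 2, 1.])]
n = len(series)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = shape_distance(series[i], series[j])

labels = fcluster(linkage(squareform(D), method="average"),
                  t=2, criterion="maxclust")
print(labels)   # the first two series share a shape, hence a cluster
```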

[Figure 11 shows two panels with clusters A and B: under the Euclidean distance, series are grouped by their absolute values; under the difference-of-change (shape-based) criterion, series are grouped by the shape of their curves.]

Figure 11. Two criteria for evaluating the proximity of time series

This approach allows us to obtain clusters of time series that are close in shape, within groups homogeneous with respect to the scale of the distribution. The clustering of the sale price time series is done according to the shape-based criterion only.

Representation of the MDP model building method in the form of an intelligent system. The basis of dynamic programming methods is the principle of consecutively making interim decisions leading to the objective, so the decision search technique acquires an iterative character. As a consequence, it is preferable to realize the method in the form of a software tool. It is necessary to develop the architecture of the software tool, which predetermines the structure of the intelligent system. The created system ensures the interaction among the dynamic programming methods presented above, as well as the interaction with the database (the data source of the problem) and with an expert (operator).
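A hedged skeleton of the module chain just described (and depicted in Figure 12) is shown below; every function is a hypothetical stand-in with a trivial body, included only so the control flow from the client database to the recommendation can be read end to end.

```python
# Skeleton of the intelligent system in Figure 12 (all bodies are stubs).
def create_time_series(records):          # module "time series creation"
    return [sorted(records)]              # stub: one series per call

def cluster(series):                      # module "time series clustering"
    return {0: series}                    # stub: a single cluster

def build_profiles(clusters):             # module "behaviour profiles creation"
    return [("s0", "a0", "s1")]           # stub: one profile = one transition

def build_model(profiles):                # module "model building"
    return {("s0", "a0"): {"s1": 1.0}}    # stub MDP transition model

def recommend(model, state):              # "module of recommendations"
    actions = [a for (s, a) in model if s == state]
    return actions[0] if actions else None

new_records = [3, 1, 2]                   # arriving from the client database
model = build_model(build_profiles(cluster(create_time_series(new_records))))
print(recommend(model, "s0"))             # -> 'a0', shown to the expert/operator
```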

[Figure 12 depicts the agent within its environment: the client database supplies observations, parameters and new records; inside the agent, time series creation feeds time series clustering, the clusters feed behaviour profile creation, and the profiles П feed model building into the MDP model; the module of recommendations answers queries and returns recommendations to the expert or operator, who makes the decision.]

Figure 12. Diagram of the intelligent system's functioning and its interaction with the environment

The architecture of the intelligent system has to ensure interaction with the user database, the expert and the operator. The architecture of a system meeting all the mentioned demands is given in Figure 12. The elements of the intelligent system are marked with a double outline, including the processing modules, the storage of time series and the profile storage. The intelligent system possesses a visual interface for interaction with the manager (operator) and a programming interface for interaction with the user database. The directions of the data streams are marked with dotted lines, and the control stream with a solid line. The designed system possesses such characteristic properties as autonomous and uninterrupted functioning, adaptation to changing data (in other words, learning of some kind), goal-orientedness (the presence of a goal to optimize some parameters of the process being researched by means of model building), communicativeness (interaction with the users and the database), and others. This allows us to consider the software system a software intelligent agent.

Chapter 4 (The application of MDP model building using methods of Data Mining to the problem of Dynamic Pricing Policy)

The practice of sales administration points to the necessity of a Dynamic Pricing Policy for increasing competitiveness. Markov models are effective when decision making includes uncertainty in the chronicle of events, and crucial events can take place repeatedly. The aim of this chapter is to demonstrate, using a test example, the applicability of the Markov Decision Process to solving the Dynamic Pricing Policy problem. Respectively, the tasks of this chapter are the following:

- to formulate the Dynamic Pricing Policy problem as a dynamic programming task;
- to formulate and describe the method of dynamic control of the pricing policy based on the Markov Decision Process;
- to formulate the method of MDP model building on the basis of regularities detected, using Data Mining tools, in the factual sales data.

The basic reasons for choosing Dynamic Pricing Policy as the experimental problem are:
- the presence of a time dimension, which makes it possible to represent the sales data as time series;
- the possibility of generalization over a number of observed variables (for example, wholesale customer, goods and others) with the aim of building the model;
- the possibility to consider the price correction process as a system which at every moment of time is in a certain state, possesses a controlling mechanism (change of state) and has the concept of a goal state.

The aforementioned reasons are important because they allow us to demonstrate the features of the Markov Decision Process application approach being developed on the Dynamic Pricing Policy problem. Finally, the topicality of Dynamic Pricing Policy, dictated by the rapid evolution of internet technologies in modern business, also determines the choice of this task. The source of information concerning sales, the database and the method development platform in the problem under consideration is the enterprise resource planning system 1C:Enterprise v7.

The Dynamic Pricing Policy task has several definitions. Here we consider the following one: dynamic pricing is the operative adjustment of prices to customers depending on the value with which the customers associate the product or service [49] (the definition, in its turn, is based on [56]). As the value with which the customers associate the product or service, we consider the three forms of price differentiation offered in [69]. A decision making system is developed; it aims at the long-term maximization of sales sums, which is achieved through the correction of the existing sale prices taking into account the available factors of the ERP system (see Figure 13) and the goal state set by an expert.

[Figure 13 shows two blocks. The price generator (the standard pricing subsystem among the basic tools of the ERP, executed once) forms the initial price from the available factors determining it: goods, labor expenses, raw material, profit. The price correction block being researched and developed (an integrated intellectual system, executed regularly) adjusts that price using the available factors and the model built by data mining: wholesale customer, sales date, contract conditions, current price, price model, goal price state.]

Figure 13. Interaction of the price generator module and the price correction module

The price correction mechanism is realized through the use of the MDP model. Building the price correction model includes finding the regularities of the sales evolution process in the past and generalizing them (see Chapter 3). In other words, the solution is reduced to the analysis of changes in the past and the creation of an appropriate model of the price evolution process. The general objective of Markov controlled processes with discounting of incomes is the choice of a system control vector such that the maximum profit is gained over the horizon of the system's functioning [80]. Due to this property, the MDP framework is appropriate for the Dynamic Pricing Policy problem. Processing multidimensional space data, including the data on wholesale buyers and product names, is stipulated in this work from the outset; each dimension of the space is determined over some hundreds of values. The methods of Data Mining are applied in order to discover the regularities that describe the outcomes of the pricing policy actions in the past. A software tool (based on the intelligent agent) for the online tracking of new sales data (which come from the managers and operators of the ERP system) is realized. Such tracking makes it possible to update the model in a timely manner by inserting into it the new data on sales and on the outcomes of the price corrections.

Data model. The price correction problem is a discrete process; in other words, the behaviour of a wholesale customer-goods system can be expressed with a finite number of states. At any discrete time moment ti the system is in one of the possible states sj ∈ S. The sales process is observed over time for fixed values of the appropriate parameters (the dimensions of the space). Each state of the system is determined by two vector values (time series):
- p – the time series of the price, representing the dynamics of price changes;
- v – the time series of the sales volume, representing the dynamics of sales volumes.

Figure 14 (right) shows a sample system with the fixed values "Light" for the dimension Buyer and "Led" for the dimension Goods. The sale price pij and the sale volume vij are the values observed within the framework of the system. Accordingly, the dimensions of the space are the observed variables Customer, Goods and Time. Each point of the hypercube (Figure 14, left) is determined by two static values: the price and the sales volume.
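A minimal sketch of this data model is shown below (the class is hypothetical): a state is the pair of time series p and v observed for fixed Customer and Goods values, as in Figure 14.

```python
# Sketch: a system state for fixed dimension values is a pair of time series.
from dataclasses import dataclass, field

@dataclass
class SystemState:
    customer: str                              # e.g. "Light"
    goods: str                                 # e.g. "Led"
    p: list = field(default_factory=list)      # price dynamics over time
    v: list = field(default_factory=list)      # sales-volume dynamics

state = SystemState("Light", "Led", p=[4.75, 4.65, 4.85], v=[3, 5, 7])
```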

Figure 14. Data hypercube (left) and time series of the state (right)

Apart from the dimensions Goods and Customer, other dimensions also exist; however, they are omitted in this study for the sake of clarity. Let us also suppose that the data have passed all the stages of cleaning and pre-processing; the data cleaning procedures are described in [4].

The price correction problem in terms of dynamic programming. The task is stated in the context of dynamic programming, which means that its solution has to be represented as a sequence of actions, i.e. as the solutions of stated subtasks. The price correction problem possesses this property: for example, the price p of some goods G_i may be transferred sequentially (over some period of time) into a desired value without a significant (given) loss of sales volume, whereas an immediate change of price can cause the loss of wholesale customers. This multistep character of price correction makes Dynamic Programming appropriate for the Dynamic Pricing Policy problem.

The price correction problem in MDP terms. We express the problem described above through the recursive Bellman equation. In MDP terms, the equation below expresses the expected reward gained for the transition from the current state s to the state s' according to a certain policy π [62]:

V^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{\pi}(s')    (5)

The expression for the optimal policy is known as the Bellman optimality equation in MDP terms. It describes the current reward for taking the action that entails the maximal expected reward in the future [62]:

V^{*}(s) = R(s) + \max_{a} \gamma \sum_{s'} P(s' \mid s, a) \, V^{*}(s')    (6)

where s is the state of the system, determined by the vectors p and v: s = {p; v}, and R(s) is the reinforcement gained in the current state s. The goal state, like any state, is determined through its own reinforcement value. In the general case, the reward function determines how good or bad it is to stay in the current state (similar to "pleasure" or "pain" in the biological context). In the case of price correction, the reinforcement represents the local amount gained from the wholesale customer, and it is determined for a state s as follows:

R(s) = \sum_{\tau=1}^{|s|} p_{\tau} \cdot v_{\tau}    (7)

where |s| is the length of the time series contained in the state s, so that the reward is the total income over all days observed within the state s, and τ is the time variable of the time series under study. Let us continue with the variables of equation (6): P(s'|s,a) is the probability of the system's transition from the state s to the state s' when performing the action a. The probability matrix P is calculated by counting the factual observations of each transition relative to the total number of transitions from the state under consideration (see (3)). Since the function V*(·) is present in both the left and the right part of expression (6), the calculation is performed recurrently, i.e. by decomposing the whole task into subtasks. There exist two main algorithms for solving the equation: Value Iteration and Policy Iteration; a minimal sketch of Value Iteration is given below.
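The sketch below solves equation (6) over a tabular transition model; it is an illustration under assumed data structures, not the implementation used in the thesis, and all names are hypothetical.

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Solve the Bellman optimality equation (6) by Value Iteration.

    actions[s] is the list of actions available in s; P[s][a] is a dict
    {s_next: probability}; R[s] is the local reward of equation (7).
    Returns the value function V and the greedy policy pi.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = [sum(p * V[s2] for s2, p in P[s][a].items())
                 for a in actions[s]]
            v_new = R[s] + gamma * max(q) if q else R[s]
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < eps:
            break
    pi = {s: max(actions[s],
                 key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
          for s in states if actions[s]}
    return V, pi
```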

The Dynamic Pricing Policy problem is formulated in this chapter in terms of the Markov Decision Process. The method of model building is offered; the obtained model is considered as the environment in which the MDP functions. The method is based on finding regularities and generalizing them. The methodology of model building is demonstrated in the framework of an elementary system including one wholesale customer and one goods unit. The analysis of the solved problems allows us to formulate the following conclusions:
1. a dynamic system whose development is represented through time series can be expressed through a finite number of states and actions, and can have a transition model and a reward function;
2. the interaction of the decision making and support module with the wholesale customer-goods system, as well as the necessity of continuously updating the model, determine the application of an agent-oriented approach.

Chapter 5 (The performance of experiments regarding the model building in the problem of dynamic pricing policy)

To approbate the suggested model building method, a software platform for carrying out the experiments is developed, the plan of experiments is drawn up, and the process of their implementation is described.

Software development for the execution of experiments. The program modules for performing the experiments are implemented on the platform 1С:Enterprise version 7.7. The choice is determined by the availability, for this platform, of two years of data on product purchases by a real enterprise. The implemented modules can also serve for the creation of final software designed to operate in the background mode and to make decisions concerning price correction. Besides, modules not associated directly with the subject area are implemented for carrying out the experiments: the programs for working with Artificial Neural Networks and Markov Decision Processes in toy problems. The development environment Borland Delphi 7 is involved to ensure and accelerate the processing of massive data whose volume exceeds the capabilities of the platform 1С:Enterprise v7.7 (for example, the creation of a distance matrix there is restricted to 5000 elements in each dimension).

Plan of the experiments. To evaluate the workability and efficiency of the developed method of MDP-model building, the plan of experiments is drawn up (see Table 3). Two series of experiments are included in the plan. By comparing the model with the actual process development, the first series of experiments allows us to evaluate how well the MDP-model fits the learning data. The aim of the second series of experiments is to study the quality of the MDP-model created through the approximation of the space by an Artificial Neural Network.

Table 3. The plan of experiments concerning the building and application of the MDP-model in the Dynamic Pricing Policy problem

| Series Nr | Description | The experiment aim |
|---|---|---|
| I | Using numerical characteristics, to evaluate the similarity of the built MDP-model with respect to the factual processes | To evaluate the correctness of the algorithms of MDP-model building |
| I | Using numerical characteristics, to compare the efficiency of the MDP-model solution with the factual solutions on testing data | To evaluate the efficiency of the MDP-model in the exploitation mode |
| II | Using numerical characteristics, to estimate the similarity of the built MDP + ANN model with respect to the factual processes | To evaluate the correctness of the algorithms of MDP-model building with an approximated space |

The criterion for evaluating the model quality is the proportion of the number of successfully modelled transitions to the number of actual transitions, as well as the evaluation of profit expressed in conventional units.
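The first criterion can be sketched as follows; this is a simplified stand-in for the exact estimate X_(C;G) defined in paragraph 5.2 of the thesis, and the data in the example are invented.

```python
def transition_match_share(modelled, actual):
    """Proportion of actual transitions reproduced by the model
    (a simplified stand-in for the X_(C;G) estimate of para. 5.2)."""
    matched = sum(1 for m, a in zip(modelled, actual) if m == a)
    return matched / len(actual) if actual else 0.0

print(transition_match_share(["s1->s2", "s2->s4"],
                             ["s1->s2", "s2->s3", "s3->s4"]))  # ~0.33
```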

The pricing policy model building and exploitation. The actual data on the sales of products made by the Latvian food industry are used to create the pricing policy model in this experiment. The period of the observed data covers three months: May, June and July. The sales data are represented via the electronic documents of the ERP system „1C:Enterprise v7.7”; each electronic document contains the date of the deal, the name of the wholesale customer, the register of goods, sales volumes and prices. Approximately 28.5 thousand sale documents, 1725 wholesale customers and 1725 marketable item names are present within the period under study. The obtained transition model (see Table 4) represents the Markov Decision Process: in total, 1192 transitions created by 294 states.

Table 4. A transition model fragment

| Initial state s | Action a | Transition state s' | Transition probability P(s'|s,a) |
|---|---|---|---|
| Cl_3 | ~p2 ∈ (−0.05, −0.0009] | Cl_11 | 1.0 |
| Id_14 | ~p2 ∈ (−0.05, −0.0009] | Cl_54 | 0.25 |
| Id_7 | ~p1 ∈ (−0.1, −0.05] | Id_11 | 0.667 |
| Cl_66 | ~p1 ∈ (−0.1, −0.05] | Id_15 | 0.4 |
| Cl_3 | ~p3 ∈ (0.0, 0.01] | Id_17 | 0.667 |
| Id_24 | ~p4 ∈ (0.01, 0.015] | Cl_54 | 0.333 |
| Id_16 | ~p4 ∈ (0.01, 0.015] | Id_23 | 1.0 |
| Cl_66 | ~p5 ∈ (0.015, 0.02] | Id_25 | 0.75 |
| … | … | … | … |
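A minimal sketch of how such a transition model can be estimated by counting observed transitions, as described above for equation (6); the observation log and state names are illustrative only.

```python
from collections import Counter, defaultdict

def estimate_transition_model(log):
    """log: iterable of (s, a, s_next) observations.
    Returns P with P[s][a][s_next] = relative transition frequency."""
    counts = defaultdict(Counter)
    for s, a, s_next in log:
        counts[(s, a)][s_next] += 1
    P = defaultdict(dict)
    for (s, a), ctr in counts.items():
        total = sum(ctr.values())
        P[s][a] = {s_next: n / total for s_next, n in ctr.items()}
    return P

log = [("Cl_66", "~p1", "Id_15"), ("Cl_66", "~p1", "Id_11"),
       ("Cl_66", "~p1", "Id_11"), ("Cl_66", "~p1", "Id_15"),
       ("Cl_66", "~p1", "Id_11")]
print(estimate_transition_model(log)["Cl_66"]["~p1"])
# {'Id_15': 0.4, 'Id_11': 0.6}
```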

Using the graphical package «yEd», a visual representation of the model is possible; a fragment of the model, based on the XML file generated by the experimental platform, is given in Figure 15. The light ovals designate states, the dark ones actions, and the values over the edges the corresponding transition probabilities. The obtained graph represents the pricing model for the corresponding time period. A fragment of the sales volume development for a certain observed combination of wholesale customer and goods is given in the lower part of Figure 15. The states not accompanied by any price changes (in other words, transitions with a zero price change) are outlined with a dotted line; the transitions caused by certain price changes are outlined with a dashed line. The straight arrows show the position of each graph element on the real piece of the sales process. Figure 16 shows the number (as a percentage of the total number 7000) of time series possessing the appropriate value of the numerical estimate X_(C;G), which describes the correspondence of the model and its separate parts to the actual transitions (for details see paragraph 5.2 of the Ph.D. Thesis). We note the absence of marks with a value less than 0.5: every time series is at least half described by the model and has a partial sequence of events. A low mark (less than 0.9) is associated in 97% of cases with a discontinuity of the model (this effect can be observed in Figure 15, above, for single transitions). Nevertheless, 70% of the observations possess the evaluation X_(C;G) = 0.8, which characterizes the model as able to reproduce the majority of processes.


Figure 15. A fragment of the transition graph (above); the dynamics of sales volume for a fixed wholesale customer-goods combination (below; axes: price and volume over the dates 01.05.2008-31.07.2008)

Having the transition model, it becomes possible to use the well-studied policy search algorithms for Markov Decision Processes (Policy Iteration, Value Iteration), for which it is necessary to determine the goal state. Generally, the exploitation of the model for the process under study includes the following stages: (a) to match the current state of the process to one of the model states, which determines the initial state s0; (b) to determine the goal state s_target; (c) to build the policy π* of price corrections; (d) to track the current state and update the policy as the process evolves. A minimal sketch of this loop follows.
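The sketch below illustrates stages (a)-(d); all method names are hypothetical, and the actual implementation in the thesis runs as modules on the 1С:Enterprise platform.

```python
def exploitation_loop(model, process, expert_goal=None):
    """Stages (a)-(d) of model exploitation, as described above."""
    s = model.match_state(process.current_observation())   # (a)
    s_target = expert_goal or model.best_terminal_state()  # (b)
    policy = model.build_policy(s, s_target)                # (c), e.g. Value Iteration
    while s != s_target:                                    # (d)
        process.apply_price_correction(policy[s])
        s = model.match_state(process.current_observation())
        model.update(process.new_sales_data())  # timely model update
        policy = model.build_policy(s, s_target)
```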


Figure 16. Distribution of the value X_(C;G)

The price changes made come into force, and after the specified period the system performs the transition into the next state s_{t+1}. The transition is characterized by the changed price value and by the reaction of demand. Such an interaction of the decision making module with the external wholesale customer-goods system represents the typical interaction of an intellectual system with its environment.


The exploitation of the gained policy on testing data. The aim of this experiment is to evaluate the possibility of applying the model to data not included in the learning sample. In fact, the experiment reflects the initial task assigned to the intellectual system. Unlike the previous experiment, this experiment stipulates the phase of exploiting the optimal policy π*; this makes it possible to evaluate the efficiency of the model numerically, in terms of profit. When building the policy π*, all available states are considered as goal states. The strategy is calculated on the basis of the gained profit (7) and the appropriate actions; this covers all variants of the process evolution and provides the choice of the optimal one in terms of discounted profit. The most demonstrative case of exploiting the developed policy on testing data is the example given in Figure 17. According to the MDP-model, at the transition from the state Cl_438 into Cl_636 two alternatives appear: to perform the action correcting the price into the interval ~p27 ∈ (0.15, 0.1625], or the action correcting the price into the interval ~p24 ∈ (0.1125, 0.125]. Depending on the chosen action, the terminal states can be Id_2181, Cl_683 or Id_734. Since the local reward values (the total income at the site of the state) are known, it becomes possible to calculate the policies for achieving the terminal states and to choose the optimal one.


Sale price Data Figure 17. The fragment of transition graph and optimal policy * (above); dynamic of sale values and prices for a particular combination wholesale customer-goods Despite the fact that the terminal states Cl_683 does not possess the maximum value of local reward, it is the most “attractive” one in terms of discounted reward constituting the value p24 ) = 405,585. The maximal discounted reward determines the optimal politics *, V(Cl_683 | ~ represented in Figure 17. with light grey bold arrow.


Ultimately, the precision depends on which general features of the sales process evolution are derived, when building the MDP-model, from the individual observed cases. To evaluate the effectiveness of the agent functioning within the framework of 5000 combinations, Figure 18 gives histograms reflecting the distribution of the combinations according to the values of the estimates comparing the modelled process with the actual process. According to Figure 18, it is possible to conclude that the model predominantly contains combinations (77.8%) for which the price corrections lead to positive results (i.e. the profit values are greater than zero). A negative value of the estimate Δ means that the price corrections offered by the model turned out to be less effective than the solutions provided by the expert.

Figure 18. The distribution of customer-goods combinations according to profit (left) and according to distance (right)

The presence of customer-goods combinations for which the price corrections cause losses is explained by the insufficient number of individual observations found for such combinations to create a valid transition graph.

MDP space approximation by means of ANN: experiments. This experiment performs the practical research into the possibility of approximating the Dynamic Pricing Policy decision space using Artificial Neural Networks. The details of the method and its application to the toy mountain car problem are provided in subsection 2.3. The approximation of the transition space is needed to obtain transition probabilities for states in which the system has never been before but which it can potentially reach; in such a case it is advisable to have at least an estimated value of the transition probabilities. The convergence curves for various numbers of hidden layers and neurons in the hidden layer of the ANN are given in Figure 19. The configuration "22-66-66-1" (two hidden layers, 66 neurons in each) can be marked as the "quickest" network configuration by the number of learning iterations used; the "slowest" is the configuration "22-88-1". If the time spent on one learning iteration is taken into consideration, the configuration "22-22-22-1" turns out to be the quickest: it achieves a result comparable with the configuration "22-66-66-1" in a larger number of iterations, but in a shorter time. For this reason, the configuration "22-22-22-1" is used in the further experiments. To evaluate the quality of the approximation, we perform a cross validation test: the whole learning sample is divided into 10 blocks, each block being compiled of records taken from the learning sample at a given interval. The convergence plots for the 10 cross validations are represented in Figure 20; all validations are performed on the network with the configuration "22-22-22-1". A minimal sketch of this setup is given below.
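The sketch takes scikit-learn's MLPRegressor as a stand-in for the thesis ANN; the 22-feature encoding and the synthetic data replacing the tabulated transition probabilities are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# 22 inputs encoding (state, action, next state), 1 output: the transition
# probability; hidden_layer_sizes=(22, 22) mimics the "22-22-22-1" layout.
rng = np.random.default_rng(0)
X = rng.random((2000, 22))
y = rng.random(2000)  # stand-in for tabulated P(s'|s,a) values

# Interval-based cross-validation blocks: block k takes every 10th record.
blocks = [np.arange(k, len(X), 10) for k in range(10)]

for k, test_idx in enumerate(blocks[:2]):  # two folds shown for brevity
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    net = MLPRegressor(hidden_layer_sizes=(22, 22), max_iter=2000,
                       random_state=0)
    net.fit(X[train_idx], y[train_idx])
    rmse = float(np.sqrt(np.mean((net.predict(X[test_idx]) - y[test_idx]) ** 2)))
    print(f"fold {k}: RMSE = {rmse:.3f}")
```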


Figure 19. Convergence of the ANN for different parameters (RMSE over the epochs of learning)

We note that the convergence on the test sample does not approach zero asymptotically, unlike the learning sample, but is kept at a certain level. At the same time, the root mean square error (RMSE) on the test sample has the same scale of values as the error on the learning sample. Analyzing the approximation result of one of the cross validation blocks (see Figure 21), one can note that the network is able to reproduce the main traits of the test function (the RMSE value is 0.17).


Figure 20. Convergences of the cross validations (RMSE of the learning and test samples over the epochs of learning)

Such an error value allows us to use the approximation of the model for building the MDP action policy. The policy building algorithm now uses not the tabular representation of the transition probability function but its approximation performed by the ANN; a sketch of this substitution follows. To obtain a precision estimate comparable with the evaluations of the previous experiments, we use expression (5.2) of the thesis, which allows us to calculate, for each wholesale customer-goods combination (C; G), the numerical estimate X_(C;G) of the correspondence of the model (or its separate parts) to the factual transitions.
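A minimal sketch of the substitution; the names are hypothetical, and `encode` stands for the 22-feature input encoding assumed above.

```python
def transition_probability(net, s, a, s_next, encode):
    """Approximated P(s'|s,a): the ANN replaces the lookup table, so an
    estimate exists even for states never observed before."""
    p = float(net.predict([encode(s, a, s_next)])[0])
    return min(max(p, 0.0), 1.0)  # clip the regression output to [0, 1]
```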


Figure 21. A fragment of the test sample approximation (the network outcome versus the desired transition probability over the records of the test sample)

The number (as a percentage of the total amount of 5000) of time series having the appropriate value of the numerical estimate X_(C;G) is represented in Figure 22. In comparison with the results given in Figure 16, the approximated model has a smaller precision. The important advantage of the approach is that it becomes possible to take decisions in states in which the system has not been before.

Figure 22. Distribution of the estimate X_(C;G) for the approximated model

In practice, taking into account the relatively high error, such a decision can be offered to the expert for considering the possible price correction strategies.

Chapter 6 (The results analysis and conclusions)

It is shown that a dynamic system characterized by the presence of time series can be expressed through finite sets of states and actions and can have a transition model and a reward function. Most attention was concentrated on the method of building the model considered as the environment in which the MDP apparatus functions. The method is based on finding regularities in the evolution of the observed variables over time and on generalizing the detected regularities. The employed agent-oriented architecture ensured an interaction of the MDP-model with the environment in which permanent learning of the MDP-model and transformation of the policy can be performed within a dynamic environment. The offered experimental platform allowed us to approve the developed method of model building in the circumstances of a real task. The platform allows the expert to choose the desirable goal state.


At the same time, the goal state can be selected automatically from the subset of states with the maximum value of reward, or as any other terminal state whose achievement yields the maximum discounted income. The method of fully automated building of the price evolution model is demonstrated in the framework of a real system including hundreds of customers and product names. The validation of the model showed acceptable precision; to increase the precision, appropriate experiments (search for parameters, revision of the data structure) are needed, as well as the improvement of certain algorithms, etc.

The main results of this promotion thesis are the following:
- the current state of the problem of building a Markov model of a dynamic system represented through time series within a multidimensional space of observed variables is researched;
- such areas of Computational Intelligence as intellectual agent systems, Data Mining procedures, Artificial Neural Networks, etc. are researched with the aim of developing the method of building the MDP-model of a dynamic system and applying it to testing data;
- the new multistep approach to building the MDP-model in the case of approximating the state value table by an Artificial Neural Network is developed and approved;
- the method of MDP-model building is developed, based on searching for regularities in the evolution processes of the observed variables and transforming them into the set of states, the set of actions and the transition probability function;
- approaches for transforming real problem data into structures meeting the MDP framework are offered: the methods and the data structure (the behaviour profile of an observed variable) are developed to transform multidimensional time series into the states of a Markov Decision Process, methods of building the action set are suggested, and a method of searching for goal states is offered;
- to approbate the suggested method of MDP-model building, the experimental software platform is developed, together with a series of accompanying software tools;
- in the course of experiments based on real sales data, numerical evaluations are obtained of the closeness of the MDP-model to the factual evolution of the processes under investigation, as well as the evaluation of the agent system functioning on testing data.


MAIN RESULTS OF THE THESIS

The decision making method based on the MDP and ensuring MDP-model building in problems which contain data presented as multidimensional time series was developed as part of the doctoral work. The method was tested in a series of experiments; as a result, numerical estimates were obtained which allow us to conclude that the method is able to build an MDP-model that adequately reflects the learning sample. The following tasks have been solved and results obtained.
1. The review of mathematical methods based on Markov Decision Processes allows us to conclude that MDP and Reinforcement Learning can be considered an effective method for the modelling of dynamic systems; some key problems of their use in tasks which contain data as multidimensional time series were defined.
2. The analysis of several computational intelligence techniques (ANN, RL, agent-based systems, methods of Data Mining, etc.) allowed us to describe the main features of the developed method (the pipeline organization of the data transformation, timely updates of the MDP-model, and so on).
3. A special agent-based architecture was developed to avoid an incorrect description of the interaction of the intelligent system with the environment.
4. The approximation of the state space and decision space of the MDP with the use of an Artificial Neural Network was implemented; the efficiency of the approach was demonstrated on a toy problem.
5. The intermediate structure (the profile of a studied value's behaviour) for storing and processing the identified patterns of the investigated time series was formulated. The behaviour profiles of the values being researched are used for the MDP-model creation.
6. The approach of using different criteria for clustering time series (Euclidean distance and shape-based similarity) according to the semantic load of each studied variable was offered.
7. For the purpose of testing the method, the problem of Dynamic Pricing Policy was formulated within the MDP framework, and the software for implementing the experiments was developed.
8. A series of experiments enabling one to quantify the effectiveness of the MDP-model building was carried out. The assessment was based on the comparison of the resulting model with the training sample, and also on the application of the model to data outside the training set.


LIST OF REFERENCES
1. Carkova V., Šadurskis K. Gadījuma procesi. – Rīga: RTU, 2005. – 138 lpp.
2. Čižovs J., Borisovs A. Markov Decision Process in the Problem of Dynamic Pricing Policy // Automatic Control and Computer Sciences. – No. 6, Vol. 45 (2011), pp. 77-90. Indexed in: SpringerLink, Ulrich's I.P.D., VINITI.
3. Čižovs J., Borisovs A., Zmanovska T. Ambiguous States Determination in Non-Markovian Environments // RTU zinātniskie raksti. 5. sēr., Datorzinātne. – 36. sēj. (2008), 140.-147. lpp. Indexed in: EBSCO.
4. Chizhov Y., Zmanovska T., Borisov A. Temporal Data Mining for Identifying Customer Behaviour Patterns // Workshop Proceedings Data Mining in Marketing DMM'2009, 9th Industrial Conference ICDM 2009, Leipzig, Germany, 22-24 July, 2009. – IBaI Publishing, 2009. – P. 22-32. Indexed in: DBLP, Io-port.net.
5. Chizhov Y. An Agent-Based Approach to the Dynamic Price Problem // Proceedings of the 5th International KES Symposium on Agent and Multi-agent Systems, Agent-Based Optimization KES-AMSTA/ABO'2011, Manchester, U.K., June 29 - July 1, 2011. – Heidelberg: Springer-Verlag Berlin, 2011. – P. 446-455. Indexed in: SpringerLink, Scopus, ACM DL, DBLP, Io-Port.
6. Chizhov Y., Kuleshova G., Borisov A. Manufacturer – Wholesaler System Study Based on Markov Decision Process // Proceedings of the 9th International Conference on Application of Fuzzy Systems and Soft Computing, ICAFS 2010, Prague, Czech Republic, August 26-27, 2010. – b-Quadrat Verlag, 2010. – P. 79-89.
7. Chizhov Y., Kuleshova G., Borisov A. Time series clustering approach for decision support // Polish Journal of Environmental Studies. – Vol. 18, N4A (16th International Multi-Conference ACS-AISBIS, Miedzyzdroje, Poland, 16-18 October, 2009), pp. 12-17. Indexed in: Scopus, Web of Science.
8. Chizhov J., Borisov A. Applying Q-Learning to Non-Markovian Environments // Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2009), Porto, Portugal, January 19-21, 2009. – INSTICC Press, 2009. – P. 306-311. Indexed in: Engineering Village2, ISI Web of Knowledge, SCOPUS, DBLP, Io-port.net.
9. Chizhov Y. Particulars of Neural Networks applying in Reinforcement Learning // Proceedings of the 14th International Conference on Soft Computing "MENDEL 2008", Brno, Czech Republic, 18-20 June, 2008. – Brno: BUT, 2008. – P. 154-160. Indexed in: ISI Web of Knowledge, INSPEC.
10. Chizhov Y. Reinforcement learning with function approximation: survey and practice experience // Proceedings of the International Conference on Modelling of Business, Industrial and Transport Systems, Riga, Latvia, May 7-10, 2008. – Riga: TSI, 2008. – P. 204-210. Indexed in: ISI Web of Knowledge.
11. Chizhov J. Software agent developing: a practical experience // Scientific Proceedings of Riga Technical University: RTU 48. rakstu krājums, 5. sērija, 31. sējums, 12 October, 2007, Riga Technical University, Riga, Latvia.
12. Chizhov J., Borisov A. Increasing the effectiveness of reinforcement learning by modifying the procedure of Q-table values update // Proceedings of the Fourth International Conference on Soft Computing, Computing with Words and Perceptions in System Analysis, Decision and Control "ICSCCW-2007", Antalya, Turkey, 27-28 August, 2007. – b-Quadrat Verlag, 2007. – P. 19-27.
13. Chizhov J. Agent Control in World with Non-Markovian Property // Poster presentation at EWSCS'07: Estonian Winter School in Computer Science, Palmse, Estonia, March 4-9, 2007.
14. Athanasiadis I.N., Mitkas P.A. An agent-based intelligent environmental monitoring system // Management of Environmental Quality. – Vol. 15 (2004), P. 229-237.
15. Baxter J., Bartlett P.L. Infinite-horizon policy-gradient estimation // Journal of Artificial Intelligence Research. – Vol. 15 (2001), P. 319-350.
16. Beitelspacher J., Fager J., Henriques G., et al. Policy Gradient vs. Value Function Approximation: A Reinforcement Learning Shootout. Technical Report No. CS-TR-06-001, School of Computer Science, University of Oklahoma, Norman, OK 73019, 2006.
17. Bellman R. Dynamic Programming. – New Jersey: Princeton University Press, 1957.
18. Bertsekas D.P., Tsitsiklis J. Neuro-Dynamic Programming. – Athena Scientific, 1996. – 512 p.
19. Butz M.V. Rule-based evolutionary online learning systems: learning bounds, classification, and prediction. – Ph.D. dissertation, Graduate College of the University of Illinois at Urbana-Champaign, 2004.
20. Carreras M. et al. Application of SONQL for real-time learning of robot behaviours // Robotics and Autonomous Systems. – Vol. 55, Issue 8 (2007), P. 628-642.
21. Cervenka R., Trencansky I. AML. The Agent Modeling Language: A Comprehensive Approach to Modeling MAS. – Berlin: Springer, 2007. – 355 p.
22. Chakraborty D., Stone P. Online Model Learning in Adversarial Markov Decision Processes // Proceedings of the 9th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2010), Toronto, Canada, May 10-14, 2010. – Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2010. – P. 1583-1584.
23. Cheung T., Okamoto K., Maker F., et al. Markov Decision Process Framework for Optimizing Software on Mobile Phones // Proceedings of the 9th ACM IEEE International Conference on Embedded Software, EMSOFT 2009, Grenoble, France, October 12-16, 2009. – New York: ACM, 2009. – P. 11-20.
24. Cotofrei P., Stoffel K. Rule extraction from time series databases using classification trees // Proceedings of the 20th IASTED Conference on Applied Informatics, Innsbruck, Austria, February 18-21, 2002. – Calgary, Canada: ACTA Press, 2002. – P. 327-332.
25. Crespo F., Weber R. A methodology for dynamic data mining based on fuzzy clustering // Fuzzy Sets and Systems. – Vol. 150 (2005), P. 267-284.
26. Das G., Lin K.-I., Mannila H., et al. Rule discovery from time series // Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York City, USA, August 27-31, 1998. – NY: AAAI Press, 1998. – P. 16-22.
27. Fager J. Online Policy-Gradient Reinforcement Learning using OLGARB for SpaceWar // Technical report, University of Oklahoma, 660 Parrington Oval, Norman, OK 73019, USA, 2006. – P. 5.
28. Franklin S., Graesser A. Is it an Agent, or just a Program?: A Taxonomy for Autonomous Agents // Proceedings of the Workshop on Intelligent Agents III, Agent Theories, Architectures, and Languages, ECAI'96, Budapest, Hungary, August 11-16, 1996. – London: Springer-Verlag, 1997. – P. 21-35.
29. Ganzhorn D., de Beaumont W. Learning Algorithms and Quake // Technical report, University of Rochester, March 19, 2004. – P. 13.
30. Gearhart C. Genetic Programming as Policy Search in Markov Decision Processes // Genetic Algorithms and Genetic Programming at Stanford. – (2003), P. 61-67.
31. Goto J., Lewis M.E., Puterman M.L. A Markov Decision Process Model for Airline Meal Provisioning // Transportation Science. – Vol. 38, No. 1 (2004), P. 107-118.
32. Guestrin C., Koller D., Gearhart C., et al. Generalizing Plans to New Environments in Relational MDPs // Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9-15, 2003. – San Francisco, USA: Morgan Kaufmann, 2003. – P. 1003-1010.
33. Hassan Md.R., Nath B. Stock market forecasting using hidden Markov model: A new approach // Proceedings of the 5th International Conference on Intelligent Systems Design and Applications, ISDA'05, Wroclaw, Poland, 8-10 September, 2005. – Washington, USA: IEEE Computer Society, 2005. – P. 192-196.
34. Haykin S. Neural Networks and Learning Machines (3rd Edition). – New Jersey: Prentice Hall, 2008. – 936 p.
35. Hewitt C. Viewing Control Structures as Patterns of Passing Messages // Artificial Intelligence. – Vol. 8(3) (1977), P. 323-364.
36. Howard R.A. Dynamic Programming and Markov Processes. – Cambridge, MA: MIT Press, 1960. – 136 p.
37. Jacobs S. Applying ReadyLog to Agent Programming in Interactive Computer Games // Diplomarbeit, Fakultät für Mathematik, Informatik und Naturwissenschaften der Rheinisch-Westfälischen Technischen Hochschule Aachen, 2005.
38. Kampen van N.G. Remarks on Non-Markov Processes // Brazilian Journal of Physics. – Vol. 28, Nr. 2 (1998), P. 90-96.
39. Keogh E., Lin J., Truppel W. Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research // Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), Melbourne, Florida, USA, November 19-22, 2003. – Melbourne: IEEE Computer Society, 2003. – P. 115-122.
40. Konar A. Computational Intelligence: Principles, Techniques and Applications. – Berlin Heidelberg: Springer-Verlag, 2005. – 732 p.
41. Krzysztof L. Markov Decision Processes in Finance // Master's Thesis, Department of Mathematics, Vrije Universiteit Amsterdam, 2006.
42. Lazaric A., Taylor M.E. Transfer Learning in Reinforcement Learning Domains // Lecture material of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases '09, Bled, Slovenia, September 7-11, 2009.
43. Lee S.-l., Chun S.-J., Kim D.-H., Lee J.-H., et al. Similarity Search for Multidimensional Data Sequences // Proceedings of the IEEE 16th International Conference on Data Engineering, San Diego, USA, 28 February - 3 March, 2000. – IEEE Computer Society, 2000. – P. 599-608.
44. Li C., Wang H., Zhang Y. Dynamic Pricing Decision in a Duopolistic Retailing Market // Proceedings of the 6th World Congress on Intelligent Control and Automation, Dalian, China, 21-23 June, 2006. – IEEE, 2006. – P. 6993-6997.
45. Lin L.-J. Reinforcement Learning for Robots Using Neural Networks // Ph.D. thesis, Carnegie Mellon University, Pittsburgh, CMU-CS-93-103, 1993.
46. Lind J. Issues in Agent-Oriented Software Engineering // Agent-Oriented Software Engineering: First International Workshop, AOSE 2000. Lecture Notes in Artificial Intelligence. – Vol. 1957 (2001), P. 45-58.
47. Melo F.S., Ribeiro M.I. Coordinated Learning in Multiagent MDPs with Infinite State-Space // Autonomous Agents and Multi-Agent Systems. – Vol. 21, Number 3 (2010), P. 321-367.
48. Mitkus S., Trinkūnienė E. Reasoned Decisions in Construction Contracts Evaluation // Baltic Journal on Sustainability. – Vol. 14, Nr. 3 (2008), P. 402-416.
49. Narahari Y., Raju C., Ravikumar K., et al. Dynamic Pricing Models for Electronic Business // Sadhana. – Vol. 30, Parts 2 & 3 (2005), P. 231-256.
50. Palit A.K., Popovic D. Computational Intelligence in Time Series Forecasting: Theory and Engineering Applications. – London: Springer, 2005. – 372 p.
51. Povinelli R.J., Xin F. Temporal pattern identification of time series data using pattern wavelets and genetic algorithms // Artificial Neural Networks in Engineering. – New York: ASME Press, 1998. – P. 691-696.
52. Powell W.B. Approximate Dynamic Programming I: Modeling // Encyclopedia of Operations Research and Management Science. – John Wiley and Sons, 2011. – P. 1-11.
53. Pranevičius H., Budnikas G. PLA-Based Formalization of Business Rules and Their Analysis by Means of Knowledge-Based Techniques // Baltic Journal on Sustainability. – Vol. 14, Nr. 3 (2008), P. 328-343.
54. Puterman M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. – New York: John Wiley & Sons, 1994. – 649 p.
55. Pyeatt L.D., Howe A.E. Decision Tree Function Approximation in Reinforcement Learning // Proceedings of the Third International Symposium on Adaptive Systems: Evolutionary Computation and Probabilistic Graphical Models. – Vol. 2, Issue 1/2 (2001). – Citeseer, 2001. – P. 70-77.
56. Reinartz W. Customising Prices in Online Markets // European Business Forum. – Issue 6 (2001). – P. 35-41.
57. Rosu I. The Bellman Principle of Optimality. – Available at: http://appli8.hec.fr/rosu/research/notes/bellman.pdf (accessed 12 December 2011).
58. Russell S., Norvig P. Artificial Intelligence: A Modern Approach, 2nd edition. – New Jersey: Prentice Hall, 2003. – 1132 p.
59. Song H., Liu C.-C. Optimal Electricity Supply Bidding by Markov Decision Process // IEEE Transactions on Power Systems. – Vol. 15, Nr. 2 (2000). – P. 618-624.
60. Sonnenberg F.A., Beck J.R. Markov models in medical decision making: a practical guide // Medical Decision Making. – Vol. 13, Nr. 4 (1993). – P. 322-338.
61. Sunderejan R., Kumar A.P.N., Badri K.K.N., et al. Stock market trend prediction using Markov models // Electron. – Vol. 1, Issue 1 (2009). – P. 285-289.
62. Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. – Cambridge, MA: MIT Press, 1998. – 342 p.
63. Sutton R.S., McAllester D., Singh S., et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation // Advances in Neural Information Processing Systems. – Vol. 12 (2000). – P. 1057-1063.
64. Symeonidis A.L., Kehagias D., Mitkas P.A. Intelligent policy recommendations on enterprise resource planning by the use of agent technology and data mining techniques // Expert Systems with Applications. – N 25 (2003). – P. 589-602.
65. Taylor M.E. Transfer in Reinforcement Learning Domains. Studies in Computational Intelligence. – Berlin: Springer-Verlag, 2009. – 244 p.
66. Tokic M. Exploration and Exploitation Techniques in Reinforcement Learning // Invited lecture, Ravensburg-Weingarten University of Applied Sciences, Germany, November 2008.
67. Vanthienen J. Ruling the business: About business rules and decision tables // New Directions in Software Engineering. – (2001), P. 103-120.
68. Varges S., Riccardi G., Quarteroni S., et al. The exploration/exploitation trade-off in Reinforcement Learning for dialogue management // Proceedings of the IEEE Workshop on Automatic Speech Recognition & Understanding ASRU'09, Merano, Italy, December 13-17, 2009. – IEEE Signal Processing Society, 2009. – P. 479-484.
69. Varian H.R. Differential Pricing and Efficiency // First Monday. – Vol. 1, Nr. 2 (1996). – P. 1-10.
70. Vengerov D. A Gradient-Based Reinforcement Learning Approach to Dynamic Pricing in Partially-Observable Environment // Future Generation Computer Systems. – Vol. 24/7 (2008). – P. 687-693.
71. Wang R., Sun L., Ruan X.-G., et al. Control of Inverted Pendulum Based on Reinforcement Learning and Internally Recurrent Net // Proceedings of the International Conference on Intelligent Computing, HeFei, China, August 23-26, 2005. – IEEE Computational Intelligence Society, 2005. – P. 2133-2142.
72. Mausam, Weld D.S. Solving Concurrent Markov Decision Processes // Proceedings of the 19th National Conference on Artificial Intelligence, San Jose, California, July 25-29, 2004. – AAAI Press, 2004. – P. 716-722.
73. Witten I.H., Frank E. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. – Morgan Kaufmann, 2005. – 560 p.
74. Беллман Р., Дрейфус С. Прикладные задачи динамического программирования. – Москва: «Наука», 1965. – 460 с.
75. Бережная Е.В., Бережной В.И. Математические методы моделирования экономических систем. – Москва: «Финансы и статистика», 2006. – 432 с.
76. Воробьев Н.Н. Предисловие редактора перевода к: Беллман Р. Динамическое программирование. – Москва: «Издательство иностранной литературы», 1960. – 400 с.
77. Горбань А.Н. Обобщенная аппроксимационная теорема и вычислительные возможности нейронных сетей // Сибирский журнал вычислительной математики. – Т. 1, Nr. 1 (1998), с. 12-24.
78. Кузнецов Ю.Н., Кузубов В.И., Волощенко А.Б. Математическое программирование: Учеб. пособие. – 2-е изд., перераб. и доп. – Москва: «Высшая школа», 1980. – 300 с.
79. Растригин Л.А. Современные принципы управления сложными объектами. – Москва: «Советское радио», 1980. – 232 с.
80. Таха Х.А. Введение в исследование операций. – 7-е изд. – Москва: «Вильямс», 2005. – 912 с.
81. Тёрнер Д. Вероятность, статистика и исследование операций. – Москва: «Статистика», 1976. – 431 с.
82. Фомин Г.П. Математические методы и модели в коммерческой деятельности. – Москва: «Финансы и статистика», 2005. – 616 с.
83. Черноусько Ф.Л. Динамическое программирование // Соросовский образовательный журнал. – № 2 (1998), с. 139-144.

Jurijs ČIŽOVS

DEVELOPMENT AND STUDY OF A CONTROLLED MARKOV DECISION MODEL OF A DYNAMIC SYSTEM BY MEANS OF DATA MINING TECHNIQUES

Ph.D. Thesis Summary

Registered for printing on 31.01.2012. Registration Certificate No. 2-0282. Format 60x84/16. Offset paper. 2.25 printing sheets, 1.78 author's sheets. Print run: 30 copies. Order Nr. 10. Printed and bound at the RTU Printing House, 1 Kalku Street, Riga, LV-1658, Latvia.

