Modeling Dopamine Activity in the Brain with Reinforcement Learning
A case study

KARIN BJÖRKMAN

Master’s Degree Project Stockholm, Sweden 2004

TRITA-NA-E04140

Numerisk analys och datalogi KTH 100 44 Stockholm

Department of Numerical Analysis and Computer Science Royal Institute of Technology SE-100 44 Stockholm, Sweden


Master's Thesis in Computer Science (20 credits) at the School of Computer Science and Engineering, Royal Institute of Technology, 2004.
Supervisor at Nada was Örjan Ekeberg.
Examiner was Anders Lansner.

Abstract

A serious problem in society today is drugs and their consequences. One approach to this problem is to administer a drug that will block the dopamine receptors in the brain. This approach originates in the theory that the increased dopamine levels caused by drugs are believed to be one of the main reasons why we become addicted to them. One way to study the mechanisms of craving, used by Professor Kenneth F. Green, is to utilize the Conditioned Place Preference method. The dopamine activity in the mesolimbic system has been studied and modeled with Temporal Difference (TD) methods. The focus of these studies has, however, been on the pairing of a stimulus with a reward, i.e. classical conditioning. The purpose of this thesis is to examine whether one can use these existing models to create a model of how a conditioned place preference among rats, due to amphetamine, is developed. By first analyzing and decomposing the existing TD model and then adapting it to the conditions of Green's experiment, it becomes possible to model the dopamine activity during the conditioning process that pairs a certain environment with the craving for amphetamine. Finally, different methods for improving the resulting model are tried out.

Modeling of Dopamine Activity in the Brain using Reinforcement Learning – A Case Study

Sammanfattning (Summary in Swedish)

A serious problem in today's society is drugs and their consequences. One way to approach this problem is to treat drug addiction with a substance that blocks the brain's dopamine receptors. This approach is based on the theory that increased dopamine levels in the brain are believed to be one of the reasons why we become addicted to drugs. One method for studying how drug craving arises and works, which Professor Kenneth F. Green has used, is conditioned place preference. The dopamine activity in the mesolimbic system has been studied and modeled with Temporal Difference (TD) methods. The focus of these studies has, however, been on the pairing of a stimulus with a reward, so-called classical conditioning. The purpose of this thesis is to examine whether these existing models can be used to model how a conditioned place preference, caused by amphetamine, develops in rats. By first analyzing the existing TD model and then adapting it to the conditions of Green's experiment, it becomes possible to model the dopamine activity during the conditioning process that pairs a certain environment with the craving for amphetamine. Finally, different methods for improving the resulting model are evaluated.

Acknowledgments

This thesis was written at the Psychology Department at California State University, Long Beach and at NADA, KTH. It was done in close collaboration with MSc student Oya Özuymaz, whom I would like to thank most of all. I would also like to thank Örjan Ekeberg, my supervisor at NADA, for all the help and support throughout the entire project, and Professor Kenneth F. Green at CSULB, without whom this would never have been possible. Finally, I would like to thank my dad, Gunnar Björkman, for all the feedback on my report.

Contents

1 Background
  1.1 Machine Learning
  1.2 Drug Abuse
  1.3 Dayan's Model
  1.4 Purpose

2 Theory
  2.1 Reinforcement Learning
  2.2 Different Reinforcement Methods
    2.2.1 Temporal Difference
  2.3 The Biological view of Reinforcement Learning
    2.3.1 Conditioning
    2.3.2 Prediction and Reward
  2.4 Dopamine
    2.4.1 Drugs and their effect on dopamine
    2.4.2 The role of dopamine in reward
  2.5 Modeling Dopamine with Temporal Difference

3 Methods
  3.1 The reward-function
  3.2 Improving the model

4 Results
  4.1 A continuous reward-function
  4.2 Improvements

5 Discussion
  5.1 The TD(n-step) and Monte Carlo Techniques
  5.2 The Future

6 Conclusions

Bibliography

Chapter 1

Background

1.1 Machine Learning

If computers could learn to solve problems through trial and error, a lot of problems would become solvable, e.g. control systems for aircraft and automated manufacturing systems. Reinforcement Learning is an approach to machine learning which combines Dynamic Programming and supervised learning. Together, these two disciplines can solve problems which neither of them can handle on their own. Dynamic Programming is usually used for solving optimization and control problems, whereas supervised learning is a Machine Learning technique used for prediction of future outcomes. The main drawback of supervised learning is that the models one uses for prediction require training on very large data sets, which must contain the correct outcome for the problem instances used for training. [Harmon and Harmon 1996]

Reinforcement Learning, on the other hand, is an unsupervised Machine Learning method. Unsupervised means that the program learns as it goes, without the use of a similar example as a guide. This is accomplished by processing one example at a time, unlike supervised methods, which need to be trained on an extensive quantity of examples where the answers are known in order to be able to predict future unseen examples. In order for the Reinforcement Learning program to learn directly from the presented examples, it uses a reward signal, which it receives through trial-and-error interaction with its environment. [Sutton and Barto 1998]

Reinforcement Learning is one of the cases in which it is possible to derive computational, psychological and neurobiological constraints, and in which there are theories that satisfy many of the constraints at all levels. One of the various neural systems that have been studied from the perspective of Reinforcement Learning is the primate midbrain dopaminergic system and its targets. [Dayan 1999]

Professor Peter Dayan at University College London has in a number of experiments shown that Reinforcement Learning is a suitable method when modeling the midbrain dopamine system. The specific Reinforcement Learning algorithm that Dayan has used is "Temporal Difference" [Dayan 1999]. The experiments of Professor Dayan and others all describe and model how the dopamine level changes over time when an animal learns that it will receive a small reward when performing a simple task.

1.2 Drug Abuse

Dopamine can be seen as the body's own reward system. Every time we do something good, an act that makes us feel at ease, it is actually the brain releasing dopamine. When we abuse drugs, a lot more dopamine than usual is released, resulting in a great feeling of satisfaction. As the effect of the drugs decreases we start craving new ones, due to a significantly lower dopamine level than during the rush. As we take the drug over and over again, we get used to the extremely high levels of dopamine, thereby getting addicted to them. This is why dopamine plays such an important role in why we become addicted to drugs.

A vast problem in society today is drugs and the abuse of them. One approach to solving drug related problems is to administer a drug that will block the dopamine receptors in the brain, thus preventing the abuser from desiring drugs. The focus on the craving for drugs reflects the fact that it is considered to be one of the most difficult problems that a therapist will face when treating drug addicts. It has been suggested that craving is caused by a conditioning process where repeated pairing of environmental cues with drug effects sensitizes the reactions of the addict to the environmental cues. Dopaminergic activity in the mesolimbic pathway is thought to underlie this sensitization and to imbue the environmental cues with incentive salience. The mesolimbic system, located in the midbrain, reacts to drugs of abuse by releasing dopamine at its terminals in response to the most rewarding substance, thus producing increased and decreased reward associated with dopaminergic administration and blockade [Berg and Green 2004].

A common way of studying the mechanisms that underlie craving is to utilize the Conditioned Place Preference method, CPP. In this method the rewarding effects of a drug are paired with a specific environment and the resulting craving is then measured by monitoring the amount of time spent in the drug-associated environment, compared to one or more other environments. One promising strategy for treating drug addicts is to reduce their cravings by administering an agent that reduces the activity of dopamine in the mesolimbic system.

One such agent is 7-OH-DPAT, a dopamine agonist that is selective for D3-receptors, which are located on neurons in the VTA, the substantia nigra and in terminal zones of the mesolimbic system [Berg and Green 2004]. Professor Kenneth F. Green at California State University, Long Beach does research on this type of drug and its possible usage in treating drug addicts. The focus of Green's latest experiment is on how rats develop a Conditioned Place Preference, CPP, for a certain environment when given amphetamine and then placed in that specific setting. He then continues by examining how the CPP is influenced as the rats are given a dopamine blocker [Dayan et al. 1997]. The development of the conditioned place preference among the rats can here be viewed as a process where the rats learn to prefer one side of a box. In this specific experiment, the rats were conditioned to prefer a gray side over one that was striped [Berg and Green 2004].

The CPP was brought about by the following course of action: at time 0, the rats are injected with the amphetamine in a neutral cage. After 15 minutes, i.e. at time 15, the amphetamine starts to influence the rats and they are therefore placed in the gray compartment at this time. The rats are kept in the gray area for 20 minutes, after which the effects of the amphetamine start to decline. The procedure is repeated four times, after which the rats are conditioned on the gray side.

1.3 Dayan's Model

Dayan and others have conducted several experiments which examine how the dopamine level is influenced as an animal learns that it will receive a small reward when performing some sort of simple task. In one of these experiments, Dayan performs a test where a monkey is presented with a light, a sensory cue, and then given fruit juice as a reward when it presses a lever in its cage. As the monkey is trained on the same procedure over and over again, it will eventually learn that by pressing the lever it will receive the fruit juice. This procedure is called "conditioning". [Dayan et al. 1997]

The part of this phenomenon that Dayan discusses and models in greater detail is the role and function of the midbrain dopamine neurons in predicting the reward. On the first trial the neurons will fire at the presentation of the fruit juice, but as the monkey slowly learns to press the lever in order to receive the reward, the neurons will gradually change their time of firing to match the presentation of the light, the cue [Dayan et al. 1997]. The exact procedure for the experiment was as follows: the sensory cue was presented to the monkey at time step 10 and the reward at time step 60. The two plots below show the prediction error and value function of the resulting model.

In Figure 1.1(a) one sees a spike, a strong positive response, in the first trial at time step 60, which represents the delivery of the reward.

(a) A plot of the prediction error. The spikes correspond to a positive and a negative reward signal, respectively. They arise due to surprises experienced by the monkey: the positive spike as it receives the fruit juice for the first time, the negative spike from the disappointment caused by a non-appearing reward in an intermediate trial.

(b) The value function V, a fundamental part of RL, begins to grow as the monkey starts to expect the reward. The absent juice can be detected here too, indicated by the small reduction at approximately trial 20.

Figure 1.1: The model presented in Science by [Dayan et al. 1997]. This model became a guide and a reference during the process of adapting Dayan's model to Green's experiment.

The spike is due to the fact that the monkey did not anticipate any reward at all. By repeating the pairing of the cue followed in time by the reward, the response of the model will shift to the time of the sensory cue, indicating that the monkey has learned to predict the reward [Dayan et al. 1997]. Figure 1.1(b) shows how the value function begins to grow.

1.4 Purpose

Studies of how the dopamine activity can be modeled with Temporal Difference when a dopamine blocker is administered to the animal in question, as well as of how 7-OH-DPAT influences place preference, have been conducted [Berg and Green 2004]. We have, however, not found any studies during our search of material which combine these two areas explicitly, i.e. no attempt to model the dopamine activity during the development of place preference with TD has been found. The purpose of this thesis is to examine whether the model that Dayan has developed is applicable to Green's research or not. Will it be possible to transform Dayan's model in such a way that it can be applied to Green's experiment and thereby model how the craving for amphetamine makes the rats prefer the gray side of the box?

I will only examine and model the conditioning phase of Green's experiment, and thereafter attempt to improve the resulting model by using alternative Reinforcement Learning methods. I will not consider how the rats react to the dopamine blocker, which Green does in his work. For a study of how the effects of 7-OH-DPAT can be modeled, please see MSc student Oya Özuymaz's thesis, Modeling of Dopamine Activity and the Effect of 7-OH-DPAT in the Mesolimbic system, NADA, KTH 2004.


Chapter 2

Theory

2.1 Reinforcement Learning

Reinforcement Learning is about learning how to behave in order to achieve a goal. The "thing" that learns is referred to as an agent. The agent learns by interacting with the environment, which consists of everything outside the agent, i.e. anything that the agent cannot influence is in its environment. At each time step of the learning procedure, t, the agent will receive some representation of the environment's state, s_t ∈ S, where S is the set of all possible states. Every step that the agent takes toward its goal is called an action, a_t ∈ A(s_t), where A(s_t) is the set of all actions available in that state. [Sutton and Barto 1998]

When we talk about the environment, what we actually refer to is a model of the environment. By a model of the environment, we mean anything that the agent can use to predict how the environment will respond to its actions. [Sutton and Barto 1998]

One of the most important environment models used in Reinforcement Learning is that which holds the Markov property. By this we mean that the next state only depends on the current state and action, and randomness. A task that satisfies the Markov property is called a Markov Decision Process, and essentially all Reinforcement Learning methods utilize the Markov property when solving a Reinforcement Learning task. [Sutton and Barto 1998]

In some learning tasks we are able to find a natural last time step, e.g. finding the way out of a maze. In these tasks, we call each round of the problem an episode, and tasks like this we therefore call episodic tasks. Tasks that lack this natural terminal state are called continuing tasks. [Sutton and Barto 1998]

In addition to the agent and its environment, Reinforcement Learning consists of three essential parts: the policy, the reward function and the value function.


A policy, denoted π, is a mapping from state s and action a to the probability π(s,a) of taking a when in s. The policy, in other words, determines how the agent behaves at a certain point, i.e. what actions it will take. [Sutton and Barto 1998]

The general, and only, goal of an agent is to maximize its total expected reward in the long run. The reward function represents the reward that a state is associated with. It defines which actions are good and bad choices for the agent. The reward function must, because of this, be unalterable by the agent. It is very important to define the reward in such a way that it indicates what the agent's goal is. The reward signal must tell the agent what to accomplish, not how. The total expected reward of a state, denoted R_t, can formally be expressed as the expected return of that state, i.e. the sum of the rewards, r_t, received by the agent when starting in state s at time step t and reaching its goal state at time step T:

R_t = Σ_{k=0}^{T} γ^k r_{t+k+1},   where γ is the discount rate, 0 ≤ γ ≤ 1

The value function defines what is good for the agent in the long run. The value of a state is the total expected reward received in the future, starting from that specific state. Since the future rewards depend on what actions the agent performs, the value function is defined with respect to the particular policy currently followed. We denote the value of a state V^π(s). One can estimate the value function of a state by averaging the actual returns that have followed the state. A fundamental property of value functions, used in both Reinforcement Learning and Dynamic Programming, is that they satisfy a particular recursive relationship, called the Bellman equation for V^π. It expresses the relationship between the value of a state and the values of its successor states. The difference between rewards and values is that rewards are given directly by the environment, whereas values are estimated and then re-estimated by the agent based on the observations it makes during its lifetime. A high reward does not imply that the state has a high value, and vice versa. [Sutton and Barto 1998]

Reinforcement Learning Feedback

One cannot simply tell a Reinforcement Learning system to try out different actions and expect it to learn without giving it any feedback. Without feedback, the system has no chance of knowing which actions are good and which are bad. It is only through feedback that the Reinforcement Learning agent learns. [Harmon and Harmon 1996]

The most important characteristic, which distinguishes Reinforcement Learning from other Machine Learning methods, is the type of feedback that is used. Reinforcement Learning uses evaluative feedback, which indicates how good the taken action was, not whether it was the best or worst one. Instructive feedback, which is used in supervised learning, only indicates which action is the best one to take [Sutton and Barto 1998]. One example of feedback to a Reinforcement Learning system is that when an action causes something bad to happen immediately, the system is informed that that action should not be chosen, and if all actions of a state turn out to lead to bad results, then that state should be avoided as well. [Harmon and Harmon 1996]

The purpose of all Reinforcement Learning methods is to specify how an agent should change its policy, based on its experience, in order to ultimately reach the goal. At every state, the agent is faced with the problem of deciding whether it should continue to follow the current policy or switch to a new, better policy. The first step of the process described above is to evaluate the current policy. Once the policy has been evaluated, the next step is to improve the policy with respect to the current value function. One concept that captures both policy evaluation and policy improvement is Generalized Policy Iteration, GPI. The idea of GPI is that the approximate policy and value function should interact in such a way that they both move towards their optimal values.

One important problem in Reinforcement Learning is the exploration-exploitation dilemma. It means that it is not always best for the agent to choose the action that has the highest value, even though the goal of the agent is to maximize the return. There may exist a path which has an even higher value, but which has not yet been explored by the agent. By acting greedily at all times, only exploiting the known paths, the agent becomes less likely to find the optimal solution to the problem it is trying to solve. It is of great importance to find a good balance between exploring and exploiting. [Sutton and Barto 1998]
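To make the return R_t defined earlier in this section concrete, the short Python sketch below computes a discounted return from a list of rewards (the function name and the example episode are illustrative and not taken from the thesis):

    def discounted_return(rewards, gamma):
        # R_t = sum_k gamma^k * r_{t+k+1}: rewards collected after time t,
        # each discounted by how far into the future it arrives.
        total = 0.0
        for k, r in enumerate(rewards):
            total += (gamma ** k) * r
        return total

    # Example: three steps without reward, then a reward of 1 at the goal state.
    print(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9))  # prints 0.729

A discount rate below 1 thus makes rewards that arrive later contribute less to the return, which is exactly what the formula above expresses.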

2.2 Different Reinforcement Methods

There are two different kinds of methods in RL: those that require a model of the environment, such as Dynamic Programming, DP, and heuristic search, and those that can do without a model, such as Monte Carlo, MC, and Temporal Difference, TD. The methods that require a model are thought of as planners, whereas the other kind are called learners. These methods do, however, share a great many properties as well; the most essential one is that they both compute value functions. Further, all the methods are based on looking ahead at future events, computing a back-up value, and then using this value to update the approximation of the value function.


Dynamic Programming is a collection of algorithms used for solving optimization problems when the environment of the problem is a Markov Decision Process. Dynamic Programming is not very useful in Reinforcement Learning problems in practice, but in theory it is quite useful. The key idea of both Dynamic Programming and Reinforcement Learning is to use a value function in the search for good policies of how one should act. Dynamic Programming algorithms are obtained by turning the Bellman equation into update rules for improving the current approximation of the value function. [Sutton and Barto 1998]

Monte Carlo methods are ways of solving Reinforcement Learning problems without complete knowledge of the environment. Instead they learn from experience, from sample transitions generated from the model. Consequently, in order to solve the problem, Monte Carlo methods do not need the environment's entire probability distribution over all possible transitions, as Dynamic Programming would. Monte Carlo methods are, on the other hand, only defined for episodic tasks. [Sutton and Barto 1998]

2.2.1 Temporal Difference

Temporal Difference, TD, is a central idea of RL and it combines ideas of both MC and DP. The basic idea of TD is to continually update the estimate V(s_t) by comparing what we believe to be a correct value of s_t with what actually happens in the next step, s_{t+1}. The error of our estimate is called the TD-error,

δ(t) = r_{t+1} + γV(s_{t+1}) − V(s_t),

and it is used both to improve the estimate of V(s_t) and to choose the appropriate action henceforth. The re-estimation of V(s_t) is accomplished through the update rule below, known as TD(0):

V(s_t) ← V(s_t) + α[r_{t+1} + γV(s_{t+1}) − V(s_t)]

Like DP, TD bases its updates of V on existing estimates. This method of making a guess based on other guesses is called bootstrapping. The resemblance to MC is that TD, too, can learn from experience without a model of the environment. From the equation above one sees how TD combines the sampling of MC, by using the estimate V(s_t) since V^π(s) is not known, with the bootstrapping of DP. The incorporation of bootstrapping in the updating of V(s_t) gives TD a great advantage in processing time compared to MC. While MC methods have to wait until the return of a visit to s_t is known and then use that return to update V(s_t), TD only has to wait a single time step in order to perform its update.
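A minimal Python sketch of this TD(0) update for a tabular value function might look as follows (the states, reward and parameter values are illustrative only, not part of the thesis implementation):

    def td0_update(V, s, s_next, r_next, alpha, gamma):
        # TD-error: delta = r_{t+1} + gamma*V(s_{t+1}) - V(s_t)
        delta = r_next + gamma * V[s_next] - V[s]
        # Move V(s_t) a fraction alpha of the way toward the bootstrapped target.
        V[s] += alpha * delta
        return delta

    # Example: two states; a reward of 1 is received on the step from state 0 to 1.
    V = {0: 0.0, 1: 0.0}
    td0_update(V, s=0, s_next=1, r_next=1.0, alpha=0.1, gamma=0.9)
    print(V)  # {0: 0.1, 1: 0.0}

Note that the update can be made as soon as the next state and reward are observed, which is the one-step advantage over MC described above.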


Eligibility Trace

Eligibility traces are one of the basic mechanisms of RL. Almost all TD-methods can be combined with ET to obtain a more general method, which may learn more efficiently. Adding ET to TD-methods gives us methods which can be viewed as lying in between MC and TD. Another view is that the ET is a temporary record of the occurrence of an event, such as visiting a state or taking an action. The trace marks the current parameters as eligible for undergoing learning changes. When a TD-error then occurs, only these states or actions will be blamed or credited for the error. This view is called the backward view, whereas the previous one is called the forward view. The n-step TD-methods are examples of methods which lie in between MC and TD. These methods use backups over n steps, compared to the one-step backup of TD and the full backup of MC. The return, R_t^{(n)}, is truncated after n steps, and the value of the remaining states is replaced by an estimate. The n-step backup is defined by [Sutton and Barto 1998]

ΔV_t(s_t) = α[R_t^{(n)} − V_t(s_t)]

2.3 The Biological view of Reinforcement Learning

Reinforcement Learning can also be viewed as the study of how animals learn about events in their environment that result in rewards or punishments received by the animal. The information about the event will be used by the animal for future action selection in similar situations. Since animals are not told what to do in certain situations, they must work out their actions by themselves; in this sense we can say that RL is not supervised by any external source [Dayan 1999].

A common task that computerized RL-methods are designed to solve is finding the right way through a maze. This task is also widely used in experiments with animals, in most cases a rat, which has to learn through experience which ways are bad, i.e. result in a punishment, and which ones are correct and thereby give a reward. This means that the rat has to explore new ways and exploit the ways it already knows from earlier experience to be good. [Dayan 1999]

2.3.1 Conditioning

The ability of animals to learn the relationship between stimuli and actions and rewards/punishments, and to choose a strategy (policy) to follow accordingly, is the focus of the field of behavioral psychology. The field is usually divided into classical conditioning and instrumental conditioning [Dayan 1999].

In classical conditioning we have an unconditional stimulus (US), which elicits a response called the unconditional response (UR). This response is the natural way for the animal to behave when it encounters the US. The other stimulus in classical conditioning is the conditional stimulus (CS), which after a number of encounters will cause a certain behavior in response to the CS, called the conditioned response. We say that classical conditioning occurs when a neutral (new) stimulus is directly followed by a stimulus that automatically elicits a response. Through this, the animal will learn to respond under new conditions [Carlson 1985]. This can be interpreted as an expression of the predictions animals make of the delivery of reward [Dayan 1999].

Instrumental conditioning occurs when the consequences of a response to a stimulus lead to an increased likelihood of responding the same way again in that particular situation. This stimulus will be referred to as a reinforcing stimulus (RS). In order for a response to a stimulus to be reinforced by its consequences it must first occur. Reinforcement is therefore an effect of the reinforcing stimulus [Carlson 1985]. Instrumental conditioning can consequently be interpreted as the animals' way of learning what actions to take, i.e. adapting their behavior [Dayan 1999].

2.3.2 Prediction and Reward

A prediction of an event provides information about the event before it actually occurs [Schultz 1997]. The ability to predict gives animals the advantage of being able to decide between alternative courses of action. Choosing the correct action at a certain point is crucial, since the right action may lead to food, water etc., whereas a bad action may result in an injury or loss of resources. Higher animals can predict many different aspects of their environment, among others the probable time and size of future rewards [Dayan et al. 1997].

One connection between reward and prediction has been established through a variety of conditioning experiments. Before the experiments, the animal has no reward associated with the stimulus; during the experiment the stimulus is followed by a rewarding object every time it is presented. This repeated pairing leads to the animal starting to predict the time and size of the reward when it is presented with this certain stimulus. [Dayan et al. 1997]

Another connection between prediction and reward is the term prediction error. The prediction error for a reward is the difference between the amount of reward that is delivered and the amount that is expected. [Dayan 1999]
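As a small worked example (the numbers are purely illustrative, not taken from any experiment): if an animal expects 0.5 units of reward but receives 1.0, the prediction error is

δ = r_delivered − r_expected = 1.0 − 0.5 = 0.5,

a positive error; had the reward been withheld instead, the error would have been 0 − 0.5 = −0.5, a negative one, corresponding to a disappointment.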


2.4 Dopamine

Dopamine belongs to a family of compounds called monoamines, which are produced by different systems in the brain, mostly in the brain stem, from where they are distributed throughout the brain. Dopamine is one of the more interesting neurotransmitters since it is implicated in several important functions, including the reinforcing effect of drugs. [Carlson 2001]

2.4.1 Drugs and their effect on dopamine

It has been observed that major drugs of abuse exert their addictive properties through dopamine mechanisms in the nucleus accumbens and frontal cortex. Addictive drugs such as cocaine, amphetamine, opiates and nicotine can thus be seen as positive reinforcers. It is easy to get animals to learn to give themselves such drugs. The ability of a drug to act as a positive reinforcer that sustains behavior in experimental animals is highly correlated with the abuse potential of the drug for humans. [Kandel et al. 2000, Schultz 1997, Wise 1996]

Drugs that are both psychoactive and reinforcing also increase the levels of dopamine at the terminals of the projections of the ventral tegmental area, VTA. Cocaine and amphetamine accomplish this by blocking the dopamine transporter, thereby prolonging the time dopamine remains in the synaptic cleft. One possible way of controlling the addiction to cocaine and amphetamine could be to develop drugs that have the dopamine transporters as their targets. [Kandel et al. 2000]

The craving for drugs is one of the most difficult problems that a therapist will face when treating drug addicts. It has been suggested that a conditioning process causes the craving for drugs, where repeated pairing of environmental cues with drug effects sensitizes the reactions of the addict to the environmental cues. Dopaminergic activity in the mesolimbic pathway is thought to underlie this sensitization and to imbue the environmental cues with incentive salience. The mesolimbic system reacts to drugs of abuse by releasing dopamine at its terminals in response to the most rewarding substance, thus producing increased and decreased reward associated with dopaminergic administration and blockade. [Berg and Green 2004]

A common way of studying the mechanisms that underlie craving is to utilize the Conditioned Place Preference method, CPP. In this method the rewarding effects of a drug are paired with a specific environment and the resulting craving is then measured by monitoring the amount of time spent in the drug-associated environment, compared to one or more other environments. One promising strategy for treating drug addicts is to reduce their cravings by administering an agent that reduces the activity of dopamine in the mesolimbic system.

One such agent is 7-OH-DPAT, a dopamine agonist that is selective for D3-receptors, which are located on neurons in the VTA, the substantia nigra and in terminal zones of the mesolimbic system. [Berg and Green 2004]

2.4.2 The role of dopamine in reward

There is substantial evidence that dopamine is involved in reward learning. Some of this evidence comes from drugs and the way they make us addicted to them. Drugs such as cocaine and amphetamines act, as described above, partly by prolonging the life of the dopamine that is released into the nucleus accumbens, among other parts of the brain. [Dayan 1999, Dayan and Kakade 2002]

Another source of evidence for the role of dopamine is self-stimulation experiments. By attaching electrodes to various areas of the brain and delivering a small current to these areas when the animal presses a lever, it will easily learn to press the lever over and over again. Since rats prefer this stimulus to food and sex, it must be a rewarding stimulus. When a receptor blocker such as the antipsychotic drug haloperidol is given, it will reduce the rewarding effects of food and intracranial self-stimulation. This supports the hypothesis that dopamine has a general role in the reinforcement mechanism in the limbic areas, a hypothesis based on the fact that dopamine is released when the current hits the brain. [Kandel et al. 2000, Dayan 1999, Dayan et al. 1997]

Dopamine neurons do not, however, simply report the occurrence of appetitive events. Rather, their output appears to code for the deviation or error between the actual reward received and the predictions of the time and magnitude of reward. The neurons will emit a positive signal (a positive error) if an appetitive event is better than predicted, no signal if an appetitive event occurs as predicted (no error), and a negative signal if an appetitive event is worse than predicted (a negative error). A dopamine neuron can in other words be thought of as a feature detector for how good an event is, relative to prediction. The dopamine response in this way provides the information needed to implement a simple behavioral strategy (policy): take actions correlated with increased dopamine activity and avoid actions correlated with decreased dopamine activity. This interpretation provides a link to an established body of computational theory. [Schultz 1997, Dayan et al. 1997, Dayan and Kakade 2002]

2.5 Modeling Dopamine with Temporal Difference

The TD-algorithm has been found to be particularly well suited to understanding the functional role played by the dopamine signal, in terms of the information it constructs and broadcasts.

The technique has been used in a wide spectrum of engineering applications that seek to solve prediction problems analogous to those faced by living creatures, ever since Richard Sutton and Andrew Barto introduced it into the psychological and biological literature in the early 1980s. [Dayan et al. 1997]

In order for a neural system to construct and use a prediction error signal similar to the TD prediction error, it would need to possess the following features:

i. access to a measure of reward value r(t)
ii. a signal measuring the temporal derivative of the ongoing prediction of reward, γV(s_{t+1}) − V(s_t)
iii. a site where these signals could be summed
iv. delivery of the error signal to areas that construct the prediction, in such a way that the signal can control plasticity

It has been suggested that the midbrain dopamine neurons satisfy features one through three. This supports the hypothesis that the input to the dopamine neurons arrives in the form of a surprise signal that measures the degree to which the current sensory state differs from the last sensory state [Dayan et al. 1997].

When one wants to model the dopamine function computationally, one must start by identifying the two main assumptions of TD from the biological point of view. This is accomplished on the one hand by allowing the sensory cue to predict the sum of future rewards, and on the other hand by fulfilling the Markov property through the fact that the presentation of future sensory cues depends only on the current cues. [McClure et al. 2003, Dayan et al. 1997]

One way to model this is to use one vector to describe the presence of sensory cues, x(t), and one vector to keep track of the adaptable weights, w(t), which we use to estimate the true V^π(s) [Dayan et al. 1997]. However, not only the cues themselves need to be taken into consideration, but also time and its effect on the cues. There are experimental data [Dayan et al. 1997] which show that a sensory cue can predict reward delivery at arbitrary times into the near future. We therefore need to represent every cue as a vector x(t) = {x_1(t), x_2(t), ..., x_n(t)} of signals that represent the cue at different time steps. The net prediction for cue x(t) at time t is given by

V(s_t) = Σ_i w_i x_i(t)

The adaptable weights w are improved according to the correlation between the stimulus representation and the prediction error, which gives us

Δw_i = α Σ_t x_i(t) δ(t)

It has been shown that this update rule will converge to V*, the optimal value function, under certain conditions. [Dayan et al. 1997, Sutton and Barto 1998]
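As a sketch of how the cue representation and the two equations above might be realized in code (the trial length, the cue time and all names below are illustrative assumptions, not the thesis implementation):

    import numpy as np

    n_time, cue_time = 35, 15            # length of a trial and time of the cue (assumed)

    # Tapped-delay-line representation: x[t, i] = 1 exactly i steps after the cue.
    x = np.zeros((n_time, n_time))
    for t in range(cue_time, n_time):
        x[t, t - cue_time] = 1.0

    w = np.zeros(n_time)                 # adaptable prediction weights, initially 0

    def value(t):
        # V(s_t) = sum_i w_i * x_i(t)
        return float(w @ x[t])

    def weight_change(deltas, alpha=0.1):
        # dw_i = alpha * sum_t x_i(t) * delta(t), accumulated over one trial;
        # deltas is the vector of prediction errors delta(t) for that trial.
        return alpha * (x.T @ deltas)

    # After computing deltas for a trial:  w = w + weight_change(deltas)

Each column of x thus carries the cue forward in time, which is what allows a single cue to predict a reward delivered at an arbitrary later time step.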


Chapter 3

Methods

The two most important parameters in Dayan's model are the sensory cue and the reward. Each sensory cue, or stimulus, is represented by a large number of signals, x_i(t). Each of these signals symbolizes the presence of the cue in state s at exactly one specific time step in the future; that is, x_i(t) will be 1 exactly i time steps after the presentation of the cue and 0 at all other time steps. [Dayan et al. 1997, Suri 2002]

We represent the time points as a matrix where each row represents a state of the experiment and each column represents the cue at a different time in the future. Each cue element, x_i(t), also has a prediction weight associated with it. By improving these weights according to the correlation between the stimulus and the prediction error, we get the model to learn to predict the reward. The weights are all initially set to 0. The presence of the reward r(t) is represented by a scalar, which is 1 when a reward is present and 0 otherwise. [Dayan et al. 1997]

The modeling was done with TD(0). Every trial consists of a number of states, which in this case are represented by time steps. In each trial, every state is visited once, calculating the prediction error and the change in weights. Finally, the new value function is calculated. This gives us the following algorithm:

for all trials do
    for each time step t do
        δ(t) = r(t) + γ V(s_{t+1}) − V(s_t)
        w = w + α δ(t) x(t)
    end for
end for

where α is the learning rate and γ is the discount rate. The values of these parameters can be varied in order to adjust the model, within the ranges 0 < α ≤ 1 and 0 ≤ γ ≤ 1. The value of the learning rate reflects how "fast" the agent will learn; a high value leads to a fast-learning agent. In this case, a large α value results in a greater change in w, as can be seen above. The value of γ describes how heavily the model takes future rewards into consideration. A high value means that future rewards are regarded as important. An appropriate value of γ is in the range 0.9–0.99. [Suri 2002, Ekeberg]

Finally we calculate the new value function for all states:

V(s_t) = Σ_i w_i x_i(t)
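A compact Python sketch of this training loop, under similar assumptions (the trial length, the placement of the scalar reward and the simple one-signal-per-time-step representation below are illustrative choices, not the exact setup of the thesis code), might be:

    import numpy as np

    n_time, n_trials = 35, 100        # time steps per trial and number of trials
    alpha, gamma = 0.15, 0.99         # the Results chapter reports these values for the scalar-reward run

    x = np.eye(n_time)                # x[t]: one signal per time step (a simple choice of representation)
    r = np.zeros(n_time)
    r[15] = 1.0                       # scalar reward, here placed at t = 15 as in the first model discussed later
    w = np.zeros(n_time)              # prediction weights, initially 0

    for trial in range(n_trials):
        V = x @ w                     # V(s_t) = sum_i w_i x_i(t), recomputed once per trial
        for t in range(n_time - 1):
            delta = r[t] + gamma * V[t + 1] - V[t]   # prediction error delta(t)
            w = w + alpha * delta * x[t]             # weight update

    print(x @ w)                      # the learned value function

With a cue representation such as the one sketched in Section 2.5 in place of the identity matrix, the same loop produces the cue-based model described there.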

3.1 The reward-function

As mentioned above, only four trials are needed for the rats to develop a preference for the gray side of the box. In order to create a model that resembles this reality better, a linear curve was suggested as a good way to represent the increasing influence of amphetamine. As Green appeared to be of the opinion that this was a good approximation of how amphetamine affects the rats, the scalar reward was replaced with a linear function peaking at time step 30, thus giving a model with a continuous reward. Green, not being an expert on the properties of amphetamine, recommended that we talk to Professor Alexander L. Beckman at the Psychology Department in order to get a more exact estimate of the amphetamine function. When consulted, Beckman gave his view of how the amphetamine curve ought to be approximated by sketching a function. We drew the conclusion that his drawing was best interpreted as a sine curve reaching its top at time step 30, i.e. sin(tπ/60).
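The two reward shapes can be written down directly; a small sketch (the time-step units, the peak value of 1 and the restriction to t = 0..30 are assumptions made here for illustration):

    import numpy as np

    t = np.arange(0, 31)                 # time steps 0..30; the peak is at t = 30

    r_linear = t / 30.0                  # linear increase toward the peak at time step 30
    r_sine = np.sin(t * np.pi / 60.0)    # sin(t*pi/60), which also reaches its maximum at t = 30

    print(r_linear[30], r_sine[30])      # both equal 1.0 at the peak

The sine curve rises more steeply at first and flattens near the peak, which is the main qualitative difference between the two shapes.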

3.2 Improving the model

The TD-method used to improve the model was n-step TD. As mentioned above, it takes more than one, more precisely n, future states into consideration when calculating the new value function. TD(0) looks one step ahead, and can therefore be thought of as 1-step TD. In this experiment n was set to five, i.e. the algorithm will look five steps ahead [Sutton and Barto 1998].

The difference between TD(0) and n-step TD (n ≥ 2) is the reward term used in the update rule. The information of interest for the algorithm when it looks ahead is the reward anticipated in those future states. In order to utilize this information to update the value function of the current state, an n-step return is calculated from these rewards:

R_t^{(n)} = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^{n−1} r_{t+n−1} + γ^n V_t(s_{t+n})

The second approach tried out was the Monte Carlo, MC, technique, to see if it might be a useful alternative. Since MC takes all future states into consideration, it might work out satisfactorily in this case, where the model only has a few trials for adapting the value function. The reason for choosing TD(5-step) as one of the methods for improving the model was simply that it can be thought of as an intermediate version of TD(0) and Monte Carlo.
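A sketch of the two return calculations in Python (the indexing follows the convention used above, where r[t] is the reward observed at time step t, and it is assumed that t + n stays within the trial):

    def n_step_return(r, V, t, n, gamma):
        # R_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n})
        R = sum((gamma ** k) * r[t + k] for k in range(n))
        return R + (gamma ** n) * V[t + n]       # bootstrap on the value n steps ahead

    def mc_return(r, t, gamma):
        # Monte Carlo return: all discounted rewards until the end of the trial, no bootstrapping.
        return sum((gamma ** k) * rk for k, rk in enumerate(r[t:]))

The n-step return thus uses the actual rewards for the next n steps and the current value estimate for everything beyond, whereas the MC return uses only actual rewards.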


Chapter 4

Results

The first modeling attempt was to emulate Dayan's model as closely as possible. The resulting model did not learn well at all, even though a very high value of the α parameter was used. This is because only four trials are available. When we ran the model on 100 trials with an α value of 0.15, we got a model that learned well; γ was set to 0.99.

4.1 A continuous reward-function

Running the model with a continuous reward function, compared to the scalar reward used in the attempt above, and on only four trials worked out well. Both the linear and the sine curve gave rise to a model that learns to predict the reward in only four trials, indicated by the spike at time step 15, i.e. the state where the rats enter the gray box. The two obtained models can be seen in Figures 4.1 and 4.2 respectively.

t      linear reward function    sine reward function
30     0.2012                    0.1947
31     0.2077                    0.1939
32     0.2142                    0.1926
33     0.2137                    0.1853
34     0.1733                    0.1461

Table 4.1: The prediction error for the last five time steps when the model is run with the linear reward function and with the sine-shaped reward function, respectively. The model with the sine function is able to learn better.


[Surface plot: "Dopamine activity in the mesolimbic system"; axes: Time (0–35), Trial (0–4), Prediction error.]

Figure 4.1: The prediction error received when a linear reward was used. The sharp spike at time step 15 coincides with the time of the placement of the rats in the gray compartment. The drop, which starts to emerge in trial 2, indicates that the effect of the drugs is predicted by the model, i.e. paired with the gray environment. The model learned better than when a scalar reward value was used, but not completely satisfactorily.

[Surface plot: "Dopamine activity in the mesolimbic system"; axes: Time (0–35), Trial (0–4), Prediction error.]

Figure 4.2: The prediction error received when the sinus function was used. This model learned better than the one seen in Figure 4.1.


t      TD(0)      TD(5-step)
30     0.1947     0.0921
31     0.1939     0.0919
32     0.1926     0.0909
33     0.1853     0.0884
34     0.1461     0.0818

Table 4.2: The prediction error for the last five time steps received with TD(0) and TD(5-step), a clear improvement of the model.

[Surface plot: "Dopamine activity in the mesolimbic system"; axes: Time (0–35), Trial (0–4), Prediction error.]

Figure 4.3: The prediction error received when TD(5-step) was used, a further improvement of the model.

4.2 Improvements

The model received when TD(5-step) was used differs from the model produced with TD(0) in two ways. Firstly, the prediction error in the first trial rises considerably higher than when TD(0) is used, as seen in Figure 4.3, and secondly, the prediction error received in the fourth trial is closer to zero.

When run with only four trials, where each represents one session for the rats in the gray compartment, the MC-model did not do as well as TD(5-step) did. In order to examine the possible utilization of MC more closely, the model was repeatedly run with one more trial added to it. After five further rounds, resulting in a total of nine trials, the model did as well as TD(5-step).

A different approach to further adapt the MC-model was also tried out. It meant that the learning parameter α was adjusted in order to get a better result. By raising the value of α to 0.8, the model managed to learn in only four trials.


Chapter 5

Discussion

The first model developed had only a scalar reward at t = 15 and it did not learn until 100 trials were carried out. This makes it more of a transitional step towards the final model, together with another model produced to test the effects of a non-occurring reward in an intermediate trial. According to Dayan, the non-occurring reward will result in reduced firing among the dopamine neurons due to the reduced prediction error that occurs [Dayan et al. 1997]. This result, combined with the resemblance between Dayan's model and our first model, convinced us that we had managed to apply the model Dayan presented in his article to Green's work.

Looking at the performance of the two models with a continuous reward function, one notices that they both learn to predict the effect of amphetamine in only four trials. This strong improvement is due entirely to the new reward function, adapted to resemble how amphetamine influences the rats. Figure 4.1 and Figure 4.2 show the respective models, which together with Table 4.1 make it clear that the sine curve gives rise to a model that learns better than a model with a linear reward function. This conclusion is based on the observation that the prediction error for trial 4 in the model with the sine-based function is smaller, i.e. the model has learned to expect the positive effect of the drugs to occur once it enters the gray compartment.

5.1 The TD(n-step) and Monte Carlo Techniques

The TD(5-step) model differs from the model with TD(0) in two ways. Firstly, the prediction error in the first trial rises considerably higher than when TD(0) is used, and secondly, the prediction error received in the fourth trial is closer to zero, which can be seen in Figure 4.3 and Table 4.2. Once again, this is reflected in the smaller values of the prediction error.

The second approach tried out was the Monte Carlo technique, to see if it might be a useful alternative. Since MC takes all future states into consideration, it might work out fine in this case, where the model only has a few trials for adapting the value function. When run with only four trials, the MC-model did not do as well as TD(5-step) did. In order to examine the possible utilization of MC more closely, the model was repeatedly run with one more trial added to it. After five further rounds, resulting in a total of nine trials, the model performed as well as TD(5-step). A different approach to further adapt the MC-model was also tried out. It meant that the learning parameter α was adjusted in order to get a better result. By raising the value of α to 0.8, the model managed to learn in only four trials.

It was not a surprise that TD(5-step) performed better than TD(0), since the reward of TD(5-step) is based on a larger amount of information about future events, which ought to make it easier to learn correctly. What did surprise us was the fact that the MC-model did not manage to imitate the dopamine activity in the rats as well as TD(5-step). Since the MC technique looks all the way to the final state to calculate its return, one might expect it to mimic the dopamine process even better than TD(5-step). As seen above, the MC technique did not improve the original model as much as TD(5-step) did, and must therefore be considered a less preferable alternative for improving the performance of the model.

One improvement that was not an option in this case is the use of Eligibility Traces, ET. Since the model in question lacks the possibility of different action selections at each state, it is not the typical representation seen in RL situations, consequently eliminating ET as a way of enhancing the model's ability to estimate the correct value function.

5.2 The Future

Even though the resulting model does not possess all the characteristics that a TD-model usually does, our model does seem to be able to learn how the rats develop a conditioned place preference for the gray compartment. It would therefore be of great interest to develop a model of a similar experiment and then compare it to the outcome of the experiment, rather than, as done in this thesis, using the experiment as a reference. The ultimate future use of these kinds of models would of course be as a substitute for laboratory animals. It would not only mean fewer animals used for experiments, but also the possibility of conducting many more tests, since our model needs only seconds to simulate how the rats behave during four days.


Chapter 6

Conclusions

By decomposing and analyzing the parts of the existing Temporal Difference model, and finally adjusting the parameters to fit the conditions of Professor Green's experiment, a new model was constructed. When running this model with the data received from Green, we obtained a model that was able to learn how the dopamine activity in the brain changes as a result of the administered amphetamine. Once equipped with a functioning model, improvements were made by the utilization of different TD methods, the TD(n-step) and Monte Carlo techniques. The results showed that TD(n-step) did produce an improved model, whereas Monte Carlo did not.


Bibliography

K.N. Berg and K.F. Green. Effects of 7-OH-DPAT on place preference in rats. Manuscript, 2004.

N. R. Carlson. Physiology of Behavior, pages 506–540. Allyn and Bacon, 3rd edition, 1985. ISBN 0-205-08501-6.

N. R. Carlson. Physiology of Behavior, pages 113–118. Allyn and Bacon, 7th edition, 2001. ISBN 0-205-32407-X.

P. Dayan. Theoretical Neuroscience. Draft, March 18, 1999.

P. Dayan and S. Kakade. Dopamine: generalization and bonuses. Neural Networks, (15):549–559, 2002. ISSN 0893-6080.

P. Dayan, W. Schultz, and P.R. Montague. A neural substrate of prediction and reward. Science, (275):1593–1599, 1997. ISSN 1095-9203.

Ö. Ekeberg. Course material for ANN, given in the course on ANN, spring 2003, KTH, Sweden.

M.E. Harmon and S.S. Harmon. Reinforcement learning: A tutorial. http://www.intellektik.informatik.tu-darmstadt.de/klausvpp/GAILS/node20.html, 1996.

E. R. Kandel, J. H. Schwartz, and T. M. Jessell. Principles of Neural Science, pages 1008–1012. McGraw-Hill, 4th edition, 2000. ISBN 0-8385-7701-6.

S.M. McClure, N.D. Daw, and P.R. Montague. A computational substrate for incentive salience. TRENDS in Neurosciences, 26(8):423–428, 2003. ISSN 0166-2236.

W. Schultz. Dopamine neurons and their role in reward mechanisms. Current Opinion in Neurobiology, (7):191–197, 1997. ISSN 0959-4388.

R. E. Suri. A biologically-inspired concept for active image recognition. www.intopsys.com/technical/ImageRecogniation.pdf, 2002.


R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998. ISBN 0-262-19398-1.

R. A. Wise. Neurobiology of addiction. Current Opinion in Neurobiology, (6):243–251, 1996. ISSN 0959-4388.

