HELSINKI UNIVERSITY OF TECHNOLOGY Department of Engineering Physics and Mathematics

Kimmo Berg

A Model of Monopoly Pricing under Incomplete Information

Master’s thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Technology Espoo, 21.09.2004

Supervisor: Professor Harri Ehtamo Instructor: Professor Harri Ehtamo

Helsinki University of Technology
Department of Engineering Physics and Mathematics

Abstract of Master's Thesis

Author: Kimmo Berg
Department: Department of Engineering Physics and Mathematics
Major subject: Systems and Operations Research
Minor subject: Logistics
Title: A Model of Monopoly Pricing under Incomplete Information
Title in Finnish: Epätäydellinen informaatio monopolistisessa hinnoittelussa
Chair: Mat-2 Applied Mathematics
Supervisor: Professor Harri Ehtamo
Instructor: Professor Harri Ehtamo

Abstract:

Nash equilibrium defines a solution to strategic interactions among rational players. Deductive equilibrium selection studies players who have all the relevant information in the game and who reason out the outcome of the game. But how do the players come to possess the relevant information, and how do they select among possibly many equilibria? One solution to incomplete information is given by Bayesian Nash equilibrium, where the players do not exactly know the opponents' preferences, but the preferences are chosen probabilistically among some alternatives. In contrast, inductive equilibrium selection interprets equilibrium as the result of a dynamic adaptive process; indeed, this is one way to motivate the traditional equilibrium theory. We introduce the celebrated learning models: reinforcement, belief-based, and evolutionary learning. We also develop an adjustment process that extracts information and results in Bayesian Nash equilibrium.

This thesis studies incomplete information in a situation where a monopoly sells a good to a population of buyers. The seller discriminates the buyers by offering a tariff, which specifies a price for each amount of good to be sold. The seller's task is to design the optimal tariff that maximizes her expected profit in such a way that the buyers can choose the amounts they wish to buy. Traditionally, the seller knows the buyers' preferences, and the difference between giving each buyer an individual offer and serving all the buyers with the same tariff is examined; the incomplete information is then the inability to distinguish the buyers. In contrast, we interpret incomplete information as the seller's ignorance: the seller is simply unaware of the buyers' preferences. We suggest that the seller learns the buyers' preferences by selling the good repeatedly. The seller can evaluate the optimality conditions with the extracted information and adjust the prices so that eventually the conditions are met. The adjustment process can be seen as an iterative solution method under limited information.

Number of pages: 99

Approved (filled in by the department):

Keywords: monopoly pricing, incomplete information, learning
Library code:


Teknillinen korkeakoulu
Teknillisen fysiikan ja matematiikan osasto

Diplomityön tiivistelmä

Tekijä: Kimmo Berg
Osasto: Teknillisen fysiikan ja matematiikan osasto
Pääaine: Systeemi- ja operaatiotutkimus
Sivuaine: Kuljetus- ja materiaalitalous
Työn nimi: Epätäydellinen informaatio monopolistisessa hinnoittelussa
Title in English: A Model of Monopoly Pricing under Incomplete Information
Professuurin koodi ja nimi: Mat-2 Sovellettu matematiikka
Työn valvoja: Professori Harri Ehtamo
Työn ohjaaja: Professori Harri Ehtamo

Tiivistelmä:

Nashin tasapaino määrittää ratkaisun usean rationaalisen toimijan strategisiin vuorovaikutustilanteisiin. Deduktiivinen tasapainoteoria tutkii pelaajia, joilla on käytössä kaikki pelin kannalta tarvittava informaatio ja jotka päättelemällä selvittävät pelin lopputuloksen. Mutta kuinka ajaudutaan tilanteeseen, jossa pelaajat tietävät kaiken informaation, ja kuinka he valitsevat mahdollisesti useasta tasapainosta? Bayesilainen Nashin tasapaino tarjoaa erään ratkaisun epätäydelliselle informaatiolle, missä pelaajat eivät tarkalleen tiedä vastustajiensa ominaisuuksia, mutta nämä ominaisuudet valitaan useasta vaihtoehdosta yhteisesti kaikkien tunteman todennäköisyysjakauman mukaisesti. Toisaalta induktiivinen tasapainoteoria tulkitsee tasapainon dynaamisen, adaptiivisen prosessin lopputuloksena. Tämä onkin eräs tapa motivoida perinteinen tasapainoteoria. Tässä työssä esitellään tunnetut oppimismallit: vahvistus-, uskomus- ja evolutiivinen oppiminen. Työssä kehitellään lisäksi uusi mukautumisprosessi, jossa pelaajat oppivat toistensa ominaisuuksista ja joka johtaa bayesilaiseen Nashin tasapainoon.

Työssä tutkitaan epätäydellistä informaatiota tilanteessa, jossa monopoli myy hyödykettä asiakasjoukolle. Monopoli tarjoaa asiakkailleen hinnaston, joka määrittää hinnan eri määrille hyödykettä. Monopolin optimointiongelma on suunnitella sellainen hinnasto, joka maksimoi tuoton odotusarvon, kun ostajat päättävät itse ostamastaan hyödykkeen määrästä. Tavallisesti oletetaan, että monopoli tietää ostajien ominaisuudet, ja tutkitaan kuinka monopolin tuotto muuttuu, kun jokaiselle asiakkaalle ei voida antaa omaa tarjousta, vaan hinnan tulee olla sama kaikille asiakkaille. Tämä hinta voidaan tulkita epätäydelliseksi informaatioksi, missä monopoli ei pysty erottamaan erilaisia asiakkaita. Työssä ehdotetaan, että monopoli oppii asiakkaittensa ominaisuudet myymällä hyödykettä toistetusti. Monopoli tarkistaa optimaalisuusehdot oppimallaan informaatiolla ja muokkaa hintoja siten, että optimaalisuusehdot toteutuvat. Mukautumisprosessi voidaan tulkita iteratiivisena ratkaisumenetelmänä vähäisen informaation vallitessa.

Sivumäärä: 99

Avainsanat: monopolin hinnoittelu, epätäydellinen informaatio, oppiminen

Täytetään osastolla
Hyväksytty:
Kirjasto:


Acknowledgements

First of all, I would like to thank the supervisor and the instructor of the thesis, Professor Harri Ehtamo. With his encouragement and support, I got the opportunity to work at the Systems Analysis Laboratory at Helsinki University of Technology for the summer of 2003. This encouraged me to study the subject further. I would also like to thank my co-worker Mitri Kitti for invaluable help, teamwork, and provocative ideas. I wish to thank the other workers in the laboratory and my friends who helped me finish the thesis. Finally, I would like to express my gratitude to my brother, Mikko. He inspired me to take a broader view of the subject. Sometimes taking a different perspective and getting to know what is done in other fields of science helps to organize the thesis in a more meaningful way.

Otaniemi, September 2004

Kimmo Berg


Contents

Abstract  ii
Abstract in Finnish  iii
Acknowledgements  iv
Introduction  1
1 Solution Concepts and Early Learning  6
  1.1 Nash Equilibrium, Best Response, and Pareto Solutions  8
  1.2 Cournot Duopoly, the First Equilibrium  9
  1.3 Adjustment Process  10
2 Bayesian Games and a Buyer-Seller Setting  12
  2.1 Bayesian Game  13
  2.2 Games of Incomplete Information  15
  2.3 Buyer-Seller Game  16
  2.4 Mechanism Design  19
  2.5 Interpretations of Different Buyer-Seller Settings  21
3 Buyer-Seller Game  23
  3.1 Model  23
  3.2 Formulation of the Necessary Conditions  25
  3.3 Interpretation of the Necessary Conditions  28
4 Learning  30
  4.1 Reinforcement Learning  31
    4.1.1 History  32
    4.1.2 Modern Reinforcement Learning  33
    4.1.3 Subjective Assessments  36
  4.2 Belief-Based Learning  37
    4.2.1 History  37
    4.2.2 Stochastic Fictitious Play  40
    4.2.3 Interpretation of Fictitious Play  41
    4.2.4 Bayesian Learning  42
  4.3 Individual Learning  44
    4.3.1 Experience-Weighted Attraction Learning  45
    4.3.2 Rule Learning  47
    4.3.3 Summary  48
  4.4 Evolutionary Learning  51
    4.4.1 History  52
    4.4.2 Other Evolutionary Dynamics  54
    4.4.3 Social Learning  55
  4.5 Refinements of Equilibrium Concepts  58
5 Adjustment Process  62
  5.1 Introduction  63
  5.2 Highest Type's Algorithm  64
  5.3 Main Algorithm  66
  5.4 Price Algorithm and Interpretations  69
6 Numerical Examples  70
  6.1 One-agent Case  70
  6.2 Two-agent Case  70
  6.3 Six-agent Case  73
7 Heuristics  75
  7.1 Acquiring Information  75
  7.2 Inexact Solutions  77
    7.2.1 Collective Extracting Method (CE)  77
    7.2.2 Utilizing the First-Best Solution (FB)  78
    7.2.3 Linear Tariffs  78
  7.3 Comparison  78
    7.3.1 Six-agent Case  79
    7.3.2 Three-agent Case  80
    7.3.3 Two-agent Case  81
8 Discussion  82
  8.1 Summary  82
  8.2 Adjustment Process as a Learning Model  84
  8.3 Uncertainty, Frame Problem and Practical View  85
A Proof of Proposition 1  87
Bibliography  90

Introduction

There are of knowledge two kinds, whereof one is knowledge of fact; the other, knowledge of the consequence of one affirmation to another. The former is nothing else but sense and memory, and is absolute knowledge; as when we see a fact doing, or remember it done; and this is the knowledge required in a witness. The latter is called science, and is conditional; as when we know that; if the figure shown be a circle, then any straight line through the center shall divide it into two equal parts. And this is the knowledge required in a philosopher; that is to say, of him that pretends reasoning. Thomas Hobbes [91, Chapter IX].

But how does a man reason if he does not possess absolute knowledge? How do we model, for example mathematically, his uncertainty of knowledge? If the man were asked what the result of a forthcoming coin toss will be, he would probably answer heads or tails. In this case, it is natural to model his uncertainty with a probability distribution over the possible outcomes, by assigning the probability 0.5 to both events. But what if the man were asked who was the leader of one of the local tribes in California in 10,000 B.C., and he knew practically nothing about pre-Columbian America? Now, we cannot predict his answer. It could be "I don't know", "Luzia"¹, "Arnold", or whatever.

We study this issue in a buyer-seller game, where the seller is uncertain of the buyers' valuation of the good. Until now, this incomplete information has been modeled as external uncertainty²; it is assumed that there is a probability distribution over the possible buyers with different valuations. The seller knows this distribution, and evaluates the price accordingly. In this study, we model the incomplete information as internal uncertainty, that is, the seller is simply unaware or uninformed of the buyers' characteristics. We interpret learning as a way of removing this ignorance.

¹ A 10,500-year-old skeleton discovered in Brazil is called Luzia.
² See Kahneman and Tversky [105] and Kahneman et al. [106] for different kinds of uncertainty. Kahneman and Tversky distinguish external and internal uncertainty. External uncertainty (variability) comes from the external world, for example the coin toss. Internal uncertainty (lack of knowledge) results from one's own mind, ignorance. Fox and Tversky [57] claim that both the degree of uncertainty and its source affect the decision. In a comparative study, people prefer a fair coin toss to ignorance (ambiguity); this was already noticed by Ellsberg [48], see also Raiffa [154]. And according to the comparative ignorance hypothesis, this difference greatly diminishes as the alternatives are evaluated separately.

Based on the successful use of equilibrium by physicists in the 18th century to explain gravity, planetary motion, and other phenomena of physics, economists of the 19th century adopted the notion of equilibrium to study the economic world. Although Turgot [193] mentions the notion in 1766, Cournot [40] and Walras [202] laid the foundation for the study of equilibrium in economics in the 19th century. Besides this notion of balance, economists are interested in studying how systems evolve outside equilibria, and how these equilibria can be reached. Actually, Cournot studied equilibria resulting from a precise adjustment process, which defines how the system evolves. In contrast, Walras studied a trial-and-error tâtonnement process³, which was not as precise as Cournot's adjustment process. Inspired by these, game theorists study strategic equilibria and learning models that explain the movement towards equilibrium. Lately, learning models have also been used to explain the experimental results of game theory, thus providing an explanation of human behavior. Learning models are based on different foundations:

• belief-based models assume Bayesianism, where players hold subjective beliefs and choose their actions according to these beliefs.

• reinforcement models, which arise from psychological and animal tests, increase the propensities of actions that cause positive feedback and decrease the propensities of actions that give negative feedback.

• evolutionary models increase the number of individuals that did well in the population.

Later on, we will try to interpret our learning in view of these different learning models. The adjustment process that we will present is not supposed to explain human behavior in the studied game. Rather, it is an iterative solution method to the problem with very limited initial information.

³ The tâtonnement process describes how markets grope their way iteratively towards equilibrium. The process can be interpreted as a fictitious auctioneer who announces a price for each good, and then collects consumers' orders at these prices. With this information, the auctioneer can adjust the prices: if the demand of a good exceeded the supply, the price of the good is raised; and if the supply exceeded the demand, the price is lowered. This way the process will evolve towards the equilibrium, where the demand equals the supply for each good.

Game theory in the early fifties studied situations in which all the relevant information was known by all the players. Nevertheless, as early as 1838, Cournot [40] studied a model of duopoly, where the firms choose their production quantities of a certain good simultaneously. Cournot was the first to define the concept of a reaction function, providing the action that maximizes the firm's profit when the competitor's production quantity is known. Cournot also noticed that only intersections of reaction functions can be an equilibrium. Actually, Cournot's approach can be considered an example of an adjustment process. Cournot's model of duopoly was inherently dynamic, but so was that of von Stackelberg. In 1934 von Stackelberg presented a model [200] where the firms move sequentially: one firm chooses first, and this action is observed by the other before making his choice. Nash [145, 148] laid the mathematical foundations of non-cooperative game theory in 1950 by introducing an equilibrium solution concept, known as Nash equilibrium, based on players' strategies and best responses. Both Cournot and Stackelberg equilibria are Nash equilibria.

The next major step in game theory was to formalize games with incomplete information, where some players have private information, usually about their utility functions, that the other players do not share. The problem with such games is the assumption that each player should be able to plan his moves at the beginning of the game, see [141, p. 68], and the beginning of the game is before the players have learned the types of the other players. Harsanyi [84] proposed in 1967 a new form of game, where the players first know the joint probability distribution of all the types of all the players, and after that start to play the game. Games of incomplete information can be represented with this new form, and these games are called Bayesian games.

In the 1990s, Fudenberg and Levine [63, 66] proposed that the equilibrium concept should be defined as the result of a long-run process of learning. The question of learning is whether the players can learn the Nash equilibrium strategies by playing the game repeatedly. One approach is Bayesian updating, also known as passive learning, where players have prior beliefs about the unknown distribution of opponents' strategies, they update their beliefs using Bayes rule, and in each period they choose their actions by maximizing the expected payoff given their beliefs. In this case, the play may fail to converge to a Nash equilibrium in extensive form games, because the beliefs may be incorrect at information sets that were not reached along the path of play. It has, however, been shown by Fudenberg and Levine [64] that Nash equilibrium can be reached by another approach known as experimenting or active learning. When a player experiments, he chooses actions which do not necessarily maximize the current expected payoff given his belief, so that he may learn whether his belief is correct at every information set.

In this study, we show that a nonparametric (distribution-free) adjustment process converges to the Bayesian Nash equilibrium in a recurring buyer-seller game. Actually, we motivate the Bayesian Nash equilibrium, because we show that it can be reached even if the utility functions are not known by the seller. Thus, our adjustment process can be compared to Walras' tâtonnement process that motivates the general equilibrium. We call the adjustment process nonparametric because we do not model the incomplete information with a probability distribution. We assume that the seller does not know anything about the buyers, i.e., the number or probabilities of different buyers. Instead, we shall assume that the seller observes the buyers' actions at every round of the game. We assume that the buyers behave myopically; they maximize the current round's utility. This assumption is motivated by a large population model, where one buyer is randomly chosen from a large population to play the game. Hence the buyer will not consider the future, since he is unlikely to play the game again for a while. The buyers' myopic behavior enables the seller to extract enough information from the buyers' actions in order to reach her optimum. The seller evaluates the optimality conditions using the information extracted from the buyers. If the conditions are not met, the seller systematically adjusts her price schedule until the conditions are met. It should be noted that the seller needs to know only certain characteristics of the utility functions in order to learn the solution.

In Chapter 1, we introduce the basic solution concept, Nash equilibrium, and the first example of learning, the Cournot duopoly. The idea of the Cournot duopoly is the basis of the adjustment process we shall study later on. In Chapter 2, we discuss the Bayesian approach to analyzing games with incomplete information. We will see the difficulties of the Bayesian formulation, and these difficulties motivate the iterative approach to the game. Moreover, we introduce the model we will study, the buyer-seller game. In Chapter 3, we derive the Bayesian Nash equilibrium of the problem and the corresponding necessary conditions.

In Chapter 4, we discuss the issues of learning and review the recent literature on it. This chapter serves as a base for discussion of the interpretation of the model. In Chapter 5, we present our adjustment process that has similarities with the learning models in Chapter 4, and is based on the optimality conditions presented in Chapter 3. In Chapter 6, we give numerical examples of the adjustment process. In Chapter 7, we compare different heuristics in solving the problem, and present ways of improving the adjustment process. We summarize the results and present further considerations and interpretations in the discussion chapter. A mathematical proof is presented in Appendix A.


Chapter 1

Solution Concepts and Early Learning

Nash equilibrium¹ describes a concept of equilibrium in non-cooperative game theory. Non-cooperative game theory studies strategic interaction between rational decision makers, who maximize their own payoff. The idea of Nash equilibrium can be applied to various fields of science that do not necessarily involve rational decision making; for example, biology, cognitive science, and telecommunication. But why do we study Nash equilibrium, when we know that the assumption of perfect rationality does not hold in real human behavior? According to Myerson [143]:

• We lack a better foundation; there is no reliably accurate and analytically tractable theory for inconsistency and foolishness.

• In the long run, when the stakes are high, Nash equilibrium describes human behavior better.

• The goal of science is not to predict human behavior, but to analyze social institutions and evaluate proposals for institutional reform.

Why is Nash equilibrium an important concept? Nash defined a "solution" to games with two or more players, and this solution unified the earlier equilibrium concepts. The basic idea behind Nash equilibrium is that no player can benefit from changing his strategy² while the other players keep their strategies fixed.

¹ Nash introduced the concept in [145] and [148], and this originated non-cooperative game theory.
² Von Neumann [198] defined a strategy as a complete plan that specifies a move for each player at each possible stage where he is active, as a function of his information at that stage, see Myerson [143].

The basic assumptions in game theory are that players are rational and intelligent³, and that all the information in the game is common knowledge. A player is rational if he seeks his own pleasure, that is, he maximizes his utility. An intelligent player is assumed to know and reason everything that we, game theorists, know and reason in a game [141]. An event is common knowledge⁴ if all the players know the event, know that all the other players know the event, and so on; Fudenberg and Tirole [68, Chapter 1] present the framework thoroughly. These notions are used in analyzing the play of the game. A Nash equilibrium of a stage game, a game that is played only once, can be defined directly if the previous assumptions are met. If the game is played repeatedly, the players can also try to predict their opponents' behavior by observing the opponents' previous moves. Cournot was the first to notice this, and he proposed an adjustment process where the players are not very foresighted: they assume that their opponents will act as they did in the previous round, and they maximize their utility given this assumption. The Cournot adjustment process is an example of naive fictitious play⁵. In fictitious play, introduced by Robinson [159] and Brown [23], the players assume that the strategies of their opponents are randomly chosen from an unknown stationary probability distribution. In each period, a player forms a belief based on the frequencies of his opponents' actions, and selects a best response by maximizing his utility given this belief. In Cournot adjustment, the belief is reduced to the opponents' previous actions. The adjustment process is naive in the sense that it neglects strategic considerations about the future; for example, the players do not think how their own actions dynamically affect their opponents' future actions. The question of learning is whether this kind of behavior converges, and if it does, whether the corresponding steady state is a Nash equilibrium of the stage game.

In this chapter, we introduce the basic concepts of game theory, examine the classical Cournot duopoly, and describe an adjustment process. We study the Cournot duopoly because it is historically the first adjustment process and an example of a learning model. We will utilize the idea of reacting to the previous actions in Chapter 5.

³ Binmore [13, 14] argues that rationality is not as clear a concept as it seems, and suggests that consistent would be a more appropriate term in game theory. He also mentions that Myerson proposed intelligence as an alternative to rationality.
⁴ Common knowledge was formulated mathematically by Aumann [2] in 1976. According to Brandenburger [20], the term "common knowledge" was first used in this connection by Lewis [120], who attributes the idea to Schelling [172].
⁵ We study fictitious play and different learning models more thoroughly in Chapter 4.
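Fictitious play is treated at length in Chapter 4; the following minimal Python sketch, which is not part of the original thesis and uses matching pennies purely as an illustrative game, shows the mechanics described above: each player keeps empirical counts of the opponent's past actions and best-responds to that belief. The empirical frequencies approach the mixed equilibrium (1/2, 1/2).

```python
# Fictitious play sketch in matching pennies (hypothetical illustrative game):
# player 1 wants to match the opponent's coin side, player 2 wants to mismatch.
counts1 = {"H": 1, "T": 0}    # player 2's belief (counts) about player 1
counts2 = {"H": 0, "T": 1}    # player 1's belief (counts) about player 2

def best_response_matcher(belief):      # player 1: match the more likely side
    return "H" if belief["H"] >= belief["T"] else "T"

def best_response_mismatcher(belief):   # player 2: play the opposite side
    return "T" if belief["H"] >= belief["T"] else "H"

for _ in range(10000):
    a1 = best_response_matcher(counts2)
    a2 = best_response_mismatcher(counts1)
    counts1[a1] += 1                    # update the empirical frequencies
    counts2[a2] += 1

print(counts1["H"] / sum(counts1.values()),
      counts2["H"] / sum(counts2.values()))   # both close to 0.5
```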

1.1 Nash Equilibrium, Best Response, and Pareto Solutions

Before defining Nash equilibrium, we give some definitions for an n-player game. Let si be a strategy for player i ∈ I = {1, . . . , n}, belonging to a feasible strategy set Si, si ∈ Si. Furthermore, we denote s = (s1, . . . , sn), s−i = (s1, . . . , si−1, si+1, . . . , sn), and ui(s) is player i's utility with a strategy profile s; mathematically, ui : S1 × · · · × Sn = S → R. A normal or strategic form game G is defined as a triplet G = (I, S, {ui}i∈I).

Now, a pure strategy⁶ profile s∗ = (s∗1, . . . , s∗n) is a Nash equilibrium if for all players i ∈ I

\[ u_i(s_i^*, s_{-i}^*) \ge u_i(s_i, s_{-i}^*), \quad \forall s_i \in S_i. \tag{1.1} \]

The Nash equilibrium characterizes strategies so that if all players predict the same equilibrium, then no individual player has an incentive to deviate from it. Because the players are assumed to be rational and intelligent, only a Nash equilibrium can be a prediction of the play of the game. Nash equilibria can be calculated by determining each player's reaction function

\[ BR_i(s_{-i}) = \arg\max_{s_i \in S_i} u_i(s_i, s_{-i}). \tag{1.2} \]

The reaction function gives player i's best strategies against the other players' strategies s−i; mathematically, BRi : ×k∈I\{i} Sk = S−i → sets(Si), where sets(Si) is the set of all subsets of Si. The Nash equilibrium can be depicted as a fixed point of the players' reaction functions, s∗i ∈ BRi(s∗−i), ∀i ∈ I.

⁶ A pure strategy defines a player's unique action, whereas a mixed strategy defines a probability distribution over the set of possible actions in every decision. The set of mixed strategies can be defined as σi ∈ Σi = ∆(Si), where ∆(Si) = {σi : Si → R | Σsi∈Si σi(si) = 1, σi(si) ≥ 0 ∀si ∈ Si} is the set of all probability measures on the set Si.

Table 1.1: Traditional Prisoner's Dilemma game. The numbers are the durations of sentences in months for player 1 and player 2, respectively.

                                    Player 2
                          Do not confess    Confess
Player 1  Do not confess       1, 1          10, 0
          Confess              0, 10          8, 8

A traditional Prisoner's Dilemma⁷ game is presented in Table 1.1. Two suspects, player 1 and player 2, are arrested and held in separate cells for questioning. The prisoners face the same problem, to confess or not to confess, without having the possibility of communicating with each other. If neither confesses, both will be jailed for one month. If both confess, they both get eight months. If only one confesses, the confessor will be released and the other will be jailed for ten months. They both notice that it is better to confess regardless of the other's action. Comparing each player's best responses shows that the game has one Nash equilibrium in pure strategies, (Confess, Confess).

We notice that the players could be better off in the Prisoner's Dilemma game if they could agree not to confess. We may argue that a drawback of Nash equilibrium is the fact that the solution is not necessarily Pareto efficient [152]. A solution is Pareto efficient if there is no other solution that gives all the players at least as good an outcome and at least one player a better outcome. Note that in the Prisoner's Dilemma game all strategy pairs other than (Confess, Confess) are Pareto optimal. Especially in games describing negotiation situations [137], the parties are interested in choosing a Nash equilibrium among the Pareto solutions.

⁷ The Prisoner's Dilemma was discovered in 1950 by Melvin Drescher and Merrill Flood. Albert Tucker named it and wrote the first article about it. The dilemma was, however, known earlier. For example, Hobbes [91] argued that without a common power men live in a war of every man against every man. A man's condition in the war was worse than in a society with peace. But some things were needed to build or maintain the society, just as the prisoners' agreement is needed for the better outcome in the Prisoner's Dilemma.
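As a concrete check of the best-response condition (1.1), the following Python sketch, not part of the original thesis, enumerates the action profiles of Table 1.1 and keeps those in which neither prisoner can shorten his own sentence by a unilateral deviation.

```python
# Minimal sketch: pure-strategy Nash equilibria of Table 1.1. Payoffs are
# sentence lengths in months, so each player minimizes his own number.
actions = ["Do not confess", "Confess"]

# sentences[(a1, a2)] = (months for player 1, months for player 2)
sentences = {
    ("Do not confess", "Do not confess"): (1, 1),
    ("Do not confess", "Confess"): (10, 0),
    ("Confess", "Do not confess"): (0, 10),
    ("Confess", "Confess"): (8, 8),
}

def is_nash(a1, a2):
    """(a1, a2) is a Nash equilibrium if each player's sentence equals the
    shortest sentence he could get against the other's fixed action."""
    best1 = min(sentences[(d, a2)][0] for d in actions)
    best2 = min(sentences[(a1, d)][1] for d in actions)
    return sentences[(a1, a2)] == (best1, best2)

print([(a1, a2) for a1 in actions for a2 in actions if is_nash(a1, a2)])
# [('Confess', 'Confess')]
```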

1.2 Cournot Duopoly, the First Equilibrium

Consider two firms, firm 1 and firm 2, producing a homogeneous product in the same market. The game is played once, and the firms make their production decisions simultaneously. Thus, the strategy is the quantity qi ≥ 0 that the firm produces. The production cost of firm i is cqi, c ≥ 0. The market-clearing price P, commonly known by both firms, depends on the total production Q = q1 + q2 linearly, so that P(Q) = a − Q, where a ≥ c. Thus, the utility of firm i is ui(q1, q2) = qi(P(Q) − c). By differentiating with respect to qi and setting the derivative to zero, we get that the reaction functions are Ri(qj) = (a − c − qj)/2. The Nash equilibrium is given by the equations q1∗ = R1(q2∗) and q2∗ = R2(q1∗). This gives the solution⁸ q1∗ = q2∗ = (a − c)/3, see Figure 1.1, where the solid lines are the players' reaction functions. Again, the Nash equilibrium is not Pareto efficient, since, for example, the production quantities (a − c)/4 give a strictly better utility to both players. Hence, we may say that there is a Prisoner's Dilemma hidden also in this game.

⁸ The monopolistic quantity, the solution when there is only one firm, is q∗M = (a − c)/2. Note that q1∗ + q2∗ > q∗M.
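The following sketch, not part of the original thesis, verifies the equilibrium numerically for the illustrative parameter values a = 10 and c = 1 (chosen here only for the example), and also checks the Pareto comparison mentioned above.

```python
# Cournot duopoly of Section 1.2 with illustrative parameters.
a, c = 10.0, 1.0

def R(q_other):
    """Reaction function R_i(q_j) = (a - c - q_j)/2, from du_i/dq_i = 0."""
    return (a - c - q_other) / 2.0

def profit(qi, qj):
    return qi * (a - qi - qj - c)

q_star = (a - c) / 3.0                       # candidate Nash equilibrium quantity
print(q_star, R(q_star))                     # 3.0 3.0: q* is a best response to itself
print(profit(q_star, q_star),                # 9.0 at the equilibrium ...
      profit((a - c) / 4, (a - c) / 4))      # ... 10.125 if both produced (a - c)/4
```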


1.3 Adjustment Process

In fictitious play, players form beliefs of their opponents' strategies based on the history of the game, and select best responses given their beliefs. In Cournot adjustment, players are not very foresighted, for they believe that their opponents will act as they did in the previous stage. Players will choose their actions according to their reaction functions and their opponents' previous quantities. The idea was to find out whether the Cournot adjustment process results in a Nash equilibrium⁹. Indeed, the outcome is a Nash equilibrium if the process converges. One path of the Cournot adjustment process is shown in Figure 1.1 as a dashed line.

Figure 1.1: The Cournot adjustment process. The process, depicted by a dashed line, starts from a star, moves from one reaction curve to another according to the firms' reactions to their opponents' previous strategies, and converges to the Nash equilibrium marked by a diamond.

In the first period, firm 1 produces quantity q₁⁰. In the second period, firm 2 reacts to this by producing q₂⁰ = R₂(q₁⁰). Then firm 1 reacts to this, q₁¹ = R₁(q₂⁰), and so on. The idea of reacting to the outcome of the previous period can be modified to improve the convergence of the process in games with nonlinear reaction functions. The players may, for example, react to the average of the opponent's previous plays. Similarly, we modify the adjustment process in Chapter 5 to suit better the situation we study. We make a prediction of the opponents' play based on the history of the game. In addition, the adjustment process is enhanced by taking into account the necessary conditions of the problem. A very remarkable property of the adjustment approach is that it alleviates the strict assumption of the game formulation in Section 1.1 that all the information is available to all the players. Even if players do not know the opponents' utilities, they will learn the equilibrium from their opponents' play [68]. We utilize this idea of adjustment in the method that we present in Chapter 5.

⁹ Cournot duopoly was introduced in [40], and it preceded Nash equilibrium by a century. It is debated whether Cournot should get credit for the non-cooperative equilibrium, see Myerson [143, 142]. If only the basic concepts of game theory (strategy [198], strategic normal and extensive games, and utility theory [199]) had been known at the time of Cournot, we would be studying Cournot equilibrium.
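A minimal numerical sketch of the adjustment path of Figure 1.1, again with the illustrative values a = 10 and c = 1 (not from the thesis): each firm in turn best-responds to the other firm's latest quantity, and the quantities converge to (a − c)/3.

```python
# Cournot adjustment process: alternating naive best responses.
a, c = 10.0, 1.0
R = lambda q_other: (a - c - q_other) / 2.0   # reaction function of Section 1.2

q1 = 8.0                                       # arbitrary starting quantity q1^0
for _ in range(25):
    q2 = R(q1)                                 # firm 2 reacts to firm 1's last quantity
    q1 = R(q2)                                 # firm 1 reacts to firm 2's new quantity

print(round(q1, 6), round(q2, 6), (a - c) / 3)  # both approach 3.0
```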


Chapter 2

Bayesian Games and a Buyer-Seller Setting

Information plays an important role in game theory. The question is not only what players know but what players know about the other players' knowledge. In this chapter, we introduce the basic concepts of information and the Bayesian game. Harsanyi's [84] renowned idea was to interpret games where players have asymmetric information at the beginning of the game in a Bayesian framework. Following Harsanyi, games of incomplete information are often called Bayesian games in the literature, even though Bayesian games themselves are not games of incomplete information. In Section 2.1, we introduce the Bayesian game, which is a game of complete information and can be solved. In Section 2.2, we explain the problem of incomplete information and why such games could not be solved directly. The Bayesian game attempts to model the incomplete information and make it solvable. Under certain assumptions, the Bayesian game is equivalent to the game of incomplete information, and thus the solution of the Bayesian game applies to the game of incomplete information.

Furthermore, in this chapter we introduce a buyer-seller game, which will be studied in this thesis; for literature, see Fudenberg and Tirole [68, Chapter 7], Rasmusen [156], and Varian [195]. We will examine the consequences of asymmetric information in the buyer-seller game, and present alternative interpretations of different buyer-seller situations. We discuss whether the Bayesian model is an appropriate description of the actual buyer-seller situation. We also consider other models to which our analysis is applicable. It turns out that the buyer-seller game is an example of mechanism design and of a principal-agent problem with hidden information.

2.1 Bayesian Game

In games of complete information, each player has all the relevant information to play the game, and this is common knowledge. By incomplete information, we mean that some or all of the players lack full information about the "rules" of the game. For example, the players may not know the other players' or their own utility functions, the strategy spaces, or the amount of information the other players have about various aspects of the game situation, and so on, see Harsanyi [84]. In the Cournot duopoly, the incomplete information could, for example, be uncertainty about the other player's cost. This incomplete information can be characterized with a probability distribution, for example, c = cH with probability pH for a high cost, and c = cL with probability pL = 1 − pH for a low cost, where 0 ≤ pH ≤ 1.
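To illustrate this cost example, the following sketch, not part of the original thesis and using hypothetical numbers, computes the Bayesian equilibrium quantities by iterating best responses: firm 1 responds to the expected quantity of firm 2, and each cost type of firm 2 responds to firm 1's quantity.

```python
# Cournot duopoly in which firm 2's unit cost is private information:
# c2 = cH with probability pH and c2 = cL otherwise. Illustrative numbers.
a, c1 = 10.0, 1.0            # demand intercept and firm 1's (common-knowledge) cost
cL, cH, pH = 0.5, 2.0, 0.4

q1, q2L, q2H = 1.0, 1.0, 1.0          # initial guesses
for _ in range(200):
    Eq2 = pH * q2H + (1 - pH) * q2L   # firm 1 faces the expected quantity of firm 2
    q1  = (a - c1 - Eq2) / 2.0        # best response of firm 1
    q2L = (a - cL - q1) / 2.0         # best response of the low-cost type of firm 2
    q2H = (a - cH - q1) / 2.0         # best response of the high-cost type of firm 2

print(round(q1, 4), round(q2L, 4), round(q2H, 4))   # ~3.0333, 3.2333, 2.4833
```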

The type of the player, θi ∈ Θi, is determined by the player's private information, where Θi is the possible type set of player i. Thus, the type of player i holds all the information that player i knows and that is not common knowledge. In the Cournot example above, the type set is {cL, cH}. The belief of player i, pi(θ−i | θi), describes what player i believes about the other players' types, θ−i, given her own type θi. With these notions of information, we state the definition of a Bayesian game; for literature, see Fudenberg and Tirole [68, Chapter 6.1] and Myerson [141, Chapters 2.8 and 2.9]. In a Bayesian game, the following are assumed to be common knowledge:

• the strategy spaces Si,

• the type spaces Θi,

• the utility functions ui(s, θ), or ui : S × Θ → R, where S = ×i∈I Si and Θ = ×i∈I Θi,

• the prior probability distribution p ∈ P ≡ ∆(Θ)¹ = {p : Θ → R | Σθ∈Θ p(θ) = 1, p(θ) ≥ 0 ∀θ ∈ Θ}, which describes the possible types and their distribution for all of the players². We call the distribution p a common prior.

• players are Bayesian³.

¹ ∆(A) is the set of all probability measures on a set A.
² Also known as the common prior assumption (CPA). This controversial assumption has been discussed recently in the literature. After Harsanyi's suggestion, Aumann [2] has led the discussion on the subject, and many characterizations of the CPA have been given: in terms of primitive properties of belief hierarchies by Bonanno and Nehring [19], in terms of "Mutual Calibration" by Nehring [150], and in terms of i) "frame distinguishability" and ii) a sound and complete axiomatization by Halpern [80]. See the references within for the earlier discussion on the subject. See also Mertens and Zamir [134] for the mathematical construction of the universal belief space.
³ A Bayesian player will assign a subjective joint probability distribution to all variables unknown to him. Once this has been done, he will try to maximize the expectation of his payoff in terms of this probability distribution [84]. See Savage [171] for Bayesianism, or subjectivism.

The n-player static Bayesian game proceeds as follows:

a) In the beginning of the game, nature draws a type vector, θ = (θ1, . . . , θn), according to the common prior p.

b) Nature reveals to each player i only her own type θi.

c) Players choose their actions simultaneously.

d) Players receive their payoffs according to their types and to all the players' actions.

When nature reveals the type θi to player i, the player updates her belief using Bayes rule

\[ p_i(\theta_{-i} \mid \theta_i) = \frac{p(\theta_{-i}, \theta_i)}{p(\theta_i)} = p(\theta_{-i}, \theta_i) \Big/ \sum_{\theta_{-i} \in \Theta_{-i}} p(\theta_{-i}, \theta_i). \tag{2.1} \]

The Bayesian game is defined as a list G = (I, S, Θ, p, {ui}i∈I). A strategy s∗(θ), which defines a strategy for each type, is a pure-strategy⁴ Bayesian Nash equilibrium (in a finite game) if for all players i and each type θi ∈ Θi

\[ s_i^*(\theta_i) \in \arg\max_{s_i \in S_i} \sum_{\theta_{-i} \in \Theta_{-i}} u_i(s_i, s_{-i}^* \mid \theta_i, \theta_{-i}) \, p_i(\theta_{-i} \mid \theta_i). \tag{2.2} \]

The equilibrium describes strategies such that each type of each player maximizes his expected utility given her belief, and this belief is derived from the common prior.

⁴ Harsanyi [85] proposed that practically all randomized equilibria of a normal-form game could be interpreted as a pure equilibrium of a similar Bayesian game in which each player has some independent private information, see Myerson [143]. This changed fundamentally the interpretation of mixed strategy equilibria.
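The following Python sketch makes condition (2.2) concrete by brute force in a tiny two-player game invented here only for illustration: it enumerates all type-contingent pure strategies and keeps the profiles in which every type of every player maximizes expected utility under the conditional beliefs (2.1).

```python
# Brute-force search for pure-strategy Bayesian Nash equilibria (2.2) in a
# hypothetical game: player 1 has one type, player 2 has two types ("lo"
# likes to match player 1's action, "hi" likes to mismatch).
from itertools import product

actions = {1: ["a", "b"], 2: ["a", "b"]}
types   = {1: ["x"], 2: ["lo", "hi"]}
prior   = {("x", "lo"): 0.5, ("x", "hi"): 0.5}      # common prior over type pairs

def u(i, s1, s2, t2):
    match = 1.0 if s1 == s2 else 0.0
    return match if (i == 1 or t2 == "lo") else 1.0 - match

def belief1(t1):
    """Player 1's conditional belief (2.1) over player 2's type."""
    tot = sum(prior[(t1, t2)] for t2 in types[2])
    return {t2: prior[(t1, t2)] / tot for t2 in types[2]}

def eu1(t1, a1, s2):
    b = belief1(t1)
    return sum(b[t2] * u(1, a1, s2[t2], t2) for t2 in types[2])

def eu2(t2, a2, s1):
    return u(2, s1["x"], a2, t2)          # player 1 has a single type

def is_bne(s1, s2):
    ok1 = all(eu1(t, s1[t], s2) >= eu1(t, a, s2) - 1e-12
              for t in types[1] for a in actions[1])
    ok2 = all(eu2(t, s2[t], s1) >= eu2(t, a, s1) - 1e-12
              for t in types[2] for a in actions[2])
    return ok1 and ok2

S1 = [dict(zip(types[1], c)) for c in product(actions[1], repeat=len(types[1]))]
S2 = [dict(zip(types[2], c)) for c in product(actions[2], repeat=len(types[2]))]
print([(s1, s2) for s1 in S1 for s2 in S2 if is_bne(s1, s2)])
# Two equilibria are found: player 1 is indifferent, and each type of
# player 2 best-responds to player 1's action.
```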


2.2 Games of Incomplete Information

In this section, we discuss the problem of incomplete information and show that the Bayesian game represents the game of incomplete information if the common prior assumption holds. But first, we introduce more concepts of game theory. In the framework of Cournot's duopoly model, Stackelberg [200] proposed a sequential game, where the firms move one after another. This duopoly model can be interpreted so that one firm, the leader, moves first and the other, the follower, moves next. The sequential game can be represented by an extensive form, or a game tree, which describes the players, the strategies, the order of moves, the information the players have as they make their decisions, and the payoffs in the game. In sequential games, we say that the game has perfect information if each player knows all the previous moves at the time of making his decision, see [68, Chapter 3.3]. Therefore, perfect information in games refers to observing the other players' past actions. In games with imperfect information, players do not exactly know or remember the past moves of other players.

Harsanyi [83] explains that the problem of incomplete information is an infinite process of taking reciprocal expectations over players' different-order beliefs. In short, a player must estimate her opponent's characteristics, her opponent's estimate, her opponent's estimate of her own estimate, and so on ad infinitum. Thus, the problem of not knowing could not be modeled. However, a few years later Harsanyi [84] found a solution to the problem by modeling what players do not know. Harsanyi suggested that instead of not knowing anything, a player knows some possible alternatives for her opponent's characteristics and a probability distribution over these alternatives; and this distribution is common knowledge. Thus, the player and her opponent mutually know what she does not know. Harsanyi [84, Part I] proposed that the original game, having incomplete information and difficulties in analysis, could be transformed into a Bayesian game, having complete but imperfect information⁵. Harsanyi presumes that the players are Bayesian, and that there is a joint prior probability distribution, which describes the uncertainty in the game.

⁵ The reader is encouraged to compare the concepts of incomplete and imperfect information in view of the variants of uncertainty described in the beginning of the Introduction. Under imperfect information, all the players know the probabilities and the number of faces of the dice, whereas under incomplete information, the players might not know the number of faces, or the probabilities can be ambiguous.

The transformed game is indeed of complete information, because of the common prior assumption and because all the information is common knowledge. Furthermore, the information is imperfect, because the players observe only their own types, not the other types, before they choose their actions. Thus, this kind of game can be solved with the existing theory of games with imperfect information. However, before raising the flag as a sign of victory, Harsanyi [84, Section 5] needed to show that the original and the transformed games are identical. Indeed, they can be shown to be game-theoretically identical (Bayes-equivalent) if the structures of the games are identical and if the beliefs of the players are the same in the games. Namely, the subjective probabilities of the original game must be the same as the conditional probabilities of the transformed game. But they are, because we assumed that there is a common prior, the players are Bayesian, and they update their distribution by Bayes rule. Thus, the consistency requirement is satisfied⁶: (a) in the beginning of the game, the players have the same information (the common prior), and (b) each type is just the conditional probability distribution that can be computed from the prior distribution by Bayes rule.

⁶ Barelli [4] shows that the CPA can be replaced by the less restrictive action-consistency, thus providing one way out of the controversies of the CPA. For example, Harsanyi [84] explained in Section 16 (Part III) that the players will agree on the common prior only if the discrepancies among the various players' probability judgments can be reasonably explained in terms of differences in their information. This was also noted by Aumann [3]. Now, under action-consistency the differences of beliefs can come from differences in opinion. From an epistemological view, the probabilities should not indeed be based only on information. Another way out of the strict CPA is presented by Kajii and Ui [107] with the help of multiple priors.

2.3 Buyer-Seller Game

The game of incomplete information that we will study is a buyer-seller game, where the seller does not know the preferences of the buyers. A seller, a monopolist, produces a good for a market. The seller's task is to construct a tariff, which specifies a price for the good as a function of quantity, that gives her the best expected profit over the possible buyers.

First, suppose that there is only one buyer in the market, and the information is complete. To solve the problem, the seller needs to offer the buyer an amount x at a price t. The seller maximizes her profit π(x, t) = t − c(x), where c(x) is the production cost of the good. The buyer accepts the offer if he gets nonnegative total utility, U(x, t) = V(x) − t ≥ 0, where V(x) gives the buyer's valuation of the good. The optimal price is t∗ = V(x∗), and the optimal amount is given by V′(x∗) = c′(x∗); the amount that satisfies this condition is called the socially optimal amount. Under certain assumptions, there is a unique solution to the seller's problem in linear tariffs of the form t(x) = ax + b, where b is a fixed fee and a is a unit price, and the tariff is a joint tangent of the production cost c(x) and the utility function V(x), see Ehtamo et al. [46]. The optimal linear tariff is presented in Fig. 2.1.


Figure 2.1: The unique optimal linear tariff is a joint tangent. The thick solid, the solid, the dashed, and the dotted lines are the optimal tariff, the principal’s and the agent’s indifference curves, and the Pareto-optimal solutions, respectively.
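As a numerical illustration of the one-buyer problem, the following sketch solves V′(x∗) = c′(x∗) by bisection and sets t∗ = V(x∗). The functional forms V(x) = θ ln(1 + x) and c(x) = x²/2 are assumptions made only for this example, not the forms used in the thesis.

```python
# First-best solution for a single buyer under complete information.
import math

theta = 4.0
V  = lambda x: theta * math.log(1.0 + x)   # buyer's gross surplus, concave
Vp = lambda x: theta / (1.0 + x)           # V'(x)
c  = lambda x: 0.5 * x**2                  # production cost, convex
cp = lambda x: x                           # c'(x)

# Bisection for V'(x) = c'(x): V' - c' is positive at 0 and negative at 10.
lo, hi = 0.0, 10.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if Vp(mid) > cp(mid) else (lo, mid)

x_star = 0.5 * (lo + hi)                   # socially optimal amount
t_star = V(x_star)                         # price extracts the whole surplus
print(round(x_star, 4), round(t_star, 4), round(t_star - c(x_star), 4))
```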

Second, suppose that there are two buyers, a low and a high consumer type. The information is complete, and the seller observes the buyer's type. The seller serves both types of buyers and constructs a distinct offer for each type. The solution is similar to the previous one; each type gets the socially optimal amount, and the price is the buyer's valuation of the good, Vi′(xi∗) = c′(xi∗), ti∗ = Vi(xi∗), i = L, H. This is the so-called first-best solution; the seller gets the maximal profit.

Now, suppose the information about the type is incomplete. The buyers' type, the private information, is their utility function Vi(x), i.e., how much they appreciate the good. The game can be modified into a situation where the seller knows the possible types and their distribution, but does not observe the current buyer. The solution in the two-buyer case is such that the seller offers a tariff that maximizes her expected profit over her beliefs about the different types, subject to the constraint that each buyer maximizes his utility. The solution⁷ is given by

\[ V_H'(x_H^*) = c'(x_H^*), \qquad p_L\,\bigl(V_L'(x_L^*) - c'(x_L^*)\bigr) + p_H\,\bigl(V_L'(x_L^*) - V_H'(x_L^*)\bigr) = 0, \]
\[ t_L^* = V_L(x_L^*), \qquad t_H^* = t_L^* + V_H(x_H^*) - V_H(x_L^*), \]

where pi is type i's probability of arrival. This is the so-called second-best solution, which obviously gives less profit than the first-best solution due to the asymmetry of information. The difference is that the seller cannot give each type an individual offer, but must serve both buyers with the same tariff, since she cannot observe the buyer's type. Suppose the seller offered the first-best solution to both buyers. The high type would not take the offer intended for him, which would give him zero utility, but the other offer, which would give him positive utility. In fact, in the second-best solution the amount that is sold to the low type is smaller and the prices are lower compared to the first-best solution. Actually, the high type benefits from the informational advantage. In both solutions, the seller gives two offers to the buyers: a smaller and cheaper offer for the lower type and a bigger and more expensive offer for the higher type. The first-best ((x_L^{1*}, t_L^{1*}), (x_H^{1*}, t_H^{1*})) and the second-best ((x_L^{2*}, t_L^{2*}), (x_H^{2*}, t_H^{2*})) solutions are presented in Fig. 2.2.


Figure 2.2: The first-best and second-best solutions. The solid, the dashed, and the dash-dotted lines are the buyer’s, low type’s, and high type’s indifference curves, respectively. Notice that (x∗H , t2H ) is sold to the high type, if the first-best amounts are sold and the types are not observed. This does not clearly give very good profit, for the price t2H is very low.

⁷ This is derived in Chapter 3.

The seller's cost of incomplete information is due to the fact that the seller cannot observe the buyer's type. The buyers' informational advantage can be measured by the information rent, which is the difference of prices in the first-best and the second-best solutions. For example, the high type's information rent due to the low type is VH(xH) − tH = VH(xH) − (tL + VH(xH) − VH(xL)) = VH(xL) − VL(xL), where the first and the second equalities follow from the facts that the high type is indifferent between the two offers and that the low type is charged his valuation of the good, respectively. The information rents and an optimal piecewise linear tariff are presented for a three-agent case in Fig. 2.3.


Figure 2.3: The optimal piecewise linear tariff and the information rents in a three-agent case. The thick solid, the solid, the dashed, and the dash-dotted line are the optimal tariff, the lowest type’s, the middle type’s and the highest type’s indifference curves, respectively. ∆1 and ∆2 are the information rents of the middle type due to the lowest type and the highest type due to the middle type, respectively.
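The following sketch applies the two-type optimality conditions quoted above to illustrative utility and cost functions, compares the first-best and second-best solutions, and prints the high type's information rent. All numbers and functional forms are assumptions for the example, not values used in the thesis.

```python
# First-best and second-best solutions for two buyer types (Section 2.3).
import math

thetaL, thetaH, pL = 3.0, 5.0, 0.5
pH = 1.0 - pL
V  = lambda th, x: th * math.log(1.0 + x)   # V_i(x), increasing in the type
Vp = lambda th, x: th / (1.0 + x)
c  = lambda x: 0.5 * x**2
cp = lambda x: x

def solve(f, lo=0.0, hi=50.0):
    """Bisection for a decreasing f with f(lo) > 0 > f(hi)."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# First best: V_i'(x_i) = c'(x_i) and t_i = V_i(x_i) for each type separately.
xL1 = solve(lambda x: Vp(thetaL, x) - cp(x))
xH1 = solve(lambda x: Vp(thetaH, x) - cp(x))

# Second best: no distortion at the top, distorted amount for the low type,
# and prices from the binding IR (low type) and IC (high type) constraints.
xH2 = xH1
xL2 = solve(lambda x: pL * (Vp(thetaL, x) - cp(x)) + pH * (Vp(thetaL, x) - Vp(thetaH, x)))
tL2 = V(thetaL, xL2)
tH2 = tL2 + V(thetaH, xH2) - V(thetaH, xL2)

profit_first  = pL * (V(thetaL, xL1) - c(xL1)) + pH * (V(thetaH, xH1) - c(xH1))
profit_second = pL * (tL2 - c(xL2)) + pH * (tH2 - c(xH2))
info_rent_H   = V(thetaH, xL2) - V(thetaL, xL2)   # high type's information rent

print("amounts:", round(xL2, 3), "<", round(xL1, 3), "and", round(xH2, 3))
print("profits:", round(profit_second, 3), "<", round(profit_first, 3))
print("information rent of the high type:", round(info_rent_H, 3))
```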

2.4 Mechanism Design

In a buyer-seller game, the seller’s tariff design problem is an example of mechanism design, the art of designing the rules of the game so that a desirable outcome is chosen and the agents are motivated to report their preferences truthfully [39]. Mechanism design is usually a three-step game of incomplete information, see Fudenberg and Tirole [68, Chapter 7] and Myerson [140]:


a) The designer offers a mechanism to the other players. A mechanism is a game in which the players send costless messages, where the agents report their possibly incorrect types, and the outcome depends on these messages.

b) The other players simultaneously accept or reject the mechanism.

c) Those who accepted the mechanism play the game accordingly.

The designer must give the players an incentive to report their types truthfully, a set of constraints known as incentive-compatibility (IC)⁸,

\[ U(\theta_i) = \max_{\hat\theta_i \in \Theta_i} U(\hat\theta_i \mid \theta_i), \quad \forall \theta_i \in \Theta_i, \ \forall i \in I. \]

The incentive-compatible mechanisms are called implementable mechanisms. Moreover, implementable mechanisms that also satisfy the individual rationality (IR) constraints, U(θi) ≥ 0, ∀θi ∈ Θi, ∀i ∈ I, are called feasible. The IR and IC constraints are introduced in the buyer-seller context in Chapter 3, where the game is analyzed. We are interested in implementable mechanisms because of the revelation principle. The principle states that any Bayesian Nash equilibrium of any Bayesian game can be implemented by an incentive-compatible direct mechanism. Direct mechanisms are those where the message space is exactly the type space. The key idea of the revelation principle is that we may concentrate on designing an implementable mechanism whose message space is the type space. Thus, in the buyer-seller game we need to design an offer for each type, noting that each type should prefer his own offer.

Mechanism design, also known as adverse selection⁹, see [5, 138]¹⁰, is a principal-agent problem [76]. In a principal-agent problem, a principal contracts an agent to act on her behalf, see e.g. [161]. In economic problems, the agent is not necessarily an employee of the principal. The principal is the party that proposes the contract, and the agent either accepts the contract and chooses his action according to the contract, or rejects the contract. Usually, the principal is the party that has less information, sometimes called the uninformed player, and the agent is the informed player.

⁸ If U(θi | θi) < U(θ̂i | θi), type θi has an incentive to report his type deceitfully as θ̂i.
⁹ The adverse selection problem was introduced in [1], where Akerlof studied a car sales problem. The hidden information was the quality of the car.
¹⁰ Mussa and Rosen studied price discrimination by a monopolist in a quality-differentiated product market, where the consumers have different tastes for quality. Baron and Myerson studied how to regulate a monopolistic firm whose costs are unknown to the regulator.
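The sketch below, not part of the original thesis, illustrates how the IC and IR constraints are checked for a direct mechanism, i.e. a menu of bundles indexed by the reported type. The menu and utility function are hypothetical and are constructed so that the low type's IR constraint and the high type's IC constraint bind, as in Section 2.3.

```python
# Checking incentive compatibility (IC) and individual rationality (IR)
# for a hypothetical two-type direct mechanism.
import math

types = {"L": 3.0, "H": 5.0}
V = lambda th, x: th * math.log(1.0 + x)          # type-dependent gross surplus

xL, xH = 0.618, 1.791                              # illustrative amounts
tL = V(types["L"], xL)                             # low type pays his full valuation
tH = tL + V(types["H"], xH) - V(types["H"], xL)    # high type made indifferent
menu = {"L": (xL, tL), "H": (xH, tH)}              # report -> (amount, price)

def U(report, true_type):
    x, t = menu[report]
    return V(types[true_type], x) - t

ic = all(U(tt, tt) >= U(r, tt) - 1e-9 for tt in menu for r in menu)
ir = all(U(tt, tt) >= -1e-9 for tt in menu)
print("implementable (IC):", ic, "  feasible (IC and IR):", ic and ir)   # True True
```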



The principal-agent problems can be divided into two classes¹¹ [122, 164]:

• Adverse selection describes problems of hidden information. In these problems, the agent has private information, and the principal needs to provide an incentive for the agent to reveal the information. The buyer-seller game is an example of adverse selection.

• Moral hazard describes problems of hidden action. In these problems, the principal cannot observe the agent's action directly, but needs to come up with a contract where the agent chooses the action that is best for the principal. Consider, for example, a worker producing a good, where the principal cannot observe his effort. The principal can, however, observe the output, which depends on the effort and an exogenous stochastic variable. The parties may have different beliefs about the stochastic part.

2.5 Interpretations of Different Buyer-Seller Settings

First, let us study stage games, games that are played once. With complete and perfect information, the principal conditions the tariff on the agent's type. In the multi-agent model, one of the agents is chosen, and the principal constructs an individual offer for him after observing the type. This is the first-best solution.

With complete but imperfect information, the principal does not observe the agent's type. This is the adverse selection problem that we study. The game can be seen as a Bayesian game, where the principal knows the possible types and has a belief over them. The principal constructs a tariff that maximizes her expected profit over her belief. This leads to the second-best solution. The model can be interpreted as a one-agent model, where the principal has uncertainty over the type, or as a multi-agent model, where the principal has to offer the same tariff to all the agents¹².

The multi-agent model may differ from the one-agent model if all the agents act in the same round. Then, the probabilities of arrival are interpreted as fractions of the population, and the cost depends on all the agents' choices, for example, c(Σi xi). However, we will not consider this model any further.

¹¹ Some problems may well belong to both of these classes.
¹² A law may oblige the seller to offer all the buyers the same tariff without discriminating against anyone.

Now, consider a situation, where the principal does not know the possible utility functions and the game is repeated. The Bayesian model with the principal updating his belief does not describe the situation sufficiently13 . The question is whether there is a parameterization for the type. And if there is, can the principal know it and can it be learned. Moreover, it makes a big difference whether there is only one agent or many. We propose an alternative to the Bayesian approach in modeling incomplete information. In this case, the private information is the whole utility function14 . The principal cannot model the uncertainty with a finite dimensional parameter. We suggest that the principal learns the agents’ utilities over time. We continue to develop the idea in Chapter 5.


13 We are not implying that Bayesian modeling should be forgotten. The Bayesian model depicts analytically the effect of information. The learning models and heuristics are, however, associated with the real situation, describing how people, or firms, behave.
14 Carlier [36] studied a nonparametric approach to adverse selection, in which the type was the whole utility function, which could not be reduced to a finite-dimensional parameter.


Chapter 3

Buyer-Seller Game

We study more thoroughly the buyer-seller game (Bayesian game) that was presented in the previous chapter. We simplify the problem by making assumptions on the agents' utility functions, reservation utilities, and the distribution of types. With these assumptions, we develop sufficient conditions for the principal's problem. The learning method that we shall present in Chapter 5 is based on these conditions. The first observation is that with n agents, the principal needs to give at most n offers to the agents to implement the optimal tariff. Second, the optimal prices depend on the optimal amounts, given certain assumptions. Thus, the principal's problem is simplified to finding at most n amounts. Third, the amounts in the second-best solution are bounded by the first-best solution. This characteristic of the game can be exploited later on in the learning method, when the principal does not know the possible types but knows the first-best solution, which can be found more easily.

3.1 Model

A principal, a monopoly, sells an amount x ≥ 0 of the good to the agents. The game can be seen as a two-stage game. In the first stage, the principal offers a tariff t(x), which gives the price of the good as a function of the amount, to the agents. In the second stage, one agent is randomly chosen among all the agents. We assume that there is a discrete number of different types of agents, indexed with numbers {0, 1, . . . , H} = J, and the principal knows the distribution of the types. The agent can either accept or reject the proposed tariff. If he rejects the tariff, he obtains a reservation utility; otherwise he chooses the amount he wishes to consume. We suppose that the production cost of the good is increasing and convex in the amount, and twice continuously differentiable, i.e., c'(x) ≥ 0, c''(x) > 0, c'(0) = 0.

The principal's task is to construct a tariff without knowing which type is chosen. Given a tariff t(x), the agent i ∈ J chooses an amount x_i, and the outcome of the game is one of the price-amount bundles (x_i, t_i), where t_i = t(x_i), if the agent accepts the tariff. Actually, the principal can design optimal bundles instead of the whole tariff t(x); this is sometimes called the bundle approach. The sufficient number of bundles can be less than the number of types. This will happen if it is profitable to serve only some of the highest classes of buyers, or if some of the buyers are served with the same bundle; this is known in the literature as bunching or pooling. From now on, the optimal amounts and prices refer to the bundle solution of the principal's problem. The principal is maximizing the expected profit

\pi(x, t) = \sum_{i \in J} p_i \,[t_i - c(x_i)],   (3.1)

where x = (x_0, . . . , x_H), t = (t_0, . . . , t_H), and p_i is the probability of arrival for type i. The agent i enjoys a gross consumer surplus V_i(x_i) by consuming the amount x_i of the good. The agent's private information is now the whole utility function V_i(x). The gross surplus is increasing and concave in the amount, and twice continuously differentiable, V_i'(x) ≥ 0, V_i''(x) < 0. In addition, the gross surplus and its slope are increasing in the type index, V_i(x) − V_k(x) > 0, V_i'(x) − V_k'(x) > 0, when x > 0 and i > k. With a given tariff, the agent i chooses the amount by maximizing his total utility U_i(x). We suppose that the function U_i, i ∈ J, is of the form

U_i(x) = V_i(x) - t_i.   (3.2)

This form of utility function is called quasi-linear, which means that the agent's valuation of the good does not depend on the price of the good. In designing the tariff, the principal faces two kinds of constraints. The price of the good must be low enough so that the agent is willing to make the purchase. Remembering the bundle approach, this can be formulated with the help of the reservation utility. The reservation utility is the utility experienced by the agent if he rejects the principal's offer and does not buy any good. The reservation utility may vary with respect to the type [103], but in this case we assume the reservation utility to be the same for each type. Without loss of generality, we assume the reservation utility to be zero. Now we may state the so-called individual-rationality (IR) or (voluntary) participation constraints

V_i(x_i) - t_i \geq 0, \quad \forall i \in J.   (3.3)

The constraints state that, without any bargaining power [99], the agent accepts the offer if accepting does not entail negative utility1. The second type of constraint is about the agents' rationality. Because the principal cannot observe the agent's current type and offers all the bundles, the agent may take any bundle he wishes. If the principal offers bundles (x_i, t_i), i ∈ J, she wishes each agent to choose the bundle intended for his type. From this requirement, we obtain the incentive-compatibility (IC) constraints

V_i(x_i) - t_i \geq \max_{j \in J \setminus \{i\}} \,[V_i(x_j) - t_j], \quad \forall i \in J.   (3.4)

The IC constraints are also known in the literature as self-selection, revelation, and truth-telling constraints. The IC constraints induce the adverse selection problem, or second-degree price discrimination. Now we can formulate the principal's problem as maximization of the expected profit, Eq. (3.1), over the price-amount bundles that satisfy the IR and IC constraints, Eqs. (3.3) and (3.4).
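To make the constraint structure concrete, the following sketch checks whether a candidate menu of bundles is feasible, i.e., satisfies the IR constraints (3.3) and the IC constraints (3.4). It is only an illustration of the constraints, not part of the thesis model; the quadratic surplus functions and the numerical menu are hypothetical choices made for this example.

```python
def make_V(theta):
    # Hypothetical quadratic gross surplus: increasing and concave, slope rising in theta.
    return lambda x: theta * x - 0.5 * x ** 2

V = [make_V(theta) for theta in (1.0, 1.5, 2.0)]   # hypothetical types 0, 1, 2

def is_feasible(bundles, V, tol=1e-9):
    """bundles[i] = (x_i, t_i); True if all IR (3.3) and IC (3.4) constraints hold."""
    for i, (x_i, t_i) in enumerate(bundles):
        u_own = V[i](x_i) - t_i
        if u_own < -tol:                                   # IR constraint (3.3)
            return False
        for j, (x_j, t_j) in enumerate(bundles):
            if j != i and u_own + tol < V[i](x_j) - t_j:   # IC constraint (3.4)
                return False
    return True

# A menu constructed so that IR binds for the lowest type and each IC binds
# toward the previous bundle, the structure derived later in this chapter.
menu = [(0.4, 0.32), (0.8, 0.68), (1.2, 1.08)]
print(is_feasible(menu, V))   # True
```

The small tolerance is needed because the binding constraints hold with equality up to floating-point error.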

3.2 Formulation of the Necessary Conditions

Before we proceed to solve the principal's problem, we need to present a few important assumptions. Besides the earlier assumptions, we assume that the utility functions can be sorted, the so-called single-crossing condition2 [135],

V_i(x_2) - V_k(x_2) > V_i(x_1) - V_k(x_1), \quad \forall x_2 > x_1,\ i > k.

The single-crossing condition means that the agent's marginal utility from increasing the amount is increasing in type. We guarantee this condition by assuming that V_{i+1}(x) > V_i(x) and V'_{i+1}(x) > V'_i(x), i = 0, \ldots, H-1.

With these assumptions, we can simplify the principal's problem; the simplification was introduced by Baron and Myerson [5], see also Mussa and Rosen [138].

1 One might argue that the agent has no bargaining power in principal-agent problems, because he can only accept or reject the offer. Later on in Chapter 5, we refine the role of the agent by introducing large population models.
2 Also known as the Spence-Mirrlees condition, due to Spence's [182, 183] and Mirrlees' [135] contributions on the subject.


Namely, we can find the equations for the optimal prices and reduce the original problem to finding only the optimal amounts to be sold. To find the equations, we first collect the required properties of the problem in Proposition 1. These properties have been previously studied in the discrete case by Maskin and Riley [124] and in the continuous case, for example, by Fudenberg and Tirole [68]. Some parts of the proof are identical to the literature, but some parts, a) and e), are proven slightly differently.

Proposition 1 The assumptions V_{i+1}(x) > V_i(x), V'_{i+1}(x) > V'_i(x), i = 0, \ldots, H-1, and constant reservation utility imply the following facts.

a) At least one of the constraints is binding for each price.

b) At least one of the IR constraints is binding.

c) The only binding IR constraint is the one for the lowest type, t_0^* = V_0(x_0^*).

d) The IC constraint is not binding for the lowest type.

e) The IC constraints mean that each type is indifferent between his and the previous type's bundle, if the amounts are increasing in type, x_i^* > x_{i-1}^*.

f) The amounts are nondecreasing in type, x_{i+1}^* ≥ x_i^*.

See Appendix A for the proof. The fact that the IR constraint holds only for the lowest type results from the assumptions of constant reservation utility and utility increasing in the type number. The fact that each type is indifferent to the previous type's bundle results from the single-crossing condition. By putting the results together, we get the optimal prices as a function of the amounts x_i, i ∈ J,

t_0^* = V_0(x_0),   (3.5)
t_i^* = t_{i-1}^* + V_i(x_i) - V_i(x_{i-1}), \quad \forall i \in J \setminus \{0\}.   (3.6)

These prices are optimal in general, if the lowest type is served, x_0 > 0, and if the amounts are increasing in type, x_{i+1} > x_i, ∀i < H, i ∈ J, see Fig. 3.1.

[Figure: the tariff in the (quantity x, price t) plane with bundles (x_L, t_L^*), (x_M, t_M^*), (x_H, t_H^*).]

Figure 3.1: Illustration of optimal prices. The thick and the dashed lines are the optimal price function and the different agents' indifference curves, respectively.

Now we can utilize the optimal prices, Eqs. (3.5) and (3.6), to reduce the problem dimension, and forget the constraints. The principal needs to find the optimal amounts to be sold,

\max_{x,t}\, \pi(x,t) = \max_{x,t}\Big\{ p_0\,[t_0 - c(x_0)] + \sum_{i \in J\setminus\{0\}} p_i\,[t_i - c(x_i)] \Big\}
 = \max_{x}\Big\{ p_0\,[V_0(x_0) - c(x_0)] + \sum_{i \in J\setminus\{0\}} p_i\Big[ V_0(x_0) + \sum_{s=1}^{i} \{V_s(x_s) - V_s(x_{s-1})\} - c(x_i) \Big] \Big\}.   (3.7)

The necessary conditions of the maximization problem are3,4

\underbrace{p_i\,[V_i'(x_i^*) - c'(x_i^*)]}_{\text{deviation from first-best}} + \Big[\sum_{k=i+1}^{H} p_k\Big]\underbrace{\big[V_i'(x_i^*) - V_{i+1}'(x_i^*)\big]}_{\text{information rent}} = 0, \quad \forall i \in J.   (3.8)

3 For the highest type, we define that \sum_{k=H+1}^{H} p_k = 0. Equally, we could say that V_H'(x_H^*) - c'(x_H^*) = 0.
4 Now we can see the idea of the single-crossing property. The solution of the problem is given by very symmetric conditions instead of the maximization problem with some IR and a lot of IC constraints. In view of the adjustment process that will be presented in Chapter 5, we may use these conditions to calculate the optimal amounts in the bundles. Actually, the adjustment process can be seen as an iterative method for solving this kind of problem.


3.3 Interpretation of the Necessary Conditions

By examining the necessary conditions, we get upper limits to the optimal amounts for all types other than the highest:

V_i'(x_i^*) = -\Big[\sum_{k=i+1}^{H} p_k\Big]\,\big[V_i'(x_i^*) - V_{i+1}'(x_i^*)\big]\,/\,p_i + c'(x_i^*) > c'(x_i^*),   (3.9)

which means that the optimal amounts are less than in the socially optimal solution. The reason is that the information rent decreases when the amounts in the smaller bundles are decreased. Thus, the optimal amounts are not arbitrary, but each lies between zero and its upper limit. On the contrary, the optimal amount for the highest type is the same as in the first-best case, as was seen in Fig. 2.2. It is instructive to compare Eqs. (3.6) and (3.8) with the prices and equations in the case with a continuum of types, p. 265 in [68]. The two objectives of the principal can be seen in the necessary conditions. Firstly, the principal wants to maximize the profit from type i, V_i(x_i) − c(x_i), whose marginal effect is V_i'(x_i) − c'(x_i). Secondly, the principal minimizes the information rent of type i+1 due to type i, V_{i+1}(x_i) − V_i(x_i), whose marginal effect is V_{i+1}'(x_i) − V_i'(x_i). These two objectives are weighted in the necessary condition by the probabilities of arrival; the profit is weighted by p_i and the information rent by the higher types' probabilities of arrival, \sum_{k=i+1}^{H} p_k. In view of this, the necessary conditions are reasonable and intuitively verifiable.
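The following numerical sketch illustrates how the necessary conditions (3.8) pin down the amounts and how the price equations (3.5)-(3.6) then give the prices. The quadratic surplus functions, the quadratic cost, and the arrival probabilities are hypothetical and chosen so that an interior solution with increasing amounts exists; a simple bisection solves each one-dimensional condition.

```python
# Sketch: solve the necessary conditions (3.8) for the amounts and then the
# prices from (3.5)-(3.6).  All parameter values are hypothetical.
thetas = [1.0, 1.5, 2.0]          # type parameters, increasing in the type index
p      = [0.5, 0.3, 0.2]          # probabilities of arrival
gamma  = 0.5                      # cost c(x) = gamma * x^2 / 2, so c'(0) = 0

V  = lambda i, x: thetas[i] * x - 0.5 * x ** 2    # gross surplus V_i(x)
dV = lambda i, x: thetas[i] - x                   # V_i'(x)
dc = lambda x: gamma * x                          # c'(x)
H  = len(thetas) - 1

def foc(i, x):
    """Left-hand side of the necessary condition (3.8) for type i."""
    tail = sum(p[i + 1:])                               # higher types' probability mass
    rent = dV(i, x) - dV(i + 1, x) if i < H else 0.0
    return p[i] * (dV(i, x) - dc(x)) + tail * rent

def bisect(f, lo, hi, tol=1e-10):
    # Simple bisection; assumes f changes sign on [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if f(lo) * f(mid) <= 0 else (mid, hi)
    return 0.5 * (lo + hi)

x = [bisect(lambda z, i=i: foc(i, z), 0.0, max(thetas)) for i in range(H + 1)]
t = [V(0, x[0])]                                        # price equation (3.5)
for i in range(1, H + 1):
    t.append(t[i - 1] + V(i, x[i]) - V(i, x[i - 1]))    # price equation (3.6)

print("amounts:", [round(v, 3) for v in x])
print("prices: ", [round(v, 3) for v in t])
```

With these parameters the computed amounts are increasing in type, so the prices from (3.5)-(3.6) are indeed optimal; with other probability distributions the caveat discussed next applies.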

Notice that the necessary conditions of the problem, Eq. (3.8), may give amounts that are decreasing in type for some probabilities of arrival. Consider, for example, a case with three types of customers, where the distribution of types is such that the size of the middle class is very small. The necessary conditions and prices in Eqs. (3.6) and (3.8) will not produce the optimal bundles, because the assumptions of the proof of Prop. 1.e) are not satisfied. The necessary conditions say that it is profitable to sell a smaller amount to the middle type than to the lowest type, and the IC constraints that were assumed to be satisfied all the time are violated.


In the continuous case, one way to guarantee that the optimal amounts are increasing in type is to require the monotone-hazard-rate condition for the probability distribution, see p. 267 in [68]. The condition is restrictive, because there are distributions that do not satisfy the condition but still give a feasible solution. In the discrete case, we could replace this condition by adding constraints x_i ≥ x_{i-1} to the principal's optimization problem. From now on, we assume that the necessary conditions in Eq. (3.8) give the optimal solution. This is achieved, for example, by assuming that the probabilities of arrival are favorable.


Chapter 4

Learning

In our model, we relax the assumption that the principal knows the agent's possible utility functions. Instead, the principal serves a population of agents repeatedly, and learns enough information to construct the optimal tariff. Learning in game theory is, however, much more than eliciting preferences. Learning describes:

• Experimental behavior. Camerer [31, Chapter 6] defines learning as an observed change in behavior due to experience. We examine individual learning models in Sections 4.1-4.3 and population learning, which describes aggregate behavior, in Section 4.4. Individual learning models differ in the level of rationality, needed information, update rule, and choice rule, see Table 4.1.

• How equilibrium is reached. Learning models provide an alternative explanation to how and what equilibrium will be played. In Section 4.5, we study equilibrium refinements that explain which equilibria are self-enforcing in a given situation.

• Iterative solution method. Fictitious play was originally presented as an iterative solution method for zero-sum games and later for Nash equilibrium. Our model is also an iterative solution method for a game with very limited information.

Two basic learning models are introduced in Sections 4.1 and 4.2: reinforcement learning and belief-based learning. Though they are thought of as separate learning models, we show in Section 4.3 that they can be presented with one learning model. In these sections, we introduce various aspects that learning models may or should have. We also present conditions or requirements that the learning models should satisfy.

Table 4.1: A classification of basic learning models. Behavioral learning corresponds to animal behavior that requires minimal rationality, and cognitive learning corresponds to human behavior with more sophistication. The experience-weighted attraction model includes all these models.

                           Choice rule
  Level of rationality     Deterministic        Stochastic
  Behavioral               Sarin-Vahid model    Reinforcement Learning
  Cognitive                Fictitious Play      Stochastic Fictitious Play

Learning models describe very different strategic situations. Reinforcement learning describes behavioral learning in complex situations, where the players are not very sophisticated, because they may simply not have enough information or because of the cost of calculation. Belief-based learning describes cognitive learning, where the players optimize against the opponents' empirical distribution of actions. The players naively think that they play against a static environment, and they do not notice others' payoffs or the consequences of their own actions. Sophisticated learning models add more rationality, because players notice that the others are also learning. Rational learning describes hyperrationality, where the players can determine when to experiment and predict the future play of the game.

4.1 Reinforcement Learning

Reinforcement learning is founded on psychological evidence; human behavior favors actions that yield a positive result and inhibits actions that yield a negative result. This behavioral "rule" was noticed in animal and psychological tests in the first half of the 20th century, and the idea was brought to mathematical modeling in the Fifties. As experimental economics arose in the Nineties, reinforcement learning was used to explain human behavior in experimental games, where players choose their actions so that they improve their payoffs. As we will notice, the idea of reinforcement can be mathematically modeled in different ways. Reinforcement learning is studied in various fields of science, for example, artificial intelligence, machine learning, artificial neural networks, dynamic optimization, and adaptive control (learning automata). In artificial neural networks, reinforcement learning is classified as a kind of supervised learning, where an agent gets a response from the environment after choosing an action, and this reinforcement signal changes the agent's behavior. Reinforcement learning is studied from a computer-science perspective in Kaelbling et al. [104].

4.1.1 History

Thorndike [189, 190] formulated the law of effect, sometimes called the principle of reinforcement, in 1898. It states that an act that produces satisfaction in a given situation becomes associated with that situation, and when the situation recurs, the act is likely to recur, see also Herrnstein [88]. Thorndike's contribution was very influential in early psychological science, especially connectionism, where the approach is a direct application of associationism. Thorndike did research with animals and planned tests with children, but they were prohibited. Skinner [179, 180] developed the idea much further by proposing operant conditioning, which forms an association between a behavior and a consequence1. Operant conditioning is opposed to the earlier classical or Pavlovian conditioning, which describes associative learning in which there is no contingency between response and reinforcer.

According to Salmon [165], reinforcement models trace their origins to the psychological learning theories of Thurstone [192], Estes [53], Luce [121], and Bush and Mosteller [27]. Stochastic models of learning were introduced by Estes [53], who studied stimulus sampling theory, see also Estes and Burke [54], and by Bush and Mosteller [25, 26, 27]. The Bush-Mosteller reinforcement rule is defined as follows. If player n takes an action k at time t, then the propensity q_{n,k} is updated

q_{n,k}(t+1) = \begin{cases} q_{n,k}(t) + \alpha\,(1 - q_{n,k}(t)), & \text{if the outcome is successful} \\ q_{n,k}(t) - \beta\, q_{n,k}(t), & \text{if the outcome is a failure,} \end{cases}   (4.1)

where α, β ∈ (0, 1] represent the speed of learning or adaptation, given a successful outcome or a failure. The outcome is successful if the payoff π_n(t) exceeds an aspiration level a_n(t), see Simon [178] and Gilboa and Schmeidler [72]; otherwise, the outcome is a failure.

In the Luce γ-model [121], the new propensity is the sum of γ_k and the product of the old propensity and a factor β_k,

q_k(t+1) = \beta_k\, q_k(t) + \gamma_k,   (4.2)

where both β_k and γ_k are functions of the action k and the corresponding payoff. The probability of choosing action k, p_k, is determined in Luce's model by the linear response rule

p_k(t) = \frac{q_k(t)}{\sum_j q_j(t)},   (4.3)

which simply normalizes the propensities.

1 Reinforcement learning is sometimes called stimulus-response learning, because it forms an association between the animal's response and the stimulus that follows. Selten [176] calls it rote learning because only success and failure influence the choice probabilities.
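As a rough illustration of these early rules, the sketch below combines the Bush-Mosteller update (4.1) with Luce's linear response rule (4.3); pairing the two is an illustrative choice of this example, and the payoffs, aspiration level, and parameter values are hypothetical.

```python
import random

# Hypothetical two-action example: Bush-Mosteller update (4.1) with a fixed
# aspiration level, combined with Luce's linear response rule (4.3).
payoffs = {0: 0.2, 1: 0.8}     # expected payoff of each action (illustration only)
aspiration = 0.5               # outcomes above this level count as successes
alpha, beta = 0.3, 0.3         # speeds of adaptation for success and failure
q = [0.5, 0.5]                 # choice propensities

def choose(q):
    # Luce's linear response rule: probabilities proportional to propensities.
    r = random.random() * sum(q)
    return 0 if r < q[0] else 1

for t in range(200):
    k = choose(q)
    payoff = payoffs[k] + random.uniform(-0.1, 0.1)   # noisy realized payoff
    if payoff >= aspiration:                          # success: push q_k toward 1
        q[k] += alpha * (1 - q[k])
    else:                                             # failure: push q_k toward 0
        q[k] -= beta * q[k]

print("propensities after 200 rounds:", [round(v, 3) for v in q])
```

After a few hundred rounds the propensity of the action whose payoff beats the aspiration level dominates, which is exactly the law of effect at work.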

In the Seventies, Cross [41, 42] modeled learning in economic decisions and modified the Bush-Mosteller model. In the Cross model [41], also studied in Börgers and Sarin [29], the probability of choosing action k, p_k, is updated

p_k(t+1) = p_k(t) + a(\pi(t))\,[1 - p_k(t)],   (4.4)

where a(π) is some monotonically increasing function of the payoff magnitude π. Since, by the nature of probabilities, 0 ≤ p_k(t+1) ≤ 1, we must have 0 ≤ a(π) ≤ 1 as well. Following the Bush-Mosteller model, when p_k(t) is increased, the probabilities of all other actions are reduced in proportion to their present levels,

p_j(t+1) = p_j(t)\,[1 - a(\pi(t))], \quad \forall j \neq k.   (4.5)

4.1.2 Modern Reinforcement Learning

In this section, we study the Roth-Erev model, which is maybe the most famous modern reinforcement learning model. Besides, we study how Börgers and Sarin [30] included an evolving aspiration level in their model. This issue is important in view of modeling bounded rationality. Simon [178] claims that decision makers have limits in making fully rational decisions due to uncertainty. Instead of maximizing, the agents rather satisfy their needs by setting an aspiration level which separates successful and unsuccessful decisions, see also Fudenberg and Levine [66, p. 67]. If the aspiration level is not achieved, the agents try to change either their aspiration level or their decision.

The Bush-Mosteller model obeys the law of effect, but it neglects some psychological and other facts:

• The power law of practice [15] states that learning curves tend to be steep initially and then become flatter. Roth and Erev [162] capture this law with cumulative propensities [82, 119].

• The law of recency [204] states that the response that has most recently occurred after a stimulus is most likely to be associated with the stimulus. Roth and Erev [162] use a forgetting, or recency [52], parameter to capture this law.

• Experimentation [60, 66] depicts human behavior and allows the players to gain "enough" information. Roth and Erev [162] add an experimentation parameter to guarantee persistent local experimentation or error.

• Evolving aspirations are considered in Karandikar et al. [115] and in Börgers and Sarin [30].

• Sarin and Vahid [169] use expected payoffs instead of cumulative propensities. Players form subjective payoff assessments by generating approximations of the expected payoff using past payoffs [136].

• Rustichini [163] considers the informational aspects of reinforcement.

• Gentzkow's [70] updated propensities depend also on future payoffs.

• Laslier et al. [119] consider efficiency conditions of reinforcement rules.

In the Roth-Erev basic model [162], each player n has an initial propensity q_{n,k}(1) to play his pure strategy k. If player n plays pure strategy k at time t and receives a payoff x, then the propensities are updated

q_{n,k}(t+1) = q_{n,k}(t) + x,
q_{n,j}(t+1) = q_{n,j}(t), \quad j \neq k.   (4.6)

The probability that player n plays pure strategy k at time t, p_{n,k}(t), is2

p_{n,k}(t) = \frac{q_{n,k}(t)}{\sum_j q_{n,j}(t)}.   (4.7)

2 Notice that this equation corresponds to Luce's linear response rule.
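A minimal sketch of the basic model (4.6)-(4.7) for a single decision maker facing a stationary, hypothetical payoff environment: only the chosen strategy's propensity is reinforced by the realized payoff, and choices are made proportionally to the propensities.

```python
import random

# Roth-Erev basic model (4.6)-(4.7): reinforce only the chosen strategy by the
# realized payoff and choose proportionally to the propensities.
# The payoff distribution below is a hypothetical stationary environment.
mean_payoffs = [1.0, 2.0, 4.0]           # expected payoff of each pure strategy
q = [1.0, 1.0, 1.0]                      # initial propensities q_{n,k}(1)

def draw_strategy(q):
    r = random.random() * sum(q)
    acc = 0.0
    for k, qk in enumerate(q):
        acc += qk
        if r <= acc:
            return k
    return len(q) - 1

for t in range(500):
    k = draw_strategy(q)                              # choice rule (4.7)
    x = random.uniform(0.5, 1.5) * mean_payoffs[k]    # realized (positive) payoff
    q[k] += x                                         # update rule (4.6)

probs = [qk / sum(q) for qk in q]
print("choice probabilities after 500 rounds:", [round(p, 3) for p in probs])
```

Because the propensities cumulate, the probabilities change quickly at first and more slowly later, which is how the model captures the power law of practice.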

The basic model, corresponding to both the Bush-Mosteller and Cross models, obeys the law of effect, but Roth and Erev made some modifications:

• Extinction in finite time. They added an arbitrary "cutoff" probability µ to simply truncate probabilities that fall below the limit µ to zero. Whenever p_{n,k}(t) ≤ µ, they set p_{n,k}(t) = q_{n,k}(t) = 0.3

• Persistent local experimentation or error. A proportion ε of the payoff x is distributed to "adjacent" strategies. In global experimentation, this proportion is distributed over all pure strategies. If there are m adjacent strategies, the update rule is

q_{n,k}(t+1) = q_{n,k}(t) + (1 - \varepsilon)\,x,
q_{n,j}(t+1) = q_{n,j}(t) + \varepsilon x / m, \quad j \text{ adjacent to } k,
q_{n,j}(t+1) = q_{n,j}(t), \quad \text{otherwise.}   (4.8)

3 Erev and Rapoport [51] suggest quite the opposite; they add a technical parameter v > 0 to ensure that the propensities remain positive.

• Gradual forgetting. At the end of each period t, each propensity q_{n,j}(t) is multiplied by 1 − φ.

The introduction of the psychological facts, experimentation and recency, improves the model's descriptive and predictive power by facilitating responsiveness to a changing environment. The law of effect is not sufficient to describe learning in a game-theoretic situation, where players need to adapt to the changing environment. It is also worth noticing that the Roth-Erev model incorporates a consideration of memory, whereas the earlier models feature complete loss of memory. The earlier rules depend only on the current choice probabilities and the last payoff, and thus they can be visualized as Markov chains.

The Erev-Roth model [52] replaces the payoff increase with a reinforcement function R(x) = x − x_min, where x is the received payoff and x_min is the smallest possible payoff. They also introduce a generalization function E to represent the experimentation. The update is then given by

q_{n,j}(t+1) = (1 - \phi)\, q_{n,j}(t) + E_k(j, R(x)),   (4.9)

where φ is a forgetting parameter, and E is a function which determines how the experience of playing strategy k and receiving the reward R(x) is generalized to update each strategy j,

E_k(j, R(x)) = \begin{cases} R(x)\,(1 - \varepsilon), & \text{if } j = k \\ R(x)\,\varepsilon/(M - 1), & \text{otherwise,} \end{cases}   (4.10)

where ε is an experimentation parameter, and M is the number of pure strategies.

The Börgers-Sarin model [30] adds an endogenous aspiration level to steer the decision maker towards rationality and to affect the decision maker's long-run performance. The issue is also studied by Karandikar et al. [115], who assume that players use pure strategies and that the aspiration level moves occasionally at random. Instead, Börgers and Sarin study players using mixed strategies, and their model does not have such trembles.

In the Börgers-Sarin model, the set of possible states of the world Ω is finite, the payoffs are scaled to the open unit interval 0 < π < 1, and at time t the decision maker has an aspiration level a(t) ∈ [0, 1]. An aspiration formation rule is a pair of an initial aspiration level a(0) and a fixed parameter β measuring the speed of adjustment. In case of a success, if π(t) ≥ a(t), the learning rule is

p_k(t+1) = (1 - \alpha)\, p_k(t) + \alpha,
p_j(t+1) = (1 - \alpha)\, p_j(t), \quad j \neq k,
a(t+1) = (1 - \beta)\, a(t) + \beta \pi,   (4.11)

where α = |π(t) − a(t)|. And in case of a failure, if π(t) ≤ a(t), the learning rule is

p_k(t+1) = (1 - \alpha)\, p_k(t),
p_j(t+1) = (1 - \alpha)\, p_j(t) + \alpha, \quad j \neq k,4
a(t+1) = (1 - \beta)\, a(t) + \beta \pi.   (4.12)

4 In Karandikar et al. [115], aspirations are formed deterministically with probability 1 − η: a(t+1) = λa(t) + (1 − λ)π(t), where λ ∈ (0, 1) is a persistence parameter corresponding to the Börgers-Sarin parameter 1 − β. With probability η, the updated deterministic aspiration a is perturbed according to some density g(·, a).
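A sketch of the aspiration-based rules (4.11)-(4.12) for two actions: after each round the chosen action's probability moves up or down depending on whether the payoff beat the current aspiration level, and the aspiration itself tracks past payoffs. The payoffs and the parameter β are hypothetical values for this example.

```python
import random

# Börgers-Sarin learning (4.11)-(4.12) with an endogenous aspiration level.
# Two actions with hypothetical payoffs in (0, 1); beta is the aspiration speed.
payoff_mean = [0.3, 0.7]
p = [0.5, 0.5]            # mixed strategy over the two actions
a = 0.5                   # initial aspiration level a(0)
beta = 0.1

for t in range(1000):
    k = 0 if random.random() < p[0] else 1
    pi = min(max(payoff_mean[k] + random.uniform(-0.1, 0.1), 0.01), 0.99)
    alpha = abs(pi - a)
    if pi >= a:                       # success (4.11): reinforce the chosen action
        p[k] = (1 - alpha) * p[k] + alpha
        p[1 - k] = (1 - alpha) * p[1 - k]
    else:                             # failure (4.12): shift weight away from it
        p[k] = (1 - alpha) * p[k]
        p[1 - k] = (1 - alpha) * p[1 - k] + alpha
    a = (1 - beta) * a + beta * pi    # aspiration tracks realized payoffs

print("mixed strategy:", [round(v, 3) for v in p], " aspiration:", round(a, 3))
```

With two actions the probabilities stay normalized by construction, and the aspiration settles near the payoff level the decision maker has come to expect.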

4.1.3 Subjective Assessments

Instead of cumulative propensities, Sarin and Vahid [169, 170] assume that players choose their strategies according to their estimates of expected payoffs, see also Sarin [168] and Vostroknutov [201]. Each player explicitly has a belief5 that is scalar-valued and non-probabilistic. These beliefs represent the player's subjective assessment of the payoff the player would obtain from the choice of any strategy at any time. The beliefs are not about how the other players are likely to play, but rather reflect the payoffs the player expects from different strategies. Another difference to reinforcement learning is that in the Sarin-Vahid model the decision is not probabilistic but deterministic. At each time, the player chooses the strategy that he assesses to have the highest payoff. The player is assumed to be myopic; he does not take into account the future decisions he may face.

The Sarin-Vahid model assumes a finite number of strategies s_j ∈ S and a finite number of states of the world ω ∈ Ω. The subjective assessment of strategy j at time t is u_j(t). If the player chose strategy s_k at time t and the state of the world was ω, the player updates her subjective assessments

u_k(t+1) = [1 - \lambda_k(t)]\, u_k(t) + \lambda_k(t)\, \pi_k(\omega),
u_j(t+1) = u_j(t), \quad \forall j \neq k,   (4.13)

5 Thus, the Sarin-Vahid model belongs somewhere between reinforcement and belief-based learning.

where the adjustment parameter 0 < λ_k(t) < 1 is actually assumed to be a constant, i.e., λ_k(t) = λ.
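A sketch of the Sarin-Vahid rule (4.13): the assessments are exponentially smoothed realized payoffs, and the choice is the deterministic argmax of the current assessments. The payoff values, the initial assessments, and λ are hypothetical.

```python
import random

# Sarin-Vahid subjective assessments (4.13): only the chosen strategy's
# assessment is smoothed toward its realized payoff; choice is deterministic.
# The payoff means and lambda are hypothetical values for illustration.
payoff_mean = [1.0, 2.0, 1.5]
u = [2.5, 2.5, 2.5]          # optimistic initial assessments make every strategy get tried
lam = 0.2                    # constant adjustment parameter lambda

for t in range(300):
    k = max(range(len(u)), key=lambda j: u[j])            # deterministic, myopic choice
    realized = payoff_mean[k] + random.uniform(-0.2, 0.2)
    u[k] = (1 - lam) * u[k] + lam * realized              # update rule (4.13)

print("assessments:", [round(v, 3) for v in u])
```

Starting from optimistic assessments, each strategy is tried until its estimate falls, after which the strategy with the highest expected payoff is played almost exclusively.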

4.2 Belief-Based Learning

In reinforcement learning, the players make their decisions based on past payoffs from actions, regardless of the history of the opponents' play. In belief-based learning, the players are assumed to be rational; they are able to maximize and best respond. Whereas in reinforcement learning the players are assumed to be behavioralistic animals who respond to stimuli, in belief-based learning the players are usually naive, cognitive creatures who hold beliefs and maximize their payoff with regard to these beliefs. Belief learning dates back to Cournot [40], who suggested an adjustment process where the players choose best responses to the behavior observed in the previous period. Both Brown [22, 23], see also Brown and von Neumann [24], and Robinson [159] proposed fictitious play in the Fifties, where the players respond to the empirical frequency of opponents' play. Besides these two learning models, we will examine some modifications of fictitious play. Fudenberg and Kreps [61] introduced stochastic fictitious play, and Fudenberg and Levine [65, 67] improved its properties. Kalai and Lehrer [109] abandon the assumption of players' naivete and study rational learning, where the players learn to predict the future play of the game by revising their subjective beliefs by Bayesian updating.

4.2.1 History

Cournot [40] defined an equilibrium as a result of an adjustment process. In Cournot adjustment, the players naively assume that the opponents will choose the strategies they chose in the last period, and choose a best response to this belief. If the opponents of player i played strategy s_{-i}(t) at time t, then player i will choose his strategy

s_i(t+1) \in BR(s_{-i}(t)),   (4.14)

where BR(s_{-i}) = \arg\max_{s_i \in S_i} u_i(s_i, s_{-i}).

Both Brown [22, 23] and Robinson [159] studied an iterative procedure, fictitious play, to solve a finite two-person zero-sum game6. Since linear programming problems can be reduced to the solution of symmetric games, fictitious play also provides a method for general linear programming problems7. Later on, the method was interpreted as pregame introspection, or as a mental or cognitive tâtonnement. Thus, it provides an explanation of how players might come to play the equilibrium strategies.

In the Brown model [23], the players are imagined to be statisticians who keep track of the opponents' past plays and, in the absence of a more sophisticated calculation, each time choose the optimal pure strategy against the mixture represented by all the opponents' past plays. Brown argued that no matter where the statisticians began, they would iterate towards the solution of the game. A zero-sum game is presented by its payoff matrix A = (a_{ij}) ∈ R^{m×n}, where the first player chooses one of the m strategies and the second one of the n strategies. The element a_{ij} describes how much the second player pays to the first player. In the Brown model, both players reason similarly, but we will examine fictitious play from the viewpoint of the first player. The first player has at each time t a counter w_j(t) for each opponent's strategy j = 1, . . . , n, and the vector containing these counters is W(t) ∈ N^n. If the opponent played strategy k at time t, the counters are updated

w_k(t+1) = w_k(t) + 1,
w_j(t+1) = w_j(t), \quad j \neq k.   (4.15)

The first player chooses a strategy that maximizes his payoff,

s_1(t+1) \in \arg\max_{j=1,\ldots,m} e_j^T A\, W(t),   (4.16)

where e_j is the jth coordinate vector in R^m.

Robinson [159] proved the convergence of fictitious play in the limit. Robinson formulated the beliefs8 analogously with cumulative assessments for the player's own strategies, as in the previous section in the Sarin-Vahid model. In the Robinson model, these assessments are presented, for the first player, by a vector V(t) ∈ R^m. If the second player chose strategy k at time t, the assessments are updated

V(t+1) = V(t) + A_k,   (4.17)

where A_k is the kth column of A. The first player chooses a strategy that has the biggest assessment, the biggest element of V(t+1).

6 Von Neumann [198], see also von Neumann and Morgenstern [199], had proven that such zero-sum games have a solution.
7 Fictitious play and its modifications are still studied in solving two-person zero-sum games, see Washburn [203].
8 The beliefs can be defined by weights, as Brown, Fudenberg and Kreps [61], and Fudenberg and Levine [66] did; by expected payoffs, as Robinson and Shapley [177] did; or by beliefs, as Young [210] did. Young defined the beliefs as the state variable, and player i's belief about the probability of strategy k occurring is updated by p_{i,k}(t) = [(t − 1) p_{i,k}(t − 1) + I(s_k)]/t, where I(s_k) = 1 if the opponents chose strategy k at time t and I(s_k) = 0 otherwise.
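The counter formulation lends itself to a short sketch. Below, both players of a zero-sum game run fictitious play in the spirit of (4.15)-(4.16); the matching-pennies matrix is a hypothetical example for which the empirical frequencies should approach the mixed equilibrium (1/2, 1/2).

```python
# Fictitious play in a zero-sum game, in the spirit of (4.15)-(4.16).
# A[i][j] is the amount the column player pays to the row player.
# The matching-pennies matrix below is a hypothetical example.
A = [[1, -1],
     [-1, 1]]

m, n = len(A), len(A[0])
w_col = [1] * n        # row player's counters of the column player's past choices
w_row = [1] * m        # column player's counters of the row player's past choices

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

for t in range(10000):
    # Row player best responds to the empirical mixture of column choices.
    i = argmax([sum(A[i][j] * w_col[j] for j in range(n)) for i in range(m)])
    # Column player best responds (she minimizes the row player's payoff).
    j = argmax([-sum(A[i][j] * w_row[i] for i in range(m)) for j in range(n)])
    w_row[i] += 1
    w_col[j] += 1

total_r, total_c = sum(w_row), sum(w_col)
print("row frequencies:   ", [round(w / total_r, 3) for w in w_row])
print("column frequencies:", [round(w / total_c, 3) for w in w_col])
```

The chosen actions themselves keep cycling, but the empirical frequencies converge, which is exactly the sense of convergence Robinson proved.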

This figure is intentionally left out.

Figure 4.1: The framework for learning models [38]. Cheung and Friedman proposed that learning models consist of a learning rule, a decision rule, and a payoff function. Players hold beliefs, and choose their actions by the decision rules according to these beliefs. Actions generate outcomes by the payoff function, and the beliefs are updated by the learning rules according to the outcomes.

Cheung and Friedman [38], see Figure 4.1, constructed a one-parameter learning rule that includes both Cournot adjustment and fictitious play. They assume that each player i uses some discount factor γ_i on the older evidence. Player i memorizes the observed historical states h_i(t) = (s_i(1), . . . , s_i(t)), where the state is generated by the opponents' actions. According to these states, player i has the belief9

p_{i,j}(t+1) = \frac{I_t(s_j) + \sum_{k=1}^{t-1} \gamma_i^k\, I_{t-k}(s_j)}{1 + \sum_{k=1}^{t-1} \gamma_i^k},   (4.18)

where p_{i,j}(t) is player i's belief about the likelihood that the opponents will choose strategy s_j at time t, and I_t(s_j) is an indicator function equal to 1 if strategy s_j was chosen at time t and 0 otherwise. With γ = 0, the rule yields Cournot adjustment, γ = 1 yields fictitious play, and 0 < γ < 1 gives adaptive learning.

9 We use similar notation as Nyarko and Schotter [151] to clarify the original rule.
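A sketch of the one-parameter belief rule (4.18): beliefs are geometrically discounted empirical frequencies of the opponent's past choices. The observed history and the discount factors below are hypothetical.

```python
# Cheung-Friedman beliefs (4.18): discounted empirical frequencies of the
# opponent's choices.  gamma = 0 gives Cournot, gamma = 1 fictitious play.
# The observed history below is a hypothetical sequence of opponent actions.
history = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]   # opponent chose strategy 0 or 1
n_strategies = 2

def beliefs(history, gamma):
    t = len(history)
    num = [0.0] * n_strategies
    den = 1.0
    for k in range(t):                      # k = 0 is the most recent observation
        weight = gamma ** k if k > 0 else 1.0
        num[history[t - 1 - k]] += weight
        if k > 0:
            den += weight
    return [v / den for v in num]

for gamma in (0.0, 0.5, 1.0):
    print("gamma =", gamma, "->", [round(b, 3) for b in beliefs(history, gamma)])
```

With γ = 0 the belief puts all weight on the last observation, with γ = 1 it equals the empirical frequencies, and intermediate values interpolate between the two.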


4.2.2 Stochastic Fictitious Play

Shapley [177] showed that fictitious play need not converge; the play ends up in an asymptotically stable limit cycle in a 3×3 bimatrix game. Fudenberg and Kreps [61] suggested a solution to the problem of convergence to mixed strategy equilibria by stochastic fictitious play, see also Kaniovski and Young [114] and Hofbauer and Sandholm [92]. In stochastic fictitious play, the players choose their strategies probabilistically. The players are assumed to be boundedly rational and ε-best respond to the empirical distribution of opponents' strategies. Furthermore, Fudenberg and Levine [65] notice that certain randomization guarantees the player at least his minimax payoff10.

According to Fudenberg and Levine [66], other variations of fictitious play that permit randomization had been suggested earlier, but they lacked a satisfactory reason. Fudenberg and Levine suggest that the reason for randomization is given by Harsanyi's purification theorem [85], which explains mixed strategies as a result of unobserved payoff perturbations. The payoff to player i is u_i(s) + η_i(s_i), where η_i is a random vector of a certain form. Thus, it is assumed that the realized shock to player i's payoff depends on the action he chooses but not on the actions of the other players.11 With these random shocks, player i's best-response distribution BR_i(σ_{-i}), for each distribution σ_{-i} over the actions of i's opponents, is given by BR_i(σ_{-i})[s_i] = Prob[η_i s.t. s_i is a best response to σ_{-i}], which measures those random shocks η_i that give s_i as a best response to σ_{-i}, see also Hopkins [97]. With a certain form of random shocks, this best-response distribution is continuous. And this means that if the player's assessment converges, his behavior will also converge, which is not the case with the standard fictitious play.

Thus far, we have examined smooth fictitious play, in which the deterministic choice rule of fictitious play is replaced with a randomized smooth best-response distribution. Fudenberg and Levine [65]12 provide another explanation for randomization; a certain choice rule, cautious fictitious play, guarantees the player the safe outcome, almost surely at least his minimax payoff regardless of the opponents' play. This safety is guaranteed by a universally consistent rule, which requires that regardless of their opponents' play, the players almost surely get at least as much utility as they could have gotten had they known the frequency but not the order of observations in advance [66, p. 117], see also Fudenberg and Levine [67]. Universal consistency is satisfied, for example, by assuming that i) smooth fictitious play has utilities of the form u_i(σ) + λν_i(σ_i), where λ is the parameter describing the level of noise; and ii) the random shocks are given by

\nu_i(\sigma_i) = \sum_{s_i} -\sigma_i(s_i)\,\log \sigma_i(s_i).   (4.19)

Then the smooth best-response distribution can be explicitly solved,

BR_i(\sigma_{-i})[s_i] = \frac{\exp(u_i(s_i, \sigma_{-i})/\lambda)}{\sum_{r_i} \exp(u_i(r_i, \sigma_{-i})/\lambda)},   (4.20)

and this form is called the logistic probability function.

10 This property was originally studied by Blackwell [16] and Hannan [81]. According to Hart and Mas-Colell [86], a strategy of a player is called Hannan-consistent if it guarantees that his long-run average payoff is as large as the highest payoff that can be obtained against the empirical distribution of play of the other players.
11 It is known in psychology that behavior is random when two similar alternatives are judged, and the choice becomes deterministic as the alternatives become more distinct. With this interpretation, Thurstone's [191] law of comparative judgment gives a foundation to the described random utility model. [66, p. 107]
12 We will, however, follow the notation of Fudenberg and Levine [66].
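A short sketch of the logistic choice rule (4.20) evaluated against a fixed opponent mixture; the 2×2 payoffs, the opponent mixture, and the noise levels are hypothetical.

```python
import math

# Smooth (logistic) best response (4.20).  The payoffs and the noise
# parameter lam are hypothetical values for illustration.
U = [[3.0, 0.0],   # U[a][b]: the player's payoff for own action a, opponent action b
     [0.0, 2.0]]

def smooth_best_response(sigma_opp, lam):
    utils = [sum(U[a][b] * sigma_opp[b] for b in range(2)) for a in range(2)]
    weights = [math.exp(u / lam) for u in utils]
    total = sum(weights)
    return [w / total for w in weights]

for lam in (2.0, 0.5, 0.1):
    print("lambda =", lam, "->",
          [round(p, 3) for p in smooth_best_response([0.5, 0.5], lam)])
```

A small λ makes the rule close to the deterministic best response, while a large λ makes the choice nearly uniform, which is the sense in which the rule smooths fictitious play.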

Now that we have introduced an explicit probabilistic choice function, we note that stochastic fictitious play is actually similar to reinforcement learning [66, Chapter 4]. The difference is in Cheung and Friedman’s learning and decision rules. In reinforcement learning only one of the propensities is updated, whereas in stochastic fictitious play all propensities are updated13 . The decision rules based on propensities are different, but both are probabilistic.

4.2.3 Interpretation of Fictitious Play

As mentioned in Section 4.2.1, fictitious play has two interpretations, see Hendon [87]:

• Learning. The game is played repeatedly, and the players respond to the opponents' empirical play. This relaxes the assumptions of game theory; the players might not know the opponents' utility functions, or they might not have the same information about the game.

• Preplay reasoning. The player tests his theory by asking whether the other players would act in accordance with the theory. If there is a difference, the theory is changed in the direction of the best reply. If the player's theory converges, the convergence point gives the prediction of the play of the game. When the game is actually played, the player does his part by choosing the strategy of the convergence point.

Hendon [87] studies fictitious play in extensive form games. He explains that the two above interpretations give the same dynamics for strategic form games, but not for extensive form games. He also shows that the convergence point of both dynamics is a sequential equilibrium, and for generic games of perfect information, a fictitious play sequence always converges to the unique subgame perfect equilibrium. The difference between the interpretations in extensive form games is that in learning, the beliefs are updated only at the information sets that were along the path of play, whereas in preplay reasoning, the beliefs are updated at all information sets, because the player can calculate the opponents' imaginary play also off the path of play. Thus, we need to distinguish the learning model and the iterative solution method in extensive form games.

13 Actually, Robinson's model suggests that the beliefs on the opponents' strategies can be interpreted as propensities on the player's own choices.

4.2.4 Bayesian Learning

Thus far, we have studied boundedly rational or naive learning models, where players do not attempt to influence other players' future actions. Now, we consider Bayesian learning, also known as rational [17] or sophisticated learning, where players hold subjective beliefs about their opponents' strategies, maximize their payoff relative to these beliefs, and revise the beliefs in light of new information according to Bayes' rule and a correct statistical model, see Gilli [74, 75]. The research on rational learning is mostly based on Kalai and Lehrer [109]. They show that if the subjective beliefs are compatible with the true strategies chosen, then rational learning will lead in the long run to accurate prediction of the future play of the game. Thus, players will eventually play according to a Nash equilibrium of the repeated game14. Sobel [181, p. 15] mentions two limitations to Kalai and Lehrer's results. First, the agents' behavior may be far from equilibrium for a long time even if the process asymptotically converges. And second, Nachbar [144] has shown that if a player is able to predict future play in nontrivial environments, then he will not have the ability to optimize. But as Blume and Easley [17] put it: the issue is not whether rational learning will occur, but what result it produces.

Kalai and Lehrer [109] study an infinitely15 repeated game with perfect monitoring and discounting. In every period each player chooses his individual action according to a fixed matrix which specifies his payoff for every action combination taken by the group of players. Perfect monitoring means that before making the choice of a period's action, the player is informed of all the previous actions taken. Each player has a discount factor to evaluate future payoffs, and the player's goal is to maximize the present value of his total expected utility. Kalai and Lehrer depart from the standard assumption of game theory by not requiring that the players have full knowledge of each others' strategies, nor do they have a commonly known prior distribution on the unknown parameters of the game. They replace this assumption with a weaker one of compatibility of beliefs with the truth; the players' subjective beliefs do not assign zero probability to events that can occur in the play of the game. They say that the beliefs contain a grain of truth. In addition to perfect monitoring, knowledge of own payoff matrices, and compatibility of beliefs with the truth, the model contains several additional restrictive assumptions. First, it is assumed that the players' actual strategies are chosen independently. Second, independence is also imposed on the subjective beliefs of the players. Third, the assumption that players maximize their expected payoffs is quite strong for infinitely repeated games.16

Kalai and Lehrer's result is that the behavior induced by a subjective equilibrium approximates the behavior of an ε-Nash equilibrium, and this is proven in Kalai and Lehrer [110]. It is interesting to notice that the subjective rational approach overcomes the difficulty of experimentation17, because every action in the model, including experimentation, is evaluated according to its long-run contribution to expected utility. Thus, a player will experiment when he assesses that the information gained will contribute positively to the present value of his expected payoff.18

Rational learning provides an explanation of how players might learn to make equilibrium decisions through repeated experiences. However, finding beliefs that hold a grain of truth requires the same kind of fixed-point solution that finding a strategic equilibrium does [66, p. 238]19. The finding of correct conjectures in Nash equilibrium is replaced with a new and weaker equilibrium condition, where the game is repeated, players may have some disagreements along the course of the play, and ultimately the disagreement disappears.

14 Jordan [102] obtains similar conclusions, deriving the players' beliefs as a Bayesian Nash equilibrium based on a common prior distribution over player characteristics.
15 Kalai and Lehrer's conditions do not ensure convergence in a finitely repeated game, see Sandroni [167].
16 Fudenberg and Levine [63] study equilibria as the outcome of a learning process. They introduce the concept of self-confirming equilibrium. They show three reasons why a self-confirming equilibrium outcome may not be the outcome of a Nash equilibrium: (i) two players have inconsistent beliefs about the play of a third at an information set that is relevant to both of them; (ii) a player's subjective uncertainty about his opponents' play may induce a correlated distribution on their actions, even though he knows that their actual play is uncorrelated; and (iii) each player can have heterogeneous beliefs, that is, different beliefs may rationalize each pure strategy. Actually, Kalai and Lehrer's private equilibrium [108] is a refinement of self-confirming equilibrium [63, p. 524]. Now, we may compare Kalai and Lehrer's assumptions with the three reasons: perfect monitoring removes inconsistent beliefs, the assumption of uncorrelatedness removes correlated equilibrium, and they both remove the heterogeneous beliefs.
17 The players may continue to hold incorrect beliefs about opponents' play unless they engage in a sufficient amount of experimentation [64, p. 547]. The problem of experimentation is discussed in Fudenberg and Kreps [60] and Fudenberg and Levine [64]. We saw the same effect in fictitious play; the deterministic version did not converge, but the stochastic version had enough experimentation and did converge.
18 Rational learning seems to require a lot of reasoning from the players. Thus, Eichberger [47] suggests that Bayesian learning is relaxed a little. He studies naive Bayesian learning, which is based on two principles: (i) players behave optimally given their beliefs about opponents, and (ii) players use Bayesian updating rules to learn the correct distribution of the opponents' pure strategy play. He shows that converging pure strategy choices lead to a Nash equilibrium (i) if either prior beliefs have full support, or (ii) if, for every neighborhood of a Nash equilibrium strategy, the absolutely continuous part of the prior distribution has positive measure.
19 As noted earlier, Nachbar [144] showed that it is not sufficient that players simply have cautious beliefs, which put positive prior probability on the full support of some set of conventional strategies. Nachbar showed that players can both predict the future play and optimize only if beliefs are actually in equilibrium at the start of the repeated game. Thus, Kalai and Lehrer's result holds only if the beliefs are in equilibrium. Actually, Jordan [102, 101] studied exactly this case. Jordan showed that players learn to play Nash equilibrium when the initial expectations are given by a common prior distribution over player types.

Individual Learning

In the last section, we noticed that Cournot learning and fictitious play could be presented by Cheung-Friedman model. We also saw in Robinson model that the beliefs could be presented with propensities like in reinforcement learning. Thus, belief-based models and reinforcement learning are quite similar, but the difference is the update of propensities and the choice of action based on these propensities. These learning models have been studied and thought as two separate learning models until recently. We will examine a learning model that has the previous models as special cases. We will also study rule learning that generalizes learning models. We conclude this section in a summary that describes a general learning model in view of the is discussed in Fudenberg and Kreps [60] and Fudenberg and Levine [64]. We saw the same effect in fictitious play; the deterministic version did not converge, but the stochastic version had enough experimentation and did converge. 18 Rational learning seems to require a lot of reasoning from the players. Thus, Eichberger [47] suggests that Bayesian learning is relaxed a little. He studies naive Bayesian learning, which is based on two principles: (i) players behave optimally given their beliefs about opponents, and (ii) players use Bayesian updating rules to learn the correct distribution of the opponents’ pure strategy play. He shows that converging pure strategy choices lead to a Nash equilibrium (i) if either prior beliefs have full support, or (ii) if, for every neighborhood of a Nash equilibrium strategy, the absolutely continuous part of the prior distribution has positive measure. 19 As noted earlier, Nachbar [144] showed that it is not sufficient that players simply have cautious beliefs, which puts positive prior probability on the full support of some set of conventional strategies. Nachbar showed that players can both predict the future play and optimize only if beliefs are actually in equilibrium at the start of repeated game. Thus, Kalai and Lehrer’s result holds only if the beliefs are in equilibrium. Actually, Jordan [102, 101] studied exactly this case. Jordan showed that players learn to play Nash equilibrium when the initial expectations are given by a common prior distribution over player types.

44

previous two sections. We will also discuss situations where the players have very limited information or collecting information is costly. In these cases, the sophisticated learning models are not available, and reinforcement learning must be used.

4.3.1

Experience-Weighted Attraction Learning

Camerer and Ho [32, 33], see also Camerer [31] and Camerer et al. [34, 35], proposed experience-weighted attraction (EWA) learning that incorporated both reinforcement and belief-based learning models. EWA model has many parameters, but it consists mainly of update of the attraction of strategies and the weight of past experience. The attractions corresponds the propensities in reinforcement learning, and the weight of past experience matches the memory of belief-based learning. The difference of deterministic and probabilistic choice is noticed by a parameter in the choice rule. Camerer and Ho denote the attraction of player i’s strategy j after period t has taken place by Ai,j (t) and the weight of past experience by N(t). The first update rule is N(t) = (1 − κ)φN(t − 1) + 1,

t ≥ 1,

(4.21)

where φ is the decay rate that reflects a combination of forgetting and the degree to which players realize that other players are adapting, and the parameter κ is a growth rate of attractions, which reflects how quickly players lock in to a strategy20 . The attraction update is a combination of fictitious play and reinforcement learning. Camerer and Ho introduce law of simulated effect, which weights hypothetical payoffs that unchosen strategies would have earned, like in Robinson model. The difference is that the chosen strategy is additionally reinforced compared to the unchosen strategies. The attraction is updated by actual effect

old attraction

}| { }| { z z φN(t − 1)Ai,j (t − 1) +[δ + (1 − δ)I(si,j , si (t))]πi (si,j , s−i (t)) , Ai,j (t) = N(t)

(4.22)

where si,j is player i’s strategy j, si (t) the strategy chosen by i at time t, I(x, y) is an indicator function that equals 1 if x = y and 0 if x 6= y, and hypothetical payoffs are reinforced with factor δ and the chosen strategy with an extra 1 − δ.

The probability of player i choosing a strategy j at time t, pi,j (t), depends on the 20

Camerer and Ho [32, 33] define ρ = (1 − κ)φ and call it the rate of decay for experience. The new κ notation makes the difference of averaged or cumulated attractions.

45

attraction Ai,j (t). However, there is no unique way of determining the probabilities. Camerer and Ho suggest three forms: logit, power, and probit. The logit function21 is given by eλAi,j (t) Pi,j (t + 1) = Pmi λA (t) , i,k k=1 e

(4.23)

where the parameter λ measures the sensitivity of players attractions, and mi is the number of player i’s choices. The logit form is commonly used in choice under risk and uncertainty, and it is invariant to adding a constant to all attractions [33, p. 835]. The logit form was originally studied in explaining human behavior by McFadden [131]. The power function is given by Ai,j (t)λ Pi,j (t + 1) = Pmi . λ k=1 Ai,k (t)

(4.24)

The power form is invariant to multiplying all attractions by a constant. A special case of power form is λ = 1, which gives the Roth-Erev choice rule, and it was originally proposed by Luce. The third, probit function is given by normal distribution, and it was studied, for example, by Cheung and Friedman [38]. The question is which of the forms fit to the empirical data the best. Finally, Camerer and Ho show that Cournot learning, fictitious play, weighted fictitious play, cumulative reinforcement, and average reinforcement22 can be presented with some parameters of EWA model [34]. Figure 4.2 presents the EWA learning cube. Furthermore, EWA model is extended in Camerer et al. [34, 35] by adding sophistication and strategic teaching23 . 21

We met this form in Fudenberg and Levine’s [65, 66] stochastic fictitious play, and it guaranteed safe outcome regardless of opponents’ play. 22 Weighted fictitious play is Cheung and Friedman’s adaptive play, cumulative reinforcement is Roth and Erev’s model, and average reinforcement is Sarin and Vahid’s model. 23 These innovations overcome the problem of naivete, see Fudenberg and Levine [66, Chapter 8]. Sophistication means that some players understand how others are learning. This can be added by assuming that a fraction α of players are sophisticated. Sophisticated players think that 1 − α of players are adaptive, that is, they respond only to their own previous experience and ignore others’ payoff information, and the remaining players are sophisticated, see quantal response equilibrium by McKelvey and Palfrey [132, 133]. The teaching is seen when sophisticated players are matched with the same players repeatedly and they have incentive to teach adaptive players by choosing strategies with poor short-run payoffs. And this will change what adaptive players do in a way that benefits the sophisticated player in the long run. [34]

46

This figure is intentionally left out.

Figure 4.2: The EWA learning cube [34]. The edges δ = 0, κ = 1, N(0) = 1 and δ = κ = 0, N(0) = 1/(1 − φ) corresponds cumulative and average reinforcement, respectively. Especially, we get Roth-Erev model when power form is used with λ = 1. The edge δ = 1, κ = 0 and λ = ∞ [55, p. 614], corresponds weighted fictitious play; Cournot learning when φ = 0 and fictitious play when φ = 1.

4.3.2

Rule Learning

Thus far, we have considered two fundamental learning models and shown that these models are actually special cases of one general learning model. Some authors have gone even further by suggesting even more general learning models that might explain experimental results or include more psychological laws of human behavior24 . One of these authors is Stahl [184, 185, 186], who suggested rule learning. The framework of rule learning is presented in Figure 4.3.25 The first step to use this conceptual framework is to represent the huge space of potential behavioral rules26 . Stahl suggests that one postulates a small set 24

Thorndike specified three conditions that maximize learning: the laws of effect, recency and exercise. The law of exercise, which is not included in any learning model we are aware of, states that stimulus-response associations are strengthened through repetition. By constructing an experiment that supports this law, we could define a new learning model that would predict human behavior even better than the existing models. Besides these laws of learning, there are the laws of readiness, association and intensity. 25 By remembering the Cheung-Friedman model and Figure 4.1, we can see the similarities between these two models. Actually, Stahl adds another layer in the decision process. Instead of putting propensities on actions, Stahl puts propensities on rules that are mapped into actions. Of course, the additional complexity will enable rule learning to explain experimental results better. 26 The family of behavioral rules includes Nash behavior, maximax behavior, modified fictitious play, reinforcement learning, and adaptive expectations [184, p. 113]. Rapoport et al. [155, p. 253] explain that players learn over time which rules are appropriate in given situation.

47

This figure is intentionally left out.

Figure 4.3: The framework of rule learning [184]. A behavioral rule ρ is a function that maps from the available information, the game and any history of play, to the family of probability measures on the actions available in the game. A propensity φ is a probability to use a specific behavioral rule. The propensities are updated by evaluating the rules after outcomes have been generated in the game. of archetypal rules and a way of combining these rules to span a large space of plausible and empirically relevant rules. The second step is to update the propensities. While reinforcement learning updates the propensities of actions, Stahl suggests that rule learning reinforces behavioral rules. Probabilities increase for rules that would have yielded higher payoffs in the recent past, and vice versa. Stahl distinguishes two kinds of learning in the general model. First, the players can learn in the sense of acquiring relevant data but sticking to the same behavioral rule. For example, in a game against nature, Bayesian updating can be viewed as a fixed behavioral rule that learns via acquiring new data. Stahl calls this kind of learning as data learning. Second, in rule learning, the players can learn by assessing the relative performance of the behavioral rules and switching to better performing rules. To conclude rule learning, we refer to Stahl [186, Conclusions]: Rather, if you want a model (and a single set of estimated parameters) that predicts population frequencies well across a variety of symmetric normal-form games and environments in which players receive population feedback (in contrast to limited pairwise experience), then you should pick the horse-race winner: Rule Learning.

4.3.3

Summary

Now, based on the previous learning models and especially Cheung-Friedman model, we describe a general individual learning model. It is natural to think that the decision maker assigns some subjective assessments or propensities to the alternatives. The choices are made according to these propensities either 48

deterministically or probabilistically. The advantage of probabilistic choice is the potential experimentation; you will never know for sure, if you do not try it out. The propensities are updated as outcomes are generated and observed by the decision maker. It may not be sufficient to update the propensities directly, but instead additional layer(s) can be used, like in rule learning. For example, the decision maker may judge the alternatives with some criteria, and each alternative has weights on these criteria. Thus, the update consists of revising both the weights of criteria and the weights of alternatives. Depending on the situation, the propensities are updated according to either law of actual or simulated effect. If the decision maker can reliably figure out the possible outcomes for unchosen actions, all the propensities should be updated, like in belief-based models. Otherwise, only the propensity of chosen action will be updated, like in reinforcement learning. According to experimental results, Feltovich [55] finds that belief-based models describe aggregate behavior better, while the reinforcement learning makes more accurate predictions of behavior at individual level. Many authors have noticed that reinforcement learning describes decision making better in complex environment, where the information is not simply available; actually this is one motivation of Sarin-Vahid model. Van Huyck et al. [196], Chen and Khoroshilov [37], and Vostroknutov [201] study learning under limited information. Rustichini [163] distinguishes that an individual learning in isolation has partial information, because he can only observe the payoff to the action he has chose; and in social learning, which we will study later on, he has full information, because he can learn from the action of the others. Rustichini finds that the information has crucial importance to efficiency of update procedures27 . Furthermore, Easley and Rustichini [45] provide an axiomatic foundation for decision making in a complex environment. Sobel [181] suggests that agents can collect information passively or actively, see also Kandori et al. [113, p. 31]. Passive information collecting takes place as an outcome of an adaptive process, and agents are unable to influence the quantity or quality of the information they obtain. Learning is active when agents’ choices determine the flow of information, and it will be costly if it takes resources to acquire or process information. In Sobel’s models, the cost of learning is not a direct cost associated with purchasing information, but it is associated 27

27 Linear procedures, or in our terminology the Roth-Erev basic model, always converge to the optimal action in the case of partial information, and do not in the case of full information. Exactly the opposite is true in the case of exponential procedures.


with knowingly making a suboptimal decision in one period in order to obtain information that will improve future decision making, see also El-Gamal et al. [50]. On the contrary, Friedman [59] distinguishes learning models by the level of rationality. Reinforcement learning assumes the lowest possible level of rationality, and rational learning accepts only perfectly rational individuals who use Bayesian updating. Belief-based models fall somewhere between these extremes. The general learning model should also be based on psychological facts. One is the reinforcement of good choices. Another is cognitive limits; the decision maker has limited computational capacity and limited memory. Actually, forgetting can also be seen as improved adaptation to a changing environment. Furthermore, the general learning model should satisfy some efficiency conditions. Laslier et al. [119, p. 3] state three conditions:

• The learning process must be optimizing in a stationary environment, which means that it converges to the utility maximizing action.

• It must be flexible in a nonstationary environment, which allows for extended exploration at its beginning and avoids locking in early to some undesirable action or cycle.

• It must be progressive in any kind of environment, allowing a smooth transition between exploration and exploitation so that experimentation slows down.28

The result is that the cumulative proportional reinforcement rule, or basically the Roth-Erev model, satisfies these three efficiency conditions, and that fictitious play does not satisfy the second condition, because of the cycling discussed in the previous section. Furthermore, smooth fictitious play does not satisfy the first condition, because it cannot converge towards a pure action; hence it excludes maximizing actions. We have not and will not discuss experimental results in greater detail. We suggest studying Roth and Erev [162], Erev and Roth [52], Sarin and Vahid

28 Camerer et al. [34] also give properties that empirical models should have: i) the model should use all the information that the players have and use, ii) the parameters of the model should have psychological interpretations, iii) the model should be as simple as possible, iv) the model should fit well, both in- and out-of-sample, judging by statistical criteria which permit model comparison, and v) the model should be tractable enough to explore its theoretical implications. EWA does well on all criteria, but is incomplete by the information-use and psychological fidelity criteria, because it does not explain how players' information about the payoffs of others is used, and it does not allow the sort of anticipatory learning which is plausible for intelligent experienced players.


[169, 170], Mitropoulos [136], Camerer and Ho [33], and Camerer [31]. For a comparison of models, see Feltovich [55], Salmon [165], and Hopkins [98]. Feltovich [55] concludes that the superiority of a model depends not only on the experimental data set used, but also on the criterion of success. The difference between reinforcement learning and belief-based models is much smaller than between either of them and equilibrium play.

4.4 Evolutionary Learning

Instead of studying the strategic interaction of individuals, evolutionary learning describes how the frequencies of strategies within a population change over time, according to the strategies' success, see Friedman [58]. Each player in the population is programmed to use some strategy29, and strategies with high payoffs will be reinforced. The payoffs depend on the other players' actions and hence on the frequencies of the strategies within the population. The first evolutionary process was replicator dynamics, where the growth rate of a strategy is given by the payoff difference of that strategy and the current average payoff. But there are many other game dynamics that describe evolution, and they have different foundations. For evolutionary learning, see Weibull [205] and Hofbauer and Sigmund [93, 94]. Evolutionary models are not structural models of learning or bounded rationality [123, p. 1355]. The individuals in the game are not explicitly modeled. Thus, it is questioned whether evolutionary models should be used in social learning, and whether it is worth the effort to modify them to suit the situation. An analogy between human behavior and evolutionary models was suggested by Dawkins [43], who argued that ideas could be spread in a human population through a process of cultural evolution. Dawkins called the ideas which evolve in this process memes. Maynard Smith [127] has expressed criticism of the idea of memes. He argues that a science of population genetics is possible because Mendel's laws are known, but that no science of memetics is yet possible. [28] Another interpretation of evolutionary models in economics is to describe

29 Börgers and Sarin [29] explain that decision makers are usually not completely committed to just one set of ideas, but rather several possible ways of behaving are present in their minds simultaneously. Which of these predominates depends on the experience of the individual. The change of the population of ideas is analogous to biological evolution.


individual behavior by a process of trial and error learning which individuals use to search for good behavioral rules. This idea is formalized in Börgers and Sarin [29], who showed a formal relation between reinforcement learning and replicator dynamics. Thus, Börgers [28, p. 1383] summarizes: evolutionary models in economics are no more than reduced versions of special learning models; and he does not see good reasons to give special attention to evolutionary models.

4.4.1 History

The population approach was already anticipated by Nash [146]. He assumed that there is a population of participants for each position of the game. Nash described the evolution of the population by a differential equation, and he claimed that this dynamics leads to the play of Nash equilibrium. This mass action interpretation led to the conclusion that the mixed strategies representing the average behavior in each population form an equilibrium point. [206] Evolutionary game theory was developed by Fisher [56], who studied the sex ratio in mammals. Fisher explained the approximate equality of the sex ratio by measuring individual fitness in terms of the expected number of grandchildren, and hence individual fitness depends on the distribution of males and females in the population. However, evolutionary game theory really began when Maynard Smith [126, 128], see also Maynard Smith and Price [129], argued that many interactions could be interpreted as strategic situations, and that mutation and natural selection would tend to push organisms towards optimal play [187]. Maynard Smith and Price defined an equilibrium concept, an evolutionary stable strategy, which gives a strategy that is robust to evolutionary pressure. A population playing an evolutionary stable strategy is uninvadable by any other strategy. If a small portion of mutant strategies enters the system, the incumbent strategy survives better than the mutant strategy. Let us denote the incumbent strategy x and some mutant strategy y ≠ x, where both strategies are certain pure or mixed strategies in the game. Then, strategy x is an evolutionary stable strategy if for each strategy y ≠ x the following holds for all sufficiently small ε > 0

u(x, (1 − ε)x + εy) > u(y, (1 − ε)x + εy),    (4.25)

where u(x, y) is the payoff strategy x gets in population y, or represents the fitness of a strategy x in an environment y. It can be shown that this condition is equivalent to the following conditions

u(x, x) ≥ u(y, x)    (equilibrium)
u(x, x) = u(y, x) ⇒ u(x, y) > u(y, y)    (stability).    (4.26)
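As a numerical illustration of conditions (4.26), the sketch below checks whether a given (possibly mixed) strategy x is evolutionary stable in a symmetric two-player game with payoff matrix A, where u(x, y) = xᵀAy. It tests the conditions only against a finite sample of mutant strategies, so it can refute but not strictly prove stability; the Hawk-Dove payoff matrix is an assumed example.

```python
import numpy as np

def u(x, y, A):
    """Payoff (fitness) of strategy x against population/strategy y."""
    return float(x @ A @ y)

def is_ess(x, A, mutants, tol=1e-9):
    # Conditions (4.26): for every mutant y != x,
    #   u(x,x) >= u(y,x)                      (equilibrium), and
    #   u(x,x) == u(y,x)  =>  u(x,y) > u(y,y) (stability).
    for y in mutants:
        if np.allclose(x, y):
            continue
        if u(y, x, A) > u(x, x, A) + tol:
            return False                      # equilibrium condition fails
        if abs(u(y, x, A) - u(x, x, A)) <= tol and u(x, y, A) <= u(y, y, A) + tol:
            return False                      # stability condition fails
    return True

# Hawk-Dove example (V=2, C=4): the mixed strategy (1/2, 1/2) should pass.
A = np.array([[-1.0, 2.0],
              [0.0, 1.0]])
mutants = [np.array([p, 1 - p]) for p in np.linspace(0.0, 1.0, 101)]
print(is_ess(np.array([0.5, 0.5]), A, mutants))   # expected: True
```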

The replicator dynamics of Taylor and Jonker [188] was the first explicit model of a selection process that specified how the pure strategies in the population evolve over time [206]. In replicator dynamics, a large population of individuals, who are randomly matched over time, play a finite symmetric two-player game. The individuals play only pure strategies, and a population state is a distribution over pure strategies. Mathematically, this state is equivalent to a mixed strategy in the game. The payoffs in the game represent the biological fitness, and the number of individuals using a pure strategy will grow exponentially at a rate that equals the payoff to that strategy when played against the current population state. Thus, it follows that the growth rate of the frequency of any pure strategy equals the difference between that pure strategy's payoff and the average payoff in the population. Now, the evolution of pure strategy i ∈ S in a state x is30

ẋ_i = x_i [u(i, x) − ū],    (4.27)

where ū = Σ_{k∈S} x_k u(k, x) is the average payoff in the population.
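A discrete-time Euler sketch of Eq. (4.27) for a symmetric game is given below; the rock-paper-scissors payoff matrix, the step size, and the starting state are illustrative assumptions.

```python
import numpy as np

def replicator_step(x, A, dt=0.01):
    """One Euler step of Eq. (4.27): dx_i/dt = x_i [u(i, x) - u_bar]."""
    payoffs = A @ x                 # u(i, x) for every pure strategy i
    u_bar = x @ payoffs             # average payoff in the population
    return x + dt * x * (payoffs - u_bar)

# Rock-paper-scissors: the interior state (1/3, 1/3, 1/3) is a rest point,
# and trajectories of the continuous dynamics cycle around it.
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])

x = np.array([0.6, 0.3, 0.1])
for _ in range(10000):
    x = replicator_step(x, A)
print(x, x.sum())                   # frequencies remain on the simplex
```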

In contrast to belief-based learning, which involves instant movement to the best reply, replicator dynamics describes gradual movement from worse to better strategies [29]. This gradual movement may permit the replicator dynamics to converge in games where the best response dynamics does not converge. On the other hand, there are examples where fictitious play converges, but the replicator dynamics cycles. Börgers and Sarin [29] show that if a continuous time limit is constructed, the difference between the replicator dynamics and reinforcement learning disappears, and both models converge to the same deterministic continuous time limit. Thus, the first intuition that these two models are closely related is proven. A stable steady state of the replicator dynamics is a Nash equilibrium, see Fudenberg and Levine [66, Chapter 3]. Furthermore, it can be shown that an evolutionary stable strategy is a refinement of Nash equilibrium [58], and every evolutionary stable strategy is an asymptotically stable steady state of the replicator dynamics. The converse need not be true. But Bomze [18] has shown that if the dynamics are modified so that mixed strategies can be inherited as well, the converse is true. Thus, evolutionary dynamics give an interesting alternative to the study of

30 See Ritzberger and Weibull [158, p. 7] for a more general definition.


equilibrium.

4.4.2 Other Evolutionary Dynamics

Besides replicator dynamics, which reinforces strategies doing better than average, Hofbauer and Sigmund [94, Section 3] define other evolutionary dynamics:

• Imitation dynamics allows strategies to be transmitted within the population through imitation, see also Schlag [173]. Weibull [205] suggests that individuals occasionally adopt another strategy in the population with a certain probability which can depend on the payoff difference.

• Best response dynamics allows more sophistication, see Gilboa and Matsui [71], Matsui [125], Hopkins [97], Ely and Sandholm [49] and Sandholm [166]. In a large population, a small fraction of the players revise their strategy by choosing a best reply BR(x) to the current mean population strategy x. The dynamics is given by

ẋ ∈ BR(x) − x.    (4.28)

Since best replies are not in general unique, the dynamics is a differential inclusion rather than a differential equation. Actually, the best response dynamics arises as a continuous approximation of discrete fictitious play.

• Smoothed best replies allows the best responders to make errors. This is actually a continuous version of smoothed fictitious play, see Fudenberg and Levine [66, Section 2.8],

ẋ = BR(x) − x,    (4.29)

where BR(x) is here a smoothed best reply to population x, see Eq. 4.20.

• Brown-von Neumann-Nash dynamics defines an innovative better reply. Strategies with payoff below average decrease in frequency, while strategies with payoff above average increase, as long as they are rare enough. The dynamics is given by

ẋ_i = k_i(x) − x_i Σ_{j=1}^{n} k_j(x),    (4.30)

where k_i(x) = max(0, u(i, x) − ū) denotes the positive part of the excess payoff for strategy i and ū is the average payoff in population x. This dynamics was considered by Brown and von Neumann and used by Nash.

• In myopic adjustment dynamics the population always moves towards a better reply to the present state, giving a minimal requirement for any adaptation process. The adjustment property is given by

ẋᵀ u(x) ≥ 0,    (4.31)

where u(x) contains the payoffs for all pure strategies in population x, and all the dynamics above satisfy this property.

Evolutionary game theory offers a different approach to the learning paradigm, and new dynamics can easily be generated based on the existing models. But the question is whether the dynamics have a psychological foundation and whether they can depict learning in different types of games.
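For comparison with the replicator sketch above, the following is a minimal Euler step of the Brown-von Neumann-Nash dynamics of Eq. (4.30), under the same assumed payoff-matrix convention u(i, x) = (Ax)_i; the step size is again an illustrative choice.

```python
import numpy as np

def bnn_step(x, A, dt=0.01):
    """One Euler step of Eq. (4.30): dx_i/dt = k_i(x) - x_i * sum_j k_j(x),
    where k_i(x) = max(0, u(i, x) - u_bar) is the positive excess payoff."""
    payoffs = A @ x
    u_bar = x @ payoffs
    k = np.maximum(0.0, payoffs - u_bar)
    return x + dt * (k - x * k.sum())
```

The same simulation loop used for the replicator sketch can be reused by swapping in bnn_step; note that the state again stays on the simplex because the two terms cancel in the sum.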

4.4.3 Social Learning

Fudenberg and Levine [66, p. 51] give three main reasons to study evolutionary models:

• Although the original evolutionary model was motivated by biological evolution, the process can also describe the emulation of economic agents.

• Some properties of the replicator dynamics extend to various classes of more general processes that may correspond to other sorts of learning.

• Evolutionary models have proved helpful in understanding animal behavior; this is an interesting use of the theory of games.

Evolutionary models have been successfully used in evolutionary biology. Inspired by biological techniques, such as inheritance, mutation, natural selection and recombination, evolutionary algorithms are used as a computational tool in optimization. Evolutionary algorithms, i.e. genetic algorithms, evolutionary strategies and evolutionary programming, were introduced in the seventies by Rechenberg [157] and Holland [95]. Recently, social scientists have become interested in explaining social phenomena by these evolutionary models. But can the same dynamics be used in social or cultural evolution? Before applying the dynamics, the biological techniques should be justified. In this section, we address this issue. Evolutionary models have many applications in social sciences. Young [208] studies the evolution of conventions, that is, patterns of behavior in society that are

customary, expected, and self-enforcing. Robson [160] studies different foundations for the evolution of strategic behavior. He finds that cultural inheritance may differ substantially from genetic inheritance, by producing change much more rapidly than genetic evolution does. Hopkins [96] shows that the aggregation of learning behavior can be qualitatively different from learning at the level of the individual, and the aggregate dynamic belongs to the same class with several formulations of evolutionary dynamics. Social learning is also studied in Sobel [181], Demichelis and Ritzberger [44], Young [209], Weibull [207], Kosfeld et al. [117], and Fudenberg and Levine [66, Chapter 3]. Biological models usually consist of one population of a single species, who are randomly matched to play a symmetric two-player game. Many economic applications call for multi-population, rather than single-population dynamics [158]. In economics, the interpretation is that one individual is drawn from each of the player populations, and each individual is programmed to use a pure strategy available to the player whose role she plays. Fudenberg and Levine [66, p. 5] distinguish three matching models:

• Single-pair model. Each period a single pair of players is chosen at random to play the game. At the end of the round, their actions are revealed to everyone. If the population is large, players will behave myopically because it is likely that the players will remain inactive for a long time.

• Aggregate statistic model. Each period all players are randomly matched. At the end of the round, the population aggregates are announced. Once again, if the population is large, the players have no reason to depart from myopic play because each player has little influence on the population aggregates.

• Random-matching model. Each period all players are randomly matched. At the end of each round, each player observes only his own match. Myopic play is approximately optimal if the population is finite but large compared to the players' discount factors. This is because a player's actions today are unlikely to influence the opponents' play tomorrow.

Another important modeling issue is the interpretation of mixed strategies as a description of behavior in a population. Mailath [123, p. 1358] gives two leading interpretations: either the population is monomorphic, in which every member of the population plays the same mixed strategy; or the population is polymorphic, in which each member plays a pure strategy and the fraction of the population playing any particular pure strategy represents the share in the mixed strategy. The distinction is whether players can play and learn mixed strategies.

Kandori et al. [113] and Brenner [21] give interpretations to the evolutionary techniques. Kandori et al. intend their model as a contribution to the literature on bounded rationality and learning. Their model is based on three hypotheses:

• Inertia hypothesis states that not all players need to react

instantaneously to their environment. This is because players' observations may be imperfect, their knowledge of how payoffs depend on strategy choices may be tenuous, and changing one's strategy may be costly. The presence of inertia is then due to uncertainties and adjustment costs. Thus, only a small fraction of players are changing their strategies simultaneously.

• Myopia hypothesis assumes that players react myopically. Because of

the inertia, strategies today are likely to remain effective for some time in the future, and thus acting myopically is justified. The myopia also captures a second important aspect of learning, imitation or emulation. The players conform to choosing strategies that work well, because the world is a complicated place and players cannot calculate best responses.

• Mutation or experimentation hypothesis assumes that there is a small probability that players change their strategies at random. An economic interpretation is that a player exits with some probability and is replaced with a new player who knows nothing about the game and so chooses a strategy at random.

Kandori et al. give three interpretations to the model with differing types of bounded rationality. The first interpretation is that players gradually learn the strategy distribution in the society. A small fraction of the population is exogenously given the opportunity to observe the exact distribution in the society, and takes the best response against it. Uninformed players do not change their strategy choice, even though they may receive partial knowledge of the current strategy distribution through random matching. Such cautious behavior might be inconsistent with Bayesian rationality, but it seems to capture a certain aspect of observed behavior. The second interpretation is that players are completely naive and do not perform optimization calculations. Players sometimes observe the current performance of other players, and simply mimic the most successful strategy, see also Schlag [173]. Players are less sophisticated in that they do not know how to calculate best replies and are using other players' successful strategies as guides for their own choices. In the third interpretation, players are rational with perfect foresight, but there is significant inertia. For example, the cost of changing the strategy may be large

enough, and the opportunity to change strategy occurs infrequently. Brenner [21] finds that evolutionary algorithms are promising for the description of social evolution and learning processes. Mutation, or stochastic variation, can be explained by mistakes, curiosity, and external events. Selection is the result of unsatisfactory behavior, which then will be aborted. Finally, evolutionary replication corresponds to people imitating others. However, there are some crucial differences. First, the evolutionary algorithms do not account for the past performance of individuals. People have a chance to learn from their failures, and players who have gone bankrupt do still exist afterwards in the population. Second, the fitness is more complex in a social context. Personal success does not correlate with the number of children. The fitness is rather subjectively determined, whereas in biology it can be defined objectively by the number of children. And third, evolutionary algorithms do not consider motivational effects within selection. The selection pressure in social evolution depends very much on the satisfaction of the individuals. Thus, the selection process is influenced by the distribution over strategies in a complicated way which cannot be reproduced by common evolutionary algorithms.

4.5 Refinements of Equilibrium Concepts

In the previous chapters, we explained learning, or more specifically experimental behavior, with different learning models. We noted that reinforcement learning describes individual learning in an environment where the player knows only the payoff associated with his choice, and that belief learning requires more information because the unchosen actions will also be reinforced. Moreover, it is situation dependent which learning model describes the experimental results best. Another approach is to study equilibrium refinements or steady states of an adjustment process involving learning. As Gilli [75, p. 3] puts it: if it were possible to use precise formal models of learning processes, equilibrium states would not be relevant, because these processes would describe the actual behavior of an economic system even out of equilibrium. But we do not have such a process, and we must answer the question "what is the equilibrium if this is the environment and this is how the players learn". In the early literature, equilibrium depicted how players would play the game or how a game theorist might recommend to play the game [90]. The interpretation of mixed strategy by Harsanyi [85] and the concept of correlated equilibrium by

Aumann [3] represent equilibrium as the expectations of the others as to how a player will play. The first wave31 of equilibrium refinements includes subgame perfect equilibrium [174], Bayesian Nash equilibrium [84], and conjectural equilibrium [77, 78, 79]. The subgame perfect equilibrium did not, however, capture all that is implied by the idea of backward induction, which notices the rationality in the future, see Fig. 4.4. The second wave of refinements, perfect [175], proper [139], sequential [118], and persistent equilibrium [112], offered solutions to modeling the idea of backward induction. The perfect and proper equilibrium refinements eliminate irrationality in unreached information sets by suggesting that complete rationality is viewed as a limiting case of incomplete rationality; players make mistakes with small vanishing probability, and the equilibrium is the limit of the corresponding behavior. The sequential equilibrium refinement offers a more applicable approach by specifying a system of beliefs that defines the players' beliefs at each information set, and the beliefs should be consistent with the strategies actually played and the structure of the game.

This figure is intentionally left out.

Figure 4.4: Backward induction in extensive form games. The irrational behavior in 2.1 cannot be eliminated by subgame perfection. We can, however, see that player 2 will choose U regardless of player 1's action. By backward induction, player 1 will choose M in 1.1 (5 > 4 and 5 > 2).

Kohlberg and Mertens [116] introduced the concept of forward induction, which notices the rationality in the past, see Fig. 4.5. Besides forward induction, they gave requirements for strategic stability. They defined a number of refinements, but a refinement satisfying all the requirements was not given until Hillas [89].

31 This is completely my own classification.


Moreover, the third wave of refinements studies rationalizability introduced by Bernheim [12] and Pearce [153]. Rationalizable strategies are rationally justifiable and cannot be discarded on the basis of rationality alone, see connection to backward and forward induction in Battigalli [7], and Battigalli and Siniscalchi [10].

This figure is intentionally left out.

Figure 4.5: Forward induction in extensive form games. Player 1 is guaranteed the outcome (2, 5) by choosing T; does this mean that player 1 can force the outcome (3, 1)? Player 2 can figure out that he moves only if player 1 has chosen M, because T strictly dominates B (max(0, 1) < 2). Therefore, player 2 chooses U (1 > 0), and player 1 chooses M (3 > 2).

The fourth wave of refinements includes self-confirming [63] and subjective equilibrium [108, 110], and imperfect monitoring [74, 75]. This wave offers important aspects in view of human behavior. It may well be that players hold subjective beliefs, and these beliefs may remain incorrect in information sets that are not reached during the repetition of the game. The play of the game self-confirms the players, and they never learn from their mistakes. Thus, equilibrium is a result of learning rather than rational introspection. Another important aspect is imperfect monitoring. The players may not observe all the opponents' actions, see also Kalai and Lehrer's environment response function [111]. These aspects can easily be motivated by the cost of observation or by players' ignorance. Fudenberg and Levine [63, 64], see footnote 16 on page 43, suggested a self-confirming equilibrium in extensive form games, which requires that players' beliefs are correct along the equilibrium path of play. The concept is motivated by the idea that equilibria should be interpreted as the outcome of a learning process, in which players revise their beliefs using their observations of previous

play. In Nash equilibrium, each player's beliefs about the opponents' play are exactly correct. In self-confirming equilibrium, players may have incorrect beliefs in contingencies that do not arise when play follows the equilibrium, the observed play never contradicts the players' beliefs, and the beliefs of different players may be wrong in different ways. Thus, Fudenberg and Levine give conditions for when and why the learning might not converge to Nash equilibria. Gilli [74, 75, 73], see also Battigalli [6] and Battigalli and Guatoli [8, 9], studies imperfect monitoring games, where players receive a private signal depending on the strategy chosen by all the players. He provides a rational learning model for general strategic situations, showing asymptotic convergence to conjectural equilibrium. In conjectural equilibrium, first defined by Hahn [77, 78, 79], see also von Hayek [197], players hold conjectures, or speculations, on the outcome of the game, and the signals induced by the strategy profile do not contradict these conjectures that rationalize their choices. This concept generalizes the equilibrium by not requiring that players can observe all the moves in the game, and self-confirming equilibrium is a special case of conjectural equilibrium.


Chapter 5

Adjustment Process

We formulated necessary conditions in Chapter 3 for the Bayesian game, where the principal knows the possible types of agents, but does not observe the type of the agent. The solution of the game is given by these conditions if the earlier mentioned requirements1 are satisfied. Now, we propose that the principal does not know the possible utility functions, but she solves the problem iteratively by evaluating the necessary conditions. In order to be able to do this, the agents must behave myopically. They should not have the ability to manipulate the process. We propose a gradient evaluation method that finds the optimal amounts in the bundles one by one. In the method, the principal holds a one-dimensional interval of uncertainty, in which the optimal amount lies. The principal will narrow the interval down by evaluating the gradient, or necessary condition, at some amount in the interval. If the gradient is positive, then the principal knows that her profit will increase by increasing the amount. Thus, the lower bound of the interval should be set to this amount. On the contrary, if the gradient is negative, the upper bound of the interval should be set to this amount. If the initial interval contains the optimal amount2, the interval will converge to the amount that satisfies the necessary condition. And we say that the principal learns the Bayesian Nash equilibrium.

1 The most important requirements are (i) the utility functions need to satisfy the single crossing property, and (ii) the distribution of types needs to be of a certain shape.
2 One alternative to guarantee this is to choose the initial lower and upper bounds to be 0 and the first-best amount, respectively.


5.1 Introduction

We study the Bayesian game we defined in Chapter 3 with some modifications. Instead of the earlier stage game, we study a recurring game [100], where the same principal repeatedly faces a population of agents with H different classes. Each class corresponds to a type of agent and is characterized by a utility function. The probability of a class represents its fraction of the population, and these fractions are assumed to be stationary throughout the game. In every stage of the game, the principal serves H agents, one chosen from each class. Furthermore, we make the following assumptions:

i) The utility functions satisfy the requirements introduced in Chapter 3.

ii) The probabilities of the classes are favorable so that the unique solution of the game, Bayesian Nash equilibrium, is given by Eq. 3.8.

iii) The agents behave myopically by maximizing their utility in each stage of the game. This is motivated by assuming that there is a large population of agents, and one of the agents is chosen to play the game [66]. The agents will be naive, for they think that their action will not influence the principal's prices, because they are not likely to play the game in the near future.

iv) The principal knows the number of agent types3 and the probabilities of the agent types, but not any of the utility functions.

The first assumption means that the principal knows that the agents' utility functions are concave, that is, the slopes are decreasing in the amount. The second assumption guarantees that it is sufficient to find the amounts that satisfy the necessary conditions. The third assumption makes the extraction of the agents' preferences possible; by offering unit prices, the principal can learn the slopes of the agents' utility functions. Finally, the last assumption means that the only uncertainty is the agents' utility functions. But how can the principal evaluate the gradient? First, we show how the principal can learn the slopes of the agents' utility functions. Suppose the principal offers a linear tariff, that is, a fixed price b and a unit price a, t(x) = ax + b. Now, an agent of type i solves x_i = arg max_x V_i(x) − t(x) = arg max_x V_i(x) − ax − b, and the solution is given by V_i'(x_i) = a. Thus, because of myopicity, the

principal can learn the slope of agents' utility functions. Furthermore, because of the first assumption, the slopes are increasing in type. Thus, the types of agents

3 This actually follows from i) and iii), because the principal can offer a positive unit price for the good, and this will differentiate the myopic agents. After one round, the principal can simply count the number of agents from the agents' actions.


will be differentiated; the lowest type will choose the smallest amount, the second lowest type will choose the second smallest amount, and so on. Second, we show that the principal can learn the slope of any given type at any given amount. Suppose the principal wants to evaluate type i's slope at an arbitrary amount x0. Because the agents' slopes are decreasing in the amount, the principal finds the wanted slope iteratively by following type i's chosen amount and adjusting the unit price. If the chosen amount x_i < x0, the principal needs to decrease the unit price, and type i will choose a bigger amount. Similarly, the unit price is increased if the chosen amount is greater than x0. Finally, the principal learns the unit price which makes type i choose x0, and hence this unit price is type i's slope at x0. Third, the principal can evaluate the gradient, which consists of two4 unknown utility functions. The necessary condition for type i consists of type i's and type i+1's slopes. The principal can evaluate these slopes with the two previously described iterative processes. And thus, the principal learns the gradient. It seems that the gradient evaluation needs a lot of iterations. We will, however, show how to decrease this number notably. Actually, with some sophistication that is presented in the next two sections, finding one slope needs only one iteration and the gradient evaluation needs only two iterations.
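A small sketch of the information-extraction step: a myopic agent facing the linear tariff t(x) = ax + b picks the amount where V_i'(x) = a, so each offered unit price reveals one point of the agent's marginal utility curve. The concrete utility function below is the one used in the examples of Chapter 6; the bisection-based inversion and the cap x_max are implementation assumptions, and the fixed fee b is assumed small enough for participation.

```python
def myopic_response(v_prime, a, x_max=10.0, tol=1e-10):
    """Amount chosen by a myopic agent from the tariff t(x) = a*x + b:
    the x where V'(x) = a (V is concave, so V' is decreasing)."""
    lo, hi = 0.0, x_max
    if v_prime(hi) >= a:            # marginal utility still above the unit price at x_max
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if v_prime(mid) > a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example from Chapter 6: V_L(x) = 2*sqrt(x), so V_L'(x) = 1/sqrt(x).
v_low_prime = lambda x: x ** -0.5 if x > 0 else float("inf")
print(myopic_response(v_low_prime, 1.35))   # about 0.55, as in Table 6.1
```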

5.2 Highest Type's Algorithm

The position of the highest type in the multi-agent game resembles the one-agent model, in which the principal serves an agent whose utility is private information. The other types' utilities or probabilities of arrival do not affect the amount sold to the highest type. Thus, the first-best amount is sold. In the method, the principal tries to find the unique optimal linear tariff by offering tariffs according to an adjustment process. First, the principal offers a linear tariff. Then, she observes the amount the highest type chose and gets an interval in which the optimal amount lies. This interval will get narrower as the principal learns more of the highest type's utility function, and finally the optimal amount is found. To maximize her utility, the principal wants to find the amount that satisfies the

4 The highest type is a special case, where there is only one unknown utility function. Actually, finding the highest type's amount is notably easier than for the others, and thus we will study it first in the next section.


necessary condition f1(x*) = 0, where

f1(x) = V_H'(x) − c'(x).    (5.1)

Notice the omission of the information rent, and that f1 is continuous. Although not knowing the utility function V_H(x), the principal can learn the optimal amount by offering a linear tariff and observing the agent's reaction. The principal chooses an arbitrary point (x0, t0) and constructs a linear tariff going through it,

t(x) = ax + b = t0 + c'(x0)(x − x0).    (5.2)

The agent with type H chooses the amount x_h so that the slope of type H's utility function equals the slope of the tariff, V_H'(x_h) = a = c'(x0). Suppose x_h > x0. Then according to Bolzano's theorem, the optimal amount is between x0 and x_h, see Fig. 5.1. This is because f1 is continuous, V_H''(x) < 0 and c''(x) > 0, so f1(x0) = V_H'(x0) − c'(x0) > V_H'(x_h) − c'(x0) = 0 and f1(x_h) = V_H'(x_h) − c'(x_h) < V_H'(x_h) − c'(x0) = 0. This holds only if the IR constraint is satisfied, that is, b is small enough.

This figure is intentionally left out.

Figure 5.1: The optimal amount is between x0 and x_h. The thick solid, the solid, and the dashed lines are the optimal tariff, the principal's and the type H's indifference curves, respectively.

If x_h < x0,

the interval is (x_h, x0) by similar reasoning. The principal finds the optimal amount iteratively by choosing a new amount x0 from the interval given by the previous iterations. This way the interval gets narrower in each iteration, and the optimal amount x_h* and the linear tariff that gives the optimal amount are found when x_h is close enough to x0. The adjustment dynamics is defined by choosing

a_{i+1} = c'(x̃)   s.t. x̃ lies within all the intervals I_0, ..., I_i.

One heuristic is to choose x̃ in the middle of ∩_{k=0,...,i} I_k. Obviously, if the principal knows something of the shapes of the utility functions, x̃ is chosen according to those beliefs. The optimal amount can also be found in one iteration by setting a nonlinear tariff

t(x) = c(x) + δ,    (5.3)

see Fig. 5.2. Now type H chooses x_h so that V_H'(x_h) = c'(x_h), giving one of the Pareto solutions. Thus, x_h satisfies the necessary condition and is the optimal amount. The only requirement for the tariff in Eq. (5.3) is that δ is small enough, or the IR constraint for the highest type is satisfied. In the light of this, finding the optimal amount for the highest type seems trivial. It is, however, a stepping stone on the way to finding the optimal amounts for the other types.
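The following is a sketch of the interval-narrowing process of this section for the one-type example of Section 6.1 (V_H(x) = 2√x, c(x) = x²); the agent's response is simulated with the closed form (V_H')⁻¹(a) = 1/a², and the midpoint heuristic mentioned above is used for the new pivot.

```python
def c_prime(x):
    return 2.0 * x                       # c(x) = x**2

def agent_response(a):
    return 1.0 / (a * a)                 # solves V_H'(x) = 1/sqrt(x) = a

def highest_type_amount(x0=0.5, tol=1e-6, max_iter=100):
    lo, hi = 0.0, float("inf")           # interval of uncertainty for the optimal amount
    for _ in range(max_iter):
        a = c_prime(x0)                  # linear tariff through the pivot, slope c'(x0)
        xh = agent_response(a)           # observed choice of the highest type
        if abs(xh - x0) < tol:
            return x0                    # V_H'(x0) = c'(x0): necessary condition f1 = 0
        if xh > x0:                      # by Bolzano's theorem the optimum is in (x0, xh)
            lo, hi = max(lo, x0), min(hi, xh)
        else:                            # otherwise it is in (xh, x0)
            lo, hi = max(lo, xh), min(hi, x0)
        x0 = 0.5 * (lo + hi)             # pick the new pivot in the middle of the intersection
    return x0

print(highest_type_amount())             # about 0.63 = 2**(-2/3), as in Section 6.1
```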

5.3 Main Algorithm

Because of the information rents, the optimal amounts of all but the highest type depend on the next types' utilities, as seen from Eq. (3.8). We define

f2(x_i) = p_i [V_i'(x_i) − c'(x_i)] + [Σ_{k=i+1}^{H} p_k] [V_i'(x_i) − V_{i+1}'(x_i)].    (5.4)

Now, the necessary condition at x_i* is f2(x_i*) = 0, and f2 is continuous, because the production cost and the utility functions were assumed to be continuously differentiable.

This figure is intentionally left out.

Figure 5.2: The nonlinear tariff t(x) = c(x) + δ. The thick solid and dashed lines are the nonlinear tariff and the agent's indifference curves.

The idea of the algorithm for the other types is the same as for the highest type: to evaluate the slope of the profit, f2 = ∂π/∂x_i, and to find the root by observing the sign of f2. There are, however, some differences:

• There are two utility functions instead of one. The principal cannot evaluate f2 simply by offering a linear tariff, because the two types will not choose the same amount from the linear tariff.

• The principal will not get an interval in one iteration, in which the optimal

amount is. The principal will, however, get one end of the interval, if the function f2 is assumed to be suitable. Basically, we wish f2 to have a unique

root f2(x_i*) = 0, with f2(x_i) > 0 when x_i < x_i* and f2(x_i) < 0 when x_i > x_i*. This is natural: the profit increases as the principal starts to sell more to type i.

• The interval is for the slope of the tariff instead of the amount.

The optimal amounts are found by a two-step algorithm. In the first step, the principal solves type i's reaction x0 to a given slope α. This is achieved by offering a linear tariff with slope α and observing type i's reaction x0 to it. Now, the principal knows the first unknown function in f2 for the given x0. Next, the principal could find, as explained earlier, the second unknown slope by changing the slope until type i+1's reaction is x0. This way the principal could evaluate f2(x0), but it would take several iterations. Instead of this, we propose that the principal evaluates f2 by offering type i+1 a linear tariff with a


slope β, which is V_{i+1}'(x0) solved from the necessary condition, Eq. (5.4),5

β = V_i'(x0) + p_i [V_i'(x0) − c'(x0)] / Σ_{k=i+1}^{H} p_k
  = α + p_i [α − c'(x0)] / Σ_{k=i+1}^{H} p_k.    (5.5)

Depending on type i+1's reaction x1, the principal knows whether the profit is increasing or decreasing at x0, because the function f2 is the slope of the profit, f2 = ∂π/∂x_i. If x1 < x0, she knows that the profit is increasing at x0, because V_{i+1}''(x) < 0,

f2(x0) = p_i [V_i'(x0) − c'(x0)] + [Σ_{k=i+1}^{H} p_k] [V_i'(x0) − V_{i+1}'(x0)]
       > p_i [V_i'(x0) − c'(x0)] + [Σ_{k=i+1}^{H} p_k] [V_i'(x0) − V_{i+1}'(x1)] = 0.

According to Bolzano's theorem, f2 has a root between x0 and the upper limit6 of x_i. Similarly, f2(x0) < 0 if x1 > x0. In this case, she does not know whether the optimal amount lies above or below x0, unless she assumes that the slope of the profit, f2, is suitable. This is the reason to make assumptions about the utility functions. Without making the assumptions, this problem can be solved by adding more heuristics to the algorithm. Again, the principal finds the optimal amount iteratively by updating α depending on the higher type's response. If we have a unique root, f2(x_i) > 0 when x_i < x_i* and f2(x_i) < 0 when x_i > x_i*, we may use the following algorithm.

Algorithm 1
0. Choose α > 0. Repeat Steps 1-5 until type i+1's response x1 is close enough to type i's response x0.
1. Construct a linear tariff, t(x) = αx + b1.
2. Observe type i's response x0.
3. Construct a new tariff with slope β according to Eq. (5.5), t(x) = βx + b2. If β ≤ 0, increase α and return to Step 1.
4. Observe type i+1's response x1.
5. Update α: ∆α = γ(·) sign(x1 − x0), γ(·) > 0, and return to Step 1.

We may also use the interval approach, where the new α is chosen within the interval.

5 This is the step that saves us a lot of iterations. We will not learn the exact gradient at x0, but we will learn whether the gradient is positive or negative. With this information, we can adjust the slope α.
6 The first-best amount, if no other upper limit is known.
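A sketch of Algorithm 1 above for the two-type example of Section 6.2 (V_L(x) = 2√x, V_H(x) = 3√x, c(x) = x², p_L = p_H = 1/2); the agents' closed-form responses, the interval-style update of α, and the fallback update before an upper bound is found are implementation assumptions based on the description in the text.

```python
def c_prime(x):
    return 2.0 * x

def response_low(slope):                     # V_L'(x) = 1/sqrt(x) = slope
    return 1.0 / slope ** 2

def response_high(slope):                    # V_H'(x) = 1.5/sqrt(x) = slope
    return (1.5 / slope) ** 2

def low_type_amount(alpha=1.35, p_low=0.5, p_high=0.5, tol=1e-4, max_iter=200):
    a_lo, a_hi = 0.0, float("inf")           # interval of uncertainty for the slope alpha
    for _ in range(max_iter):
        x0 = response_low(alpha)             # Steps 1-2: alpha-step, type i's response
        beta = alpha + p_low * (alpha - c_prime(x0)) / p_high   # Step 3: Eq. (5.5)
        if beta <= 0:
            alpha *= 1.1
            continue
        x1 = response_high(beta)             # Step 4: beta-step, type i+1's response
        if abs(x1 - x0) < tol:
            return x0                        # stopping rule of Step 0
        if x1 > x0:                          # f2(x0) < 0: optimum below x0, alpha is a lower bound
            a_lo = max(a_lo, alpha)
        else:                                # f2(x0) > 0: optimum above x0, alpha is an upper bound
            a_hi = min(a_hi, alpha)
        alpha = (0.5 * (a_lo + a_hi) if a_hi < float("inf")
                 else alpha + (c_prime(x1) - c_prime(x0)))      # midpoint, or the pre-bound update
    return x0

print(low_type_amount())                     # about 0.40 = (1/4)**(2/3), as in Table 6.1
```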


5.4 Price Algorithm and Interpretations

After finding the optimal amounts, the principal needs to learn the optimal prices according to Eqs. (3.5) and (3.6). The prices can be found in bottom-up order in type by giving take-it-or-leave-it offers, that is, by raising the prices until the agent rejects the bundle. The only problem with this is the identification of the agent. Suppose that the first i − 1 prices are optimal and the principal raises the price t_i too much. Type i will deviate to bundle i − 1, but the principal cannot distinguish type i from i − 1. This problem can be solved by raising the price of

bundle i − 1 by a small enough δ, 0 < δ ≤ V_i(x_{i−1}) − V_{i−1}(x_{i−1}). Then the deviation is observed, because type i − 1 will not choose the bundle, but type i will. The optimal price is t_i = t̂_i − δ, where t̂_i is the price found in the iteration. A linear tariff corresponds to a fixed fee and a unit price. The unit price is found by the algorithm, whereas the fixed fee is found by giving take-it-or-leave-it offers. The optimal tariff in the multi-agent case can be characterized almost as easily. The optimal bundles are chosen by the agents if the principal offers a piecewise linear tariff. The tariff is constructed so that the slope of each piece is the slope of the corresponding type's utility function at the optimal amount and the discontinuity happens at the optimal amount, because we do not want the next type to choose any amount between the optimal amounts, see Fig. 2.3. The slopes of the utility functions are obtained from the method; they are exactly the final α's. Thus, the piecewise linear tariff classifies the agents according to how much they are willing to buy, and assigns a fixed fee and a unit price to each class. The adjustment process in the one-agent model was defined in Section 5.2. The algorithm for the multi-agent model is more complicated. The algorithm can, however, be interpreted in the following way with the help of Bayesian learning. The idea of the first step, the α-step, is to find the optimal amount. The final α provides the slope that implements the optimal tariff, that is, giving a linear tariff with this slope induces the agent to choose the optimal amount. The idea of the second step, the β-step, is to update incorrect beliefs about the higher type's utility function. The principal tests whether the current α satisfies the optimality condition. The result of this experiment is either an update of her α or a confirmation of optimality.


Chapter 6

Numerical Examples

The algorithms presented in the previous chapter describe how the principal learns the incomplete information in the repeated game. Now, we illustrate how the algorithms work in examples with one, two, and six agents. In simple cases, with one or two agents, the method finds the optimum in a few iterations. The six-agent case shows that more iterations are needed if high precision is demanded. It is also argued that the heuristics behind the algorithm strongly affect the outcome of the algorithm.

6.1 One-agent Case

Let us consider a case with one type of buyer. The buyer's utility function is V(x) = 2√x and the principal's cost function is c(x) = x². By differentiating the utility and cost functions, we get the optimal amount to be sold, x* = 2^(−2/3) ≈ 0.63. The iteration starts from x_0^1 = 0.5 and converges quickly to the optimal value. The first two iterations and the optimal amount, x*, are presented in Fig. 6.1.
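The closed-form value can be checked directly from the first-order condition V'(x) = c'(x); the bisection below is just a numerical verification and its bracketing interval is an assumption.

```python
def f1(x):
    return x ** -0.5 - 2.0 * x        # V'(x) - c'(x) for V(x) = 2*sqrt(x), c(x) = x**2

lo, hi = 0.1, 2.0                     # f1(lo) > 0 > f1(hi)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f1(mid) > 0 else (lo, mid)
print(0.5 * (lo + hi))                # about 0.63 = 2**(-2/3)
```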

6.2 Two-agent Case

Now, let us add another type of buyer to study the two-type case. Half of the customers are the lowest type and half are the highest type. The utility functions for the highest and the lowest type are V_H(x) = 3√x and V_L(x) = 2√x, respectively. We get the optimal amount to be sold to the highest type by differentiating the profit, x_H* = (3/4)^(2/3) ≈ 0.83. We can also calculate the optimal amount in the lowest type's bundle according to Eq. (3.8), x_L* = (1/4)^(2/3) ≈ 0.40. The optimal bundles are presented in Fig. 6.2.

This figure is intentionally left out.

Figure 6.1: Illustration of the iteration in the one-type case. The thick solid, the solid, the dashed, and the dotted lines (below) are the tariffs, the principal's and the agent's indifference curves, and the principal's intervals, respectively. The tilted squares and the stars are the pivot points and the agent's choices. Firstly, in the iteration, the principal offers a tariff going through her initial guess, the pivot (x_0^1, t_0), according to Eq. (5.2). The tariff is tangent to the cost function at x_0^1. Secondly, the agent chooses the amount x^1 of good that maximizes his profit. Notice the characteristic that the amount is chosen in such a way that the indifference curve is tangent to the tariff. The new amount, the pivoting amount x_0^2, is picked in the middle of the lower and upper bound.

This figure is intentionally left out.

Figure 6.2: Illustration of utilities and optimal prices in the two-type case. The solid, the dashed, and the dash-dotted lines are the principal's, the lowest and the highest type's indifference curves, respectively. In addition to the optimal bundles, the first-best solution of the lowest type (x_L^UP, t_L^UP) is shown. This amount is also the upper limit to the optimal amount to be sold to the lowest type in the two-type case. The upper limit is found by setting the tariff as the cost function or by the converging iteration.
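The lowest type's amount can be verified against the necessary condition f2(x) = 0 of Eq. (5.4) with the data of this example (p_L = p_H = 1/2); the bracketing interval of the bisection is an assumption.

```python
def f2_low(x):
    v_l, v_h, c = x ** -0.5, 1.5 * x ** -0.5, 2.0 * x     # V_L', V_H', c'
    return 0.5 * (v_l - c) + 0.5 * (v_l - v_h)

lo, hi = 0.1, 0.8                      # f2_low(lo) > 0 > f2_low(hi)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f2_low(mid) > 0 else (lo, mid)
print(0.5 * (lo + hi))                 # about 0.397 = (1/4)**(2/3)
```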

The iteration for finding the optimal amount for the highest type is analogous to the one-type case. We present the first steps of the iteration used in finding the optimal amount for the lowest type in Table 6.1. α_low and α_high are the current lower and upper bounds of α. The lower bound in the first iteration is in parentheses, because at that point the principal does not know it to be a lower bound for sure; this was discussed in Chapter 4.4. The iteration starts from x_0^1 = 0.55 (the corresponding α = V_L'(0.55) ≈ 1.35 must be known) and converges slowly to the optimal value. Firstly, in the iteration, the principal offers the lowest type a tariff with a slope α, and gets the lowest type's response x0. Secondly, the principal offers the highest type a tariff with slope β according to Eq. (5.5), and gets the highest type's response x1. Thirdly, we update the slope α. Before finding an upper limit to α, we update α as follows: ∆α = c'(x1) − c'(x0). This way the slope α increases, that is, x0 decreases. After finding the upper limit, we start an interval search, in which we update the lower and upper bounds and pick the new α in the middle of the interval. If the response x1 ≠ x0, one of the limits is updated, and if x1 ≈ x0, the optimal amount is found.

Table 6.1: The iteration for finding the low type's amount in the two-type case.

iteration   α       x0      x1      α_low    α_high
0           1.35    0.55    0.88    1.26     10000
1           2.01    0.25    0.18    (1.35)   10000
2           1.68    0.35    0.32    1.35     2.01
3           1.51    0.44    0.48    1.35     1.68
4           1.60    0.391   0.387   1.51     1.68
5           1.59    0.40    0.40    1.51     1.60
∞                                   1.59     1.59

Table 6.2: The data and the results of the extensive case.

index   p_i     V_i(x)       x^{1*}   x*     periods   final interval
H       0.05    4x^{9/10}    911      911    21        0.35
4       0.1     4x^{8/9}     841      807    46        1.34
3       0.3     4x^{7/8}     763      725    52        1.18
2       0.2     4x^{6/7}     676      485    46        0.75
1       0.15    4x^{5/6}     578      194    28        0.26
0       0.2     4x^{4/5}     468      96     42        0.12


6.3 Six-agent Case

Let us study an example with six different types. The cost function is 0.001x², and the data and the results are presented in Table 6.2. The columns of the table are the type index, the probability of arrival, the utility function, the first-best solution, the optimal amount, the number of periods needed, and the final length of the interval. The optimal amount for the highest type is found by the algorithm in Section 5.2. The stopping criterion is that the interval is shorter than 2, and the initial pivot amount is x_0^1 = 500. The other optimal amounts are found by the algorithm in Section 5.3 in top-down order. The stopping criterion is that the interval of α is shorter than 0.0005, and the initial α is chosen from the previous iteration, α_i^0 = V_{i+1}'(x_{i+1}*) = c'(x_{i+1}*). The initial lower and upper bounds on α are 0 and 4, respectively. α is updated in the following manner: ∆α = −0.01 if no lower bound has been found, ∆α = +0.01 if no upper bound has been found, and otherwise α = (α_low + α_high)/2.

The simulation of the extensive case was done in such a manner that there was


no information on the customers' preferences, excluding the probabilities of arrival, the initial lower and upper bounds of α, 0 and 4, and the initial guess of the optimal amount x_H^0, 500. We learn the highest and the other types' optimal amounts accurately in approximately 20 and 40 iterations, respectively. Because only one end of the interval is updated in each iteration, the algorithm for the other types needs more iterations than the algorithm for the highest type, if the same precision is needed. The number of iterations was large, because the precision was high and the heuristic ∆α = ±0.01 was poorly chosen. The number of iterations can

be halved with better heuristics.

Note that the optimal amount for the highest type could have been found in one iteration by offering a nonlinear tariff according to Eq. (5.3). Note that in this case it was profitable to serve all the customers, and the deviation from the first-best solution was bigger for the lowest types and for smaller fractions of the population. These observations are used in the heuristics in the next chapter.


Chapter 7

Heuristics

We have shown that the principal may use an adjustment process to solve the repeated game of incomplete information. We say that Bayesian Nash equilibrium is the result of a learning process. As seen from the examples in Chapter 5, the proposed method may take a lot of time to converge to the exact solution. To counter this, we may think of ways of improving the revealing of information, or we may think of inexact solutions that are sufficiently good but are found more easily. First, we deal with the issue of not observing the type of the agent, and with improving the gathering of information. Second, we propose heuristics that give inexact solutions but with considerably fewer iterations. The heuristics range from brute force methods to sophisticated methods that utilize the dynamics of the system. The idea is to compare the methods in view of simplicity, the number of iterations needed, and the quality of the solution.

7.1 Acquiring Information

One important question regarding the learning is whether the principal can observe Nature's move afterwards. In some situations, it may be natural to think that the principal has to set the tariff before knowing the type, but that she observes the agent's type afterwards. This is not, however, required. The principal may differentiate the types, for example, by offering a linear tariff and observing all the choices of the different types. Now, compare two different approaches to extracting the information. In the first

approach, the principal acquires sufficient information from the agents by offering linear tariffs. With this information, she constructs a tariff that maximizes her expected profit. After this, she may update the solution by making a few improvement steps with the method presented in Chapter 4. In the second approach, the principal uses the method presented in Chapter 4 to find the optimal amounts. In the method, the principal acquires the information dynamically; she keeps extracting information and updating her tariff consecutively. If the revealing of the agent's type requires the choices of all the types, the first approach seems to be better: why not utilize all the information available? On the contrary, if the α- and β-steps can be done efficiently compared to asking all the types their preferences, the second approach is better. This requires that the principal may offer a certain type a linear tariff and observe the type afterwards. These aspects should be borne in mind when comparing different heuristics. Considering the whole process, the method presented in Chapter 4 is not sufficient. Firstly, we wish that there is an initial interval. One way to guarantee this is to find the amounts in top-down order in type. The amount for the highest type is found by the converging one-agent algorithm, and after this the lower types' amounts are bounded by zero and the higher types' amounts. Thus, it is natural to think that the amounts are found in top-down order. On the other hand, if the principal finds the highest type's amount by the nonlinear tariff t = c(x) + δ, she also gets upper limits to the other types' amounts, because the second-best amounts are smaller than the first-best amounts. Secondly, we wish that the initial guess in the method is as good as possible. Before finding a reasonable upper bound to α, the first-best amounts and the previous iterations can be utilized in defining α. One possibility is to start from the first-best amount (actually the corresponding slope) and increase the slope α until she gets an upper bound, or until the response is small enough and the principal may conclude that that type should not be served at all. Thirdly, all the previous iterations should be utilized in the method. This can be employed in many ways. The idea is to notice that a β-step is useless if we already know its consequences from the previous iterations. One idea is to benefit from the previous type's α-steps in the β-steps, because both the α_{i+1} and β_i iterations refer to type i+1, or to V_{i+1}'. Another idea is to notice type i's first-best amount and its slope α_i^{1*}. Because we are only interested in whether x1 < x0 or not, this can be reasoned from previous data if β < max(α_{i+1}, α_{i+1}^{1*}), since V_i'' < 0, x_i^{2*} ≤ x_i^{1*} and x_i^{2*} ≤ x_{i+1}^{2*}. By generalizing this, we get that a β-step

76

should be done only if β < β < β, where β = max(γ | (V ′ )−1 (γ) > x0 ) and

β = min(γ | (V ′ )−1 (γ) < x0 ). The reason for decreasing especially the β-steps is that they do not contribute to optimality; in β-step, amounts close to x2∗ i are sold to type i + 1. As a conclusion, all the information that is available should be utilized. The dynamics of the problem should be exploited: the first-best amounts, the shape of utility functions, and the fact that optimal amounts are increasing in type. The information should be gathered efficiently depending on the problem. If α/β-steps cannot be done efficiently, the information should be gathered collectively.
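To make the bookkeeping concrete, the following Python sketch (hypothetical helper, not part of the thesis' algorithm) stores the previously observed slope–response pairs of an agent and checks whether a further β-step with a candidate slope can still be informative, in the spirit of the bounds above.

```python
def beta_step_needed(history, beta, x0):
    """Decide whether offering slope `beta` is still informative.

    history : list of (slope, response) pairs observed for this agent
    beta    : candidate slope for the next beta-step
    x0      : reference amount we want to compare the response against

    A new offer is useless if previous responses already tell us on
    which side of x0 the response to `beta` would fall.
    """
    # Largest slope whose observed response was still above x0.
    below = [g for g, x in history if x > x0]
    # Smallest slope whose observed response was already below x0.
    above = [g for g, x in history if x < x0]

    beta_low = max(below) if below else float("-inf")
    beta_high = min(above) if above else float("inf")

    # Responses are decreasing in the slope, so only slopes strictly
    # inside (beta_low, beta_high) can reveal new information.
    return beta_low < beta < beta_high


# Example with three earlier offers to the same agent.
history = [(1.0, 420.0), (1.4, 310.0), (1.8, 230.0)]
print(beta_step_needed(history, beta=1.6, x0=300.0))  # True: bracket is (1.4, 1.8)
print(beta_step_needed(history, beta=1.2, x0=300.0))  # False: response would exceed x0
```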

7.2 Inexact Solutions

By an exact solution we mean that the amounts found in the iteration and the optimal amounts are within a given accuracy of each other. It can be shown that the presented method reaches any given accuracy by increasing the number of iterations; thus, we say that the method gives exact solutions. On the contrary, inexact methods give solutions that may deviate considerably from the optimal solution. We want to compare how close the inexact solutions are to those of the presented method with a certain number of iterations.

The inexact methods range from collective extracting and utilizing the first-best solution to linear tariffs. Collective extracting corresponds to a brute-force method, where the principal acquires the information by offering all the agents tariffs with certain slopes; after this, she constructs the tariff that maximizes her expected profit with the data available. In the second heuristic, the principal estimates the second-best amounts with the help of the first-best amounts. The linear tariffs represent simple alternatives to the other presented methods.

7.2.1 Collective Extracting Method (CE)

In this method, the principal gets responses from all the agents by offering linear tariffs whose slopes cover a given interval with step ∆_γ^CE. In addition, the principal knows the first-best solution. The principal then finds the optimal amount by evaluating the necessary condition with the given data. (At this point the principal already possesses the gathered information, and the following steps are only calculations with the data.) She starts from the first-best amount and decreases the amount with step ∆_x^CE until f_2(x) < 0; f_2 is evaluated by approximating V_i′(x) and V_{i+1}′(x) linearly with the given data. Finally, the optimal amount is approximated linearly,

x* = x̄ + f_2(x̄)(x̲ − x̄)/(f_2(x̄) − f_2(x̲)),

where f_2(x̄) > 0, f_2(x̲) < 0, and x̄ = x̲ + ∆_x^CE.
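A minimal sketch of the calculation phase of the CE heuristic, assuming the collected responses have already been turned into an approximation f2 of the marginal expected profit (the function and parameter names are illustrative, not from the thesis):

```python
def ce_amount(f2, x_start, dx):
    """Scan downward from the first-best amount with step dx until the
    approximated marginal profit f2 changes sign, then interpolate the
    zero linearly between the bracketing amounts, as in Section 7.2.1."""
    x_hi, f_hi = x_start, f2(x_start)
    x_lo, f_lo = x_hi, f_hi
    while f_lo * f_hi > 0:            # keep stepping down until the sign changes
        x_hi, f_hi = x_lo, f_lo
        x_lo = x_hi - dx
        f_lo = f2(x_lo)
    # Linear (false-position) interpolation of the zero of f2.
    return x_hi + f_hi * (x_lo - x_hi) / (f_hi - f_lo)


# Toy stand-in for the approximated marginal profit, with a zero at x = 700.
f2 = lambda x: 700.0 - x
print(ce_amount(f2, x_start=910.6, dx=1.0))   # approximately 700.0
```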

7.2.2 Utilizing the First-Best Solution (FB)

The second-best solution can be estimated from the first-best solution by approximating the information rent. Recalling the interpretation of the necessary condition, the term V_i′(x_i) − c′(x_i) corresponds to the profit of type i and V_i′(x_i) − V_{i+1}′(x_i) to the information rent in Eq. (3.8). Now, if we estimate that V′ changes as fast as 2c′, we get

∆_{α,i}^{FB} ≈ (1/3)(V_i′ − c′) = (Σ_{k=i+1}^{H} p_k)(V_i′ − V_{i+1}′)/(3p_i) ≈ (Σ_{k=i+1}^{H} p_k)(α_i^{1*} − α_{i+1}^{1*})/(3p_i).    (7.1)

The optimal amount for type i is then given by his response to the slope α_i^{1*} − ∆_{α,i}^{FB}.
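Assuming the first-best slopes α_i^{1*} and the type probabilities are known, the FB correction of Eq. (7.1) amounts to a few lines. The ordering convention in the sketch below (highest type first) is an illustrative assumption.

```python
def fb_slopes(alpha_fb, p):
    """Slopes to offer under the FB heuristic, following Eq. (7.1).

    alpha_fb : first-best slopes, ordered from the highest type (index 0)
               down to the lowest type
    p        : corresponding probabilities

    The highest type keeps its first-best slope; every other type i is
    offered alpha_i - Delta_i, where Delta_i is Eq. (7.1) and the "higher
    types" are those listed before position i.
    """
    offered = [alpha_fb[0]]
    for i in range(1, len(alpha_fb)):
        mass_above = sum(p[:i])                  # sum of p_k over the higher types
        delta = mass_above * (alpha_fb[i] - alpha_fb[i - 1]) / (3.0 * p[i])
        offered.append(alpha_fb[i] - delta)
    return offered


# Six-agent example of Section 7.3.1: the first-best slopes are c'(x) = 0.002*x
# evaluated at the first-best amounts of Table 7.1.
p = [0.05, 0.1, 0.3, 0.2, 0.15, 0.2]
alpha_fb = [0.002 * x for x in [910.6, 841.2, 763.3, 675.8, 577.5, 467.8]]
print([round(a, 3) for a in fb_slopes(alpha_fb, p)])
# e.g. the second entry (type 4) is about 1.71; type 4's response to it is
# close to the FB value 743.8 reported in Table 7.1.
```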

7.2.3 Linear Tariffs

We propose three different linear tariffs. The first is the continuous linear (CL) tariff. Because the tariff is continuous, not all of the possible surplus is extracted from the higher types. The price is set so that the lowest type is willing to take his bundle, and the slope is adjusted so that the expected profit is maximized. The second is the discontinuous linear (DL) tariff. The discontinuities are at the optimal amounts, and the prices are set so that all the possible surplus is extracted from the agents: each price is set so that each agent is indifferent between his own and the previous bundle. The slope is adjusted so that the expected profit is maximized. The third (DLH) is a special case of the discontinuous linear tariff, where the slope is set exactly so that the highest type chooses the first-best (also second-best) amount.
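To illustrate how the CL slope could be chosen, the following sketch assumes the principal can simulate the buyers' responses; setting the intercept so that the lowest type's participation constraint binds is my reading of the CL description, and the helper names are hypothetical.

```python
def cl_expected_profit(slope, types, cost):
    """Expected profit of a continuous linear tariff t(x) = t0 + slope*x,
    with the intercept t0 chosen so that the lowest type is just willing
    to take his bundle (his participation constraint binds).

    types : list of (probability, demand, utility), lowest type listed last;
            demand(slope) returns the chosen amount, utility(x) the gross utility
    cost  : production cost function c(x)
    """
    p_low, demand_low, util_low = types[-1]
    x_low = demand_low(slope)
    t0 = util_low(x_low) - slope * x_low
    return sum(p * (t0 + slope * demand(slope) - cost(demand(slope)))
               for p, demand, _ in types)


def best_cl_slope(types, cost, lo=0.8, hi=2.2, step=0.01):
    """Grid search over the slope, mirroring how the linear tariffs are tuned
    in Section 7.3 (step corresponds to the delta_alpha parameter)."""
    grid = [lo + k * step for k in range(int((hi - lo) / step) + 1)]
    return max(grid, key=lambda s: cl_expected_profit(s, types, cost))
```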

7.3 Comparison

We study three different examples: cases with two, three, and six different types. The two- and six-agent cases are the same as those studied in Chapter 5; the three-agent case is modified from the six-agent case. We are interested in how good a solution each heuristic gives. We measure this by examining the amounts, the expected profit, and the number of iterations needed for the solution. We compare the first-best solution, the second-best solution given by the improved method (IM), and the inexact solutions presented earlier in this chapter. The improved method is the method of Chapter 4 augmented with the ideas of this chapter. The amounts are found in top-down order, and the slope in the α-step is determined by the initial guess

α_i^0 = α_i^{1*} − (Σ_{k=i+1}^{H} p_k)(α_i^{1*} − α_{i+1}^{1*})/(4p_i)

and by the update α_i^{n+1} = α_i^{1*} + 2(α_i^n − α_i^{1*}) until the upper bound is found, after which the interval between the current slope bounds is bisected. (Here α_i^{1*} and α_{i+1}^{1*} are known because the first-best solution is known and the amounts are found in top-down order; a different algorithm is used for the highest type.) A β-step is done only if it is necessary; we notice that the previous α_{i+1}-step can be exploited in the β_i-step.
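A sketch of the slope-update rule of the α-step in the IM (a hypothetical helper mirroring the formulas just described: the first branch is the initial guess, the second the doubling phase, the third the bisection phase):

```python
def next_alpha(alpha_fb_i, alpha_fb_next, p_i, p_higher,
               alpha_prev=None, upper=None, lower=None):
    """Propose the next trial slope for type i's alpha-step.

    alpha_fb_i, alpha_fb_next : first-best slopes of type i and of the next
                                higher type
    p_i, p_higher             : probability of type i and the total mass of
                                the higher types
    alpha_prev                : previous trial slope (None on the first step)
    upper, lower              : current slope bounds, once they are known
    """
    if alpha_prev is None:
        # Initial guess: first-best slope corrected by a quarter of the
        # estimated information-rent term.
        return alpha_fb_i - p_higher * (alpha_fb_i - alpha_fb_next) / (4.0 * p_i)
    if upper is None:
        # No upper bound yet: double the distance from the first-best slope.
        return alpha_fb_i + 2.0 * (alpha_prev - alpha_fb_i)
    # Both bounds known: plain bisection.
    return 0.5 * (upper + lower)


# First trial slope for type 4 in the six-agent example
# (first-best slopes roughly 1.682 and 1.821, probabilities 0.1 and 0.05):
print(round(next_alpha(1.682, 1.821, p_i=0.1, p_higher=0.05), 3))   # about 1.699
```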

7.3.1 Six-agent Case

The data in the first example is the same as in Chapter 5: c(x) = 0.001x², u(·) = 4[x^{9/10}, x^{8/9}, x^{7/8}, x^{6/7}, x^{5/6}, x^{4/5}], and p(·) = [0.05, 0.1, 0.3, 0.2, 0.15, 0.2]. The results are presented in Table 7.1.

First, we compare the improved and the collective extracting methods in finding the exact solution. The improved method needs 75 and 125 iterations to reach the accuracies ∆_x^IM = 1.34 and ∆_x^IM = 0.05, respectively, whereas 80 iterations with the collective extracting method produce errors greater than 40. As the number of iterations is increased to 1000, the CE method produces as accurate a solution as the IM. As mentioned above, the iterations are not exactly comparable: each iteration in the CE is more arduous than an α/β-step in the IM. Also notice that the outcome of the CE depends greatly on how the iterations are organized. Here, the slopes were evenly distributed over the interval [0.8, 2.2], and ∆_x^CE = 1.

Second, we compare the inexact solutions. All three linear tariffs were determined by calculating the profits for slopes on a given interval with step ∆_α^CL = ∆_α^DL = 0.01 and choosing the slope giving the best profit. The linear tariffs give worse profit than the first-best solution; on the contrary, the CE and the FB give better profit. The FB gives an especially good outcome relative to the iterations needed: only the first-best amounts were needed in determining the solution.

Third, we examine how many iterations each amount took and how big the final interval was. The IM needed approximately 10–20 iterations per type to produce an accurate enough solution. As a rule of thumb, one α/β-iteration halves the interval. The CE needed more iterations to reach the same accuracy; twenty iterations were not sufficient to produce a good enough solution.
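For reference, the first-best amounts of Table 7.1 can be reproduced from the data above by solving V_i′(x) = c′(x) for each type; a simple bisection suffices (a sketch, no external libraries):

```python
def bisect(f, lo, hi, tol=1e-6):
    """Root of f on [lo, hi] by bisection; assumes f changes sign on the interval."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Six-agent data: c(x) = 0.001 x^2, u_i(x) = 4 x^e with the exponents below,
# so the first-best condition V_i'(x) = c'(x) reads 4*e*x**(e-1) = 0.002*x.
exponents = [9/10, 8/9, 7/8, 6/7, 5/6, 4/5]
first_best = [bisect(lambda x, e=e: 4 * e * x ** (e - 1) - 0.002 * x, 1.0, 2000.0)
              for e in exponents]
print([round(x, 1) for x in first_best])
# approximately [910.6, 841.2, 763.3, 675.8, 577.5, 467.8], the first-best row of Table 7.1
```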


Table 7.1: The results of the first example.

Solution   x_H     x_4     x_3     x_2     x_1     x_0     Profit
1st-best   910.6   841.2   763.3   675.8   577.5   467.8   352.1
IM,125     ·       806.8   724.8   485.3   194.2   95.8    433.7
IM,75      ·       807.2   724.3   485.0   194.3   95.9    433.7
CE,1000    ·       807.6   725.1   487.3   196.2   95.7    433.7
CE,80      ·       820.3   727.1   501.8   238.7   109.9   433.0
CE,20      ·       841.2   763.3   603.0   485.6   198.0   408.1
FB         ·       743.8   667.0   353.2   154.5   120.1   427.1
DLH        ·       412.0   186.0   83.8    37.6    16.7    256.1
DL         1613    689.3   294.0   125.1   53.0    22.3    296.3
CL         1523    654.3   280.7   120.1   51.2    21.7    276.5

Iterations and final errors of the IM per type (reported as 75-iteration run / 125-iteration run):

Type       H       4         3         2         1         0
α-steps    0       7/12      7/12      9/14      9/14      8/13
β-steps    0       5/10      5/10      9/14      9/14      7/12
Error      0       1.17/.04  1.20/.04  0.97/.03  .93/.03   1.31/.04

Table 7.2: The results of the second example.

Solution   x_2     x_1     x_0     Profit
1st-best   972.5   780.2   565.1   472.3
IM,25      ·       717.2   427.0   487.6
CE,10      ·       780.2   509.7   481.2
FB         ·       628.6   347.2   480.2
DLH        ·       134.0   21.8    220.2
DL         1229    161.5   25.0    228.4
CL         1229    161.5   25.0    204.4

Iterations and final errors of the IM per type:

Type       2       1       0
α-steps    0       6       7
β-steps    0       6       6
Error      0       3.64    2.51

7.3.2 Three-agent Case

The data of the three-agent case is modified from the six-agent case: c(x) = 0.001x², u(·) = [4.3x^{9/10}, 4.1x^{7/8}, 3.9x^{5/6}], and p(·) = [0.1, 0.3, 0.6]. The parameters in the heuristics were ∆_x^IM = 5, the CE slope interval [0.8, 2.2], ∆_x^CE = 1, ∆_α^DL = 0.1, and ∆_α^CL = 0.05. The results are presented in Table 7.2.

The IM needed 25 iterations to reach the given accuracy of 5. In the CE, 10 iterations were not sufficient. The CE and the FB produced better profit than the first-best solution, but considerably worse than the IM. Again, the linear tariffs produced a worse outcome than the other heuristics.


Table 7.3: The results of the third example.

Solution   x_H     x_L     Profit
1st-best   .83     .63     1.221
IM,13      ·       .40     1.258
CE,30      ·       .42     1.258
FB         ·       .52     1.248
DLH        ·       .37     1.258
DL         .84     .37     1.258
CL         .80     .35     1.182

Iterations and final errors of the IM per type:

Type       H       L
α-steps    0       7
β-steps    0       6
Error      0       0.006

7.3.3 Two-agent Case

This example is the same as in Chapter 5: c(x) = x², u(·) = [3√x, 2√x], and p(·) = [1/2, 1/2]. The parameters in the heuristics were ∆_x^IM = 0.01, the CE slope interval [1, 2.5], ∆_x^CE = 0.01, and ∆_α^DL = ∆_α^CL = 0.01. The results are presented in Table 7.3.

The IM needed 13 iterations to reach the given accuracy of 0.01. In the CE, 30 iterations were sufficient to produce a good enough solution. Excluding the FB and the CL, the heuristics produced approximately the same solution. The FB produced better profit than the first-best solution. The linear tariffs produced a good enough solution in this simple example, but the CL could not extract all the surplus from the agents.
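This example is simple enough to verify the reported amounts and profit directly from the necessary conditions of Chapter 3; the low type's first-order condition below is my reading of Eq. (3.8) for two equally likely types.

```python
def bisect(f, lo, hi, tol=1e-9):
    """Root of f on [lo, hi] by bisection; assumes a sign change on the interval."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(lo) * f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)


c = lambda x: x ** 2
uH, uL = (lambda x: 3 * x ** 0.5), (lambda x: 2 * x ** 0.5)
dH, dL = (lambda x: 1.5 * x ** -0.5), (lambda x: 1.0 * x ** -0.5)   # marginal utilities
pH = pL = 0.5

# First-best: V_i'(x) = c'(x).
xH = bisect(lambda x: dH(x) - 2 * x, 0.01, 5.0)      # ~0.83
xL_fb = bisect(lambda x: dL(x) - 2 * x, 0.01, 5.0)   # ~0.63

# Second-best: the high type keeps the first-best amount; the low type's
# amount solves p_L (V_L' - c') = p_H (V_H' - V_L').
xL = bisect(lambda x: pL * (dL(x) - 2 * x) - pH * (dH(x) - dL(x)), 0.01, 5.0)   # ~0.40

# Prices: IR binds for the low type, IC against the low bundle binds for the high type.
tL = uL(xL)
tH = tL + uH(xH) - uH(xL)
profit = pH * (tH - c(xH)) + pL * (tL - c(xL))
print(round(xH, 2), round(xL, 2), round(profit, 3))   # 0.83 0.40 1.258, as in Table 7.3
```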


Chapter 8

Discussion

8.1 Summary

We studied learning in a specific situation. We presented a buyer–seller game and its solution, Bayesian Nash equilibrium, in Chapters 2 and 3. The game is a principal–agent problem with hidden information, also known as adverse selection or mechanism design. In the game, the principal offers a tariff to the agent without knowing the agent's utility. The game can be modeled as a Bayesian game, where the principal knows the possible types (the common prior assumption) but does not observe the type. This situation can be interpreted as a constraint that the seller must offer the same tariff to all the buyers, which is sometimes called second-degree price discrimination.

The issue of learning was studied in Chapters 1 and 4. In learning we are interested in whether players learn to make the Nash equilibrium decisions by playing the game repeatedly and observing the other players' actions. In this case, the result is not Nash equilibrium, which corresponds to the first-best solutions, but Bayesian Nash equilibrium; this is because the seller must offer the same tariff to the whole population of buyers. The learning models are usually divided into two classes: belief-based and reinforcement-based models. In reinforcement-based models, the strategies leading to better payoffs become more likely as time goes by. In belief-based models, players determine their action, e.g. a best response, according to historical play; one example of such models is Cournot adjustment, where players react to the opponent's previous action.

The adjustment process does not fit the traditional learning models because of the structure of the game. The game is not finite; the players do not have a finite number of actions. In addition, the players do not observe the Nature's move or the others' strategies: the principal only knows how much was bought, but not by whom. The Bayesian game describes the effect of information in the game, and the equilibrium gives the solution when the distribution of types is known. We extend the incomplete information model by assuming that the game is repeated and that the principal does not know the possible types. We depart from the assumption of complete information, but compensate this with an experimenting adjustment process that extracts information. The idea of the adjustment process is not to return to complete information, but to sufficient information. In the process, the principal seeks the solution by evaluating the optimality condition.

We model the incomplete information by taking a nonparametric approach. In this approach, we assume that the buyer's private information, the type, is the whole utility function. This means that the utility functions do not need to follow any parameterization. The utility functions are arbitrary; knowing one of them will not help with the others. This is what we mean by not knowing the possible utility functions. If there were a one-dimensional parameterization, the principal could offer a linear tariff and reveal the type afterwards by calculating the value of the parameter from the agent's action. Despite the nonparametric approach, we assume the single-crossing property, which means that the utility functions can be ordered in type; the difference of valuations increases as the amount increases. With this and a few other assumptions, the solution of the problem is given by the necessary conditions presented in Chapter 3. It should be noted that in general the conditions may not be sufficient.

The learning relies on the assumption of myopicity. The linear tariffs are used to extract information from myopic agents: a myopic agent chooses his action from a linear tariff so that the slope of his utility function is revealed. Conveniently, this information can be used to evaluate the necessary conditions, which consist of slopes of the utility functions. The adjustment process suggests one systematic way of extracting the information; in the end, the solution satisfies the necessary condition.

The adjustment process was studied in Chapters 5, 6 and 7. The efficiency of the algorithm depends strongly on the slope update parameters.

With reasonable update parameters, each of the optimal amounts is found to good accuracy within several iterations. The algorithm can be enhanced if there is additional information about the dynamics of the system, e.g. any knowledge of the utility functions. We also presented heuristics that do not give exact solutions but use considerably fewer iterations.

8.2 Adjustment Process as a Learning Model

We suggested a new approach to modeling incomplete information. Instead of the deductive approach, where we are interested in describing the equilibrium of the problem, we presented a learning model, where the seller learns the buyers' preferences and the equilibrium by repeating the game; this is the inductive approach. The adjustment process can be seen as an iterative solution method under limited information. The method uses only local information; thus, it provides an interesting alternative for solving general mechanism design problems. Traditional mechanism design assumes that the agents' preferences are known for all, possibly numerous, alternatives; in contrast, our approach evaluates preferences only for the necessary alternatives. Thus, the adjustment process answers how the equilibrium can be reached and defined.

The adjustment process has similarities with the traditional learning models:

• As in reinforcement learning, the seller makes her decisions under limited information. This issue is also close to imperfect monitoring, where the players do not necessarily observe the opponents' actions. In this case, the seller does not observe the buyers' types.

• As mentioned earlier, the adjustment process is an iterative solution method, as fictitious play originally was meant to be. The difference to fictitious play is the form of the game and the players' information about the game.

• As in rule learning, the seller evaluates optimality conditions and adjusts the tariff if the conditions are not met. The gradient evaluation can loosely be seen as a rule.

• As in self-confirming or conjectural equilibrium, the play of the game only confirms the seller's conjectures. The process may converge, but it is not guaranteed that the convergence point is the optimum or that the buyers acted truthfully (myopically). The necessary conditions may not be sufficient, and the buyers may manipulate their actions so that an outcome desirable to them is reached.

The adjustment process is not very similar to the traditional learning models because of the form of the problem; the players do not have a finite number of actions, and the only active player is the seller. The problem could actually be seen as an individual decision-making problem rather than a problem of game theory.

8.3 Uncertainty, Frame Problem and Practical View

The use and end of reason is not the finding of the sum and truth of one, or a few consequences, remote from the first definitions and settled significations of names; but to begin at these, and proceed from one consequence to another. For there can be no certainty of the last conclusion without a certainty of all those affirmations and negations on which it was grounded and inferred. As when a master of a family, in taking an account, casteth up the sums of all the bills of expense into one sum; and not regarding how each bill is summed up, by those that give them in account, nor what it is he pays for, he advantages himself no more than if he allowed the account in gross, trusting to every of the accountant's skill and honesty: so also in reasoning of all other things, he takes up conclusions on the trust of authors, and doth not fetch them from the first items in every reckoning (which are the significations of names settled by definitions), loses his labour, and does not know anything, but only believeth.
— Thomas Hobbes [91, Chapter V]

We argued that incomplete information could be modeled as ignorance or internal uncertainty [106, 105], besides the "inevitable" external uncertainty. Then, we introduced learning as a way of removing this ignorance. At first, it seems that the uncertainty might be removed completely by learning. But should the result then not be Nash equilibrium instead of Bayesian Nash equilibrium? No, if we interpret the IC constraints as a law that forbids price discrimination: then this characteristic of the problem is a constraint rather than uncertainty.

We used a nonparametric approach to model this ignorance mathematically. We extended the private information, the type, to be the whole utility function. This means that the utility functions are arbitrary, which is exactly what we wanted: we wanted to model the situation in which the seller might not know anything about the buyers' utilities. Traditionally, incomplete information is represented by a parametric probability distribution, which raises many questions. Is the parameterization correct, and does it even cover the range of possibilities? This difficulty is generally known as the frame problem [69], which was introduced by McCarthy and Hayes [130] in artificial intelligence. (The problem arose from the question of how to determine efficiently which things remain the same in a changing world; it has since been given different formulations.) In our case, we have to frame the complex problem the right way to achieve a solution: we need to determine what is relevant and what is irrelevant. We frame the problem more generally; we do not assume a parametric model.

In the end, we want to model realistic situations. Tversky and Kahneman [194, 105] noticed that humans use simple heuristics in decision making. It has also been observed that learning models describe experimental results better than traditional equilibrium theory, see [55, 206, 66]. Thus, it is justified to examine a model where the utility functions are not known due to ignorance and where a simple adjustment process describes the dynamics of the game.

If the agents are not myopic, they may mislead the principal by cheating. The agents may affect the equilibrium in two ways: by changing the probabilities of the classes or their utility functions. The agents may mislead the principal by choosing coordinated actions that benefit them in the long run: they notice the dynamic effects of the situation and reject the principal's offers to make her lower her prices. We could also reverse the players' roles, so that the agents are the intelligent ones and the principal the one using some heuristic. It is, however, more natural to think that the monopolistic firm is the rational and intelligent one.


Appendix A

Proof of Proposition 1

a) We shall write the necessary conditions of the maximization problem

max_{x,t}   Σ_{i=1}^{H} p_i (t_i − c(x_i))
s.t.   V_i(x_j) − V_i(x_i) + t_i − t_j ≤ 0,   ∀ i, j, j ≠ i    (IC)
       t_i − V_i(x_i) ≤ 0,   ∀ i    (IR)                                  (1)

with the help of the Kuhn–Tucker conditions:

p_i + Σ_{j≠i} λ_{i,j}^{IC} − Σ_{k≠i} λ_{k,i}^{IC} + λ_i^{IR} = 0,   ∀ i                                         (2)
−p_i c′(x_i) − V_i′(x_i) Σ_{j≠i} λ_{i,j}^{IC} + Σ_{k≠i} V_k′(x_i) λ_{k,i}^{IC} − V_i′(x_i) λ_i^{IR} = 0,   ∀ i    (3)
λ_{i,j}^{IC} [V_i(x_j) − V_i(x_i) + t_i − t_j] = 0,   ∀ i, j, j ≠ i                                              (4)
λ_i^{IR} [t_i − V_i(x_i)] = 0,   ∀ i                                                                             (5)
λ_i^{IR}, λ_{i,j}^{IC} ≤ 0,   ∀ i, j, j ≠ i                                                                      (6)
IC, IR                                                                                                           (7)

From Eqs. (2) and (6) it follows that at least one of λ_{i,j}^{IC} and λ_i^{IR} is negative, because the probability of arrival is positive. According to the complementary slackness (CS) conditions (4) and (5), at least one of the inequalities must then be binding, which completes the proof. Notice that if neither of the constraints were binding, the principal could raise the prices and be better off.


b) Assuming that λ_i^{IR} = 0 ∀ i and summing Eq. (2) over all the types, we get

Σ_i [ p_i + Σ_{j≠i} λ_{i,j}^{IC} − Σ_{k≠i} λ_{k,i}^{IC} + λ_i^{IR} ] = 0   ⇔   Σ_i ( p_i + λ_i^{IR} ) = 0.

This is a contradiction, since Σ_i p_i = 1. Thus λ_i^{IR} ≠ 0 for some i. If none of the IR constraints were binding, the principal could raise all the prices by the same constant until one of the IR constraints becomes binding, and obtain a better profit.

c) Let us study Eq. (3.4) for i ≠ 0:

V_i(x_i) − t_i ≥ max_{j≠i} [V_i(x_j) − t_j] ≥ V_i(x_0) − t_0 > V_0(x_0) − t_0 ≥ 0.

Thus, according to b), the IR constraint must be binding for the lowest type. Note that we assumed that V_i(x_0) > V_0(x_0), which will not hold if the lowest type is not served, x*_0 = 0. In general, the IR constraint will hold only for the lowest type that is served, if the reservation utility is the same for all the types. Notice that, according to a), at least one of the IC constraints must be binding for each of the higher types.

d) If we assume equality in Eq. (3.4) for the lowest type, and use c), we get

V_0(x_0) − t_0 = max_{j≠0} [V_0(x_j) − t_j]
             = V_0(x_k) − t_k,   k ≠ 0
             = V_0(x_k) − V_k(x_k) + max_{j≠k} [V_k(x_j) − t_j]
             ≥ V_0(x_k) − V_k(x_k) + V_k(x_0) − t_0
⇔ V_0(x_0) − V_0(x_k) ≥ V_k(x_0) − V_k(x_k),

which contradicts the assumption that V_k′(x) > V_0′(x).

e) We show that the IC constraints are binding with respect to the previous type; that is, we show that such pricing is feasible and optimal. The optimality follows from the fact that the prices cannot be raised, because then the binding IC constraints would be violated. On the other hand, the principal loses some of her profit by decreasing one of the prices, because none of the prices can be raised by decreasing one of the prices. Observing type i's binding IC constraint

V_i(x_i) − t_i = V_i(x_{i−1}) − t_{i−1}   ⇔   V_i(x_i) − V_i(x_{i−1}) = t_i − t_{i−1},


we use the single-crossing condition and get that for type h > i

V_h(x_i) − V_h(x_{i−1}) > V_i(x_i) − V_i(x_{i−1}) = t_i − t_{i−1}   ⇒   V_h(x_i) − t_i > V_h(x_{i−1}) − t_{i−1},

which means that type h values bundle i more than bundle i − 1, and for type l < i

V_l(x_i) − V_l(x_{i−1}) < V_i(x_i) − V_i(x_{i−1}) = t_i − t_{i−1}   ⇒   V_l(x_i) − t_i < V_l(x_{i−1}) − t_{i−1},

which means that type l values bundle i − 1 more than bundle i. (Both inequalities hold if x_i > x_{i−1}.) Going through all the binding IC constraints, we get that type i strictly prefers his own bundle to those of types 0, …, i − 2 and i + 1, …, H, which means that the IC constraints are not binding for these types. Thus, the only binding IC constraint is the previous type's IC constraint. Note that we assumed in the proof that x_i > x_{i−1}. See Maskin and Riley [124] and Spence [183].

f) On the contrary, assume that x*_i > x*_{i+1}. Summing the IC constraints V_{i+1}(x*_{i+1}) − t*_{i+1} ≥ V_{i+1}(x*_i) − t*_i and V_i(x*_i) − t*_i ≥ V_i(x*_{i+1}) − t*_{i+1}, we get

V_{i+1}(x*_{i+1}) + V_i(x*_i) ≥ V_{i+1}(x*_i) + V_i(x*_{i+1})
⇔ V_i(x*_i) − V_i(x*_{i+1}) ≥ V_{i+1}(x*_i) − V_{i+1}(x*_{i+1})
⇔ ∫_{x*_{i+1}}^{x*_i} V_i′(t) dt ≥ ∫_{x*_{i+1}}^{x*_i} V_{i+1}′(t) dt,

which contradicts the assumption. See Spence [183].
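As a numerical illustration of parts c) and e), the following sketch lists the binding constraints of a given menu; applied to the (rounded) two-agent solution of Section 7.3.3, only the lowest type's IR constraint and the high type's IC constraint against the low bundle bind.

```python
def active_constraints(x, t, V, tol=1e-8):
    """List the binding IC and IR constraints at a menu (x, t).

    x, t : amounts and prices indexed by type (0 = lowest type)
    V    : V[i](x) gives type i's gross utility
    Mirrors the constraints of problem (1): IC_{i,j} and IR_i.
    """
    H = len(x)
    binding = []
    for i in range(H):
        if abs(t[i] - V[i](x[i])) < tol:
            binding.append(("IR", i))
        for j in range(H):
            if j != i and abs(V[i](x[j]) - V[i](x[i]) + t[i] - t[j]) < tol:
                binding.append(("IC", i, j))
    return binding


# Two-agent solution of Section 7.3.3 (rounded values): only IR of the lowest
# type and the high type's IC against the low bundle should be reported.
V = [lambda x: 2 * x ** 0.5, lambda x: 3 * x ** 0.5]
print(active_constraints([0.397, 0.825], [1.260, 2.095], V, tol=1e-3))
# [('IR', 0), ('IC', 1, 0)]
```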


Bibliography [1] Akerlof G. A. The Market for Lemons: Quality Uncertainty and the Market Mechanism. Quarterly Journal of Economics 89: 488-500. 1970 [2] Aumann R. J. Agreeing to Disagree. The Annals of Statistics 4: 1236-1239. 1976 [3] Aumann R. J. Correlated Equilibrium as an Expression of Bayesian Rationality. Econometrica 55: 1-18. 1987 [4] Barelli P. Beyond the Common Prior Assumption. Working Paper. Columbia University. Department of Economics. New York 2003 [5] Baron D. P. and Myerson R. B. Regulating a Monopolist with Unknown Costs. Econometrica 50: 911-930. 1982 [6] Battigalli P. Comportamento Razionale ed Equilibrio nei Giochi e Nelle Situazioni Strategiche. Unpublished Dissertation. Bocconi University 1987 [7] Battigalli P. On Rationalizability in Extensive Games. Journal of Economic Theory 74: 40-61. 1997 [8] Battigalli P. and Guaitoli D. Conjectural Equilibria and Rationalizability in a Macroeconomic Game with Incomplete Information. Working Paper 1988-6. Bocconi University 1988 [9] Battigalli P. and Guaitoli D. Conjectural Equilibria and Rationalizability in a Game with Incomplete Information. In: Battigalli P., Montesano A. and Panunzi F. Decision, Games and Markets. Kluwer. Dordrecht 1997 [10] Battigalli P. and Siniscalchi M. Interactive Beliefs, Epistemic Independence and Strong Rationalizability. Research in Economics 53: 247-273. 1999 [11] Bellman R. E. Dynamic Programming. Princeton University Press. Princeton 1957 [12] Bernheim D. Rationalizable Strategic Behavior. Econometrica 52: 1002-1028. 1984 [13] Binmore K. Modeling Rational Players I. Economics and Philosophy 3: 179214. 1987 [14] Binmore K. Modeling Rational Players II. Economics and Philosophy 4: 955. 1988 [15] Blackburn J. M. Acquisition of Skill: An Analysis of Learning Curves. IHRB Report No. 73. 1936 [16] Blackwell D. Controlled Random Walks. In Proceedings International Congress of Mathematicians III: 336-338. North-Holland. Amsterdam 1956. [17] Blume L. E. and Easley D. Rational Expectations and Rational Learning. Economics Working Paper Archive at WUSTL. Game Theory and Information # 9307003. 1993 90

[18] Bomze I. Noncooperative Two-Person Games in Biology: A Classification. International Journal of Game Theory 15: 31-57. 1986 [19] Bonanno G. and Nehring K. How to Make Sense of the Common Prior Assumption under Incomplete Information. International Journal of Game Theory 28: 409-434. 1999 [20] Brandenburger A. Knowledge and Equilibrium in Games. The Journal of Economic Perspectives 6: 83-101. 1992 [21] Brenner T. Can Evolutionary Algorithms Describe Learning Processes? Journal of Evolutionary Economics 8: 271-283. 1998 [22] Brown G. W. Some Notes on Computation of Games Solutions. Rand Report P-78. The RAND Corporation. California. Santa Monica 1949. [23] Brown G. W. Iterative Solutions of Games by Fictitious play. In Koopmans T.C. Activity Analysis of Production and Allocation. John Wiley 1951 [24] Brown G. W. and von Neumann J. Solutions of Games by Differential Equations. Annals of Mathematical Studies 24: 73-79. 1950 [25] Bush R. R. and Mosteller F. A Mathematical Model for Simple Learning. Psychological Review 58: 313-323. 1951 [26] Bush R. R. and Mosteller F. A Model for Stimulus Generalization and Discrimination. Psychological Review 58: 413-423. 1951 [27] Bush R. R. and Mosteller F. Stochastic Models for Learning. John Wiley. New York 1955 [28] B¨orgers T. On the Relevance of Learning and Evolution to Economic Theory. The Economic Journal 106: 1374-1385. 1996 [29] B¨orgers T. and Sarin R. Learning Through Reinforcement and Replicator Dynamics. Journal of Economic Theory 77: 1-14. 1997 [30] B¨orgers T. and Sarin R. Naive Reinforcement Learning With Endogenous Aspirations. International Economic Review 41: 921-950. 2000 [31] Camerer C. F. Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press. New Jersey 2003 [32] Camerer C. and Ho T.-H. Experience-Weighted Attraction Learning in Coordination Games: Probability Rules, Heterogeneity, and Time-Variation. Journal of Mathematical Psychology 42: 305-326. 1998 [33] Camerer C. and Ho T.-H. Experienced-Weighted Attraction Learning in Normal Form Games. Econometrica 67: 827-874. 1999 [34] Camerer C. F., Ho T.-H. and Chong J.-K. Sophisticated ExperienceWeighted Attraction Learning and Strategic Teaching in Repeated Games. Journal of Economic Theory 104: 137-188. 2002 [35] Camerer C. F., Ho T. H., Chong J.-K and Weigelt K. Strategic Teaching and Equilibrium Models of Repeated Trust and Entry Games. Working Paper. UCLA Department of Economics. Levine’s Bibliography # 506439000000000506. 2003 [36] Carlier G. Nonparametric Adverse Selection Problems. Annals of Operations Research 114: 71-82. 2002 [37] Chen Y. and Khoroshilov Y. Learning Under Limited Information. Games and Economic Behavior 44: 1-25. 2003 [38] Cheung Y.-W. and Friedman D. Individual Learning in Normal Form Games: Some Laboratory Results. Games and Economic Behavior 19: 46-76. 1997 [39] Conitzer V. and Sandholm T. Complexity of Mechanism Design. Proceedings 91

[40] [41] [42] [43] [44] [45] [46]

[47] [48] [49] [50]

[51] [52]

[53] [54] [55] [56] [57] [58] [59]

[60] [61]

of the 18th Annual Conference on Uncertainty in Artificial Intelligence (UAI02): 103-110. 2002 Cournot A. A. Recherches sur les Principes Math´ ematiques de la Th´ eorie de la Richesses. Hachette. Paris 1838 Cross J. G. A Stochastic Learning Model of Economic Behavior. Quarterly Journal of Economics 87: 239-266. 1973 Cross J. G. A Theory of Adaptive Economic Behavior. Cambridge University Press. Cambridge 1983 Dawkins R. The Selfish Gene. Oxford University Press. 1976 Demichelis S. and Ritzberger K. From Evolutionary to Strategic Stability. Journal of Economic Theory 113: 51-75. 2003 Easley D. and Rustichini A. Choice Without Beliefs. Econometrica 67: 11571184. 1999 Ehtamo H., Kitti M. and H¨am¨al¨ainen R. P. Recent Studies on Incentive Design Problems in Game Theory and Management Science. Optimal Control and Differential Games, Essays in Honor of Steffen Jørgensen: 121-134. 2002 Eichberger J. Bayesian Learning in Repeated Normal Form Games. Games and Economic Behavior 11: 254-278. 1995 Ellsberg D. Risk, Ambiguity, and the Savage Axioms. The Quarterly Journal of Economics 75: 643-669. 1961 Ely J. C. and Sandholm W. H. Evolution in Bayesian Games I: Theory. Working Paper. Boston University and University of Wisconsin. 2003 El-Gamal M. A., McKelvey R. D. and Palfrey T. R. A Bayesian Sequential Experimental Study of Learning in Games. Journal of the American Statistical Association 88: 428-435. 1993 Erev I. and Rapoport A. Coordination, “Magic”, and Reinforcement Learning in a Market Entry Game. Games and Economic Behavior 23: 146-175. 1998 Erev I. and Roth A. E. Predicting How People Play Games: Reinforcement Learning in Experimental Games with Unique, Mixed Strategy Equilibria. The American Economic Review 88: 848-881. 1998 Estes W. K. Toward a Statistical Theory of Learning. Psychological Review 57: 94-107. 1950 Estes W. K. and Burke C. J. A Theory of Stimulus Variability in Learning. Psychological Review 60: 276-286. 1953 Feltovich N. Reinforced-based vs. Belief-based Learning Models in Experimental Asymmetric-information Games. Econometrica 68: 605-641. 2000 Fisher R. A. The Genetic Theory of Natural Selection. Clarendon Press. Oxford. 1930 Fox C. R. and Tversky A. Ambiguity Aversion and Comparative Ignorance. The Quarterly Journal of Economics 110: 585-603. 1995 Friedman D. Evolutionary Games in Economics. Econometrica 59: 637-666. 1991 Friedman D. Evolutionary Economics Goes Mainstream: A Review of the Theory of Learning in Games. Journal of Evolutionary Economics 8: 423-432. 1998 Fudenberg D. and Kreps D. M. A Theory of Learning, Experimentation and Equilibrium in Games. Unpublished Paper. Stanford 1988 Fudenberg D. and Kreps D. M. Learning Mixed Equilibria. Games and Eco92

nomic Behavior 5: 320-367. 1993 [62] Fudenberg D. and Kreps D. M. Learning in Extensive Games, I: SelfConfirming Equilibrium. Games and Economic Behavior 8: 20-55. 1995 [63] Fudenberg D. and Levine D. K. Self-Confirming Equilibrium. Econometrica 61: 523-545. 1993 [64] Fudenberg D. and Levine D. K. Steady State Learning and Nash Equilibrium. Econometrica 61: 547-573. 1993 [65] Fudenberg D. and Levine D. K. Consistency and Cautious Fictitious Play. The Journal of Economic Dynamics and Control 19: 1065-1089. 1995 [66] Fudenberg D. and Levine D. K. The Theory of Learning in Games. MIT Press. Cambridge 1999 [67] Fudenberg D. and Levine D. K. Conditional Universal Consistency. Games and Economic Behavior 29: 104-130. 1999 [68] Fudenberg D. and Tirole J. Game Theory. MIT Press. Cambridge 1991 [69] Gabaix X. and Laibson D. A New Challenge for Economics: The Frame Problem. Forthcoming in Collected Essays in Psychology and Economics. Oxford University Press [70] Gentzkow M. Reinforcement Learning in Repeated Games. Unpublished ` Manuscript. Harvard University. Department of Economics. 2001 [71] Gilboa I. and Matsui A. Social Stability and Equilibrium. Econometrica 59: 859-867. 1991 [72] Gilboa I. and Schmeidler D. Case-Based Optimization. Games and Economic Behavior 15: 1-26. 1996 [73] Gilli M. Equilibrio ed Aspettative Nella Teoria dei Giochi e Nella Teoria Economica. Unpublished Dissertation. Bocconi University 1987 [74] Gilli M. Rational Learning in Imperfect Monitoring Games. Universit`a di Bari. Dipartimento di Scienze Economiche. 2000 [75] Gilli M. A General Approach to Rational Learning in Games. Unpublished Manuscript. Universit`a di Bari. Dipartimento di Scienze Economiche. 2000 [76] Grossman S. J. and Hart O. D. An Analysis of the Principal-Agent Problem. Econometrica 51: 7-46. 1983 [77] Hahn F. On the Notion of Equilibrium in Economics. Cambridge University Press. Cambridge 1973 [78] Hahn F. Exercises in Conjectural Equilibrium Analysis. Scandinavian Journal of Economics 79: 210-226. 1977 [79] Hahn F. On Non-Walrasian Equilibria. Review of Economic Studies 45: 1-18. 1978 [80] Halpern J. Y. Characterizing the Common Prior Assumption. Journal of Economic Theory 106: 316-355. 2002 [81] Hannan J. Approximation to Bayes Risk in Repeated Plays. In Drescher M., Tucker A. W. and Wolfe P. Contributions to the Theory of Games 3: 97-139. Princeton University Press. Princeton 1957 [82] Harley C. B. Learning the Evolutionarily Stable Strategy. Journal of Theoretical Biology 89: 611-633. 1981 [83] Harsanyi J. Bargaining in Ignorance of the Opponent’s Utility Function. Journal of Conflict Resolution 6: 29-38. 1962 [84] Harsanyi J. Games with Incomplete Information Played by Bayesian Players. Management Science 14: 159-182, 320-334, 486-502. 1967-1968 93

[85] Harsanyi J. Games with Randomly Distributed Payoffs: a New Rationale for Mixed-strategy Equilibria. International Journal of Game Theory 2: 123. 1973 [86] Hart S. and Mas-Colell A. A General Class of Adaptive Strategies. Journal of Economic Theory 98: 26-54. 2001 [87] Hendon E., Jacobsen H. J. and Sloth B. Fictitious Play in Extensive Form Games. Games and Economic Behavior 15: 177-202. 1996 [88] Herrnstein R. J. On the Law of Effect. Journal of the Experimental Analysis of Behavior 13: 243-266. 1970 [89] Hillas J. On the Definition of the Strategic Stability of Equilibria. Econometrica 58: 1365-1390. 1990 [90] Hillas J. and Kohlberg E. Foundations of Strategic Equilibrium. Economics Working Paper Archive at WUSTL. Game Theory and Information # 9606002. 1996 [91] Hobbes T. Leviathan. 1651 [92] Hofbauer J. and Sandholm W. H. On the Global Convergence of Stochastic Fictitious Play. Econometrica 70: 2265-2294. 2002 [93] Hofbauer J. and Sigmund K. Evolutionary Games and Population Dynamics. Cambridge University Press. 1998 [94] Hofbauer J. and Sigmund K. Evolutionary Game Dynamics. Bulletin of the American Mathematical Society 40: 479-519. 2003 [95] Holland J. Adaption in Natural and Artificial Systems. University of Michigal Press. Ann Arbor 1975 [96] Hopkins E. Learning, Matching, and Aggregation. Games and Economic Behavior 26: 79-110. 1999 [97] Hopkins E. A Note on Best Response Dynamics. Games and Economic Behavior. 29: 138-150. 1999 [98] Hopkins E. Two Competing Models of How People Learn in Games. Econometrica 70: 2141-2166. 2002 [99] Inderst R. Contract Design and Bargaining Power. Economics Letters 74: 171-176. 2002 [100] Jackson M. O. and Kalai E. Social Learning in Recurring Games. Games and Economic Behavior 21: 102-134. 1997 [101] Jordan J. S. Bayesian Learning in Normal Form Games. Games and Economic Behavior 3: 60-81. 1991 [102] Jordan J. S. Bayesian Learning in Repeated Games. Games and Economic Behavior 9: 8-20. 1995 [103] Jullien B. Participation Constraints in Adverse Selection Models. Journal of Economic Theory 93: 1-47. 2000 [104] Kaelbling L. P., Littman M. L. and Moore A. W. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4: 237-285. 1996 [105] Kahneman D., Slovic P. and Tversky A. Judgment under Uncertainty: Heuristics and Biases. Cambridge University Press. 1982 [106] Kahneman D. and Tversky A. Variants of Uncertainty. Cognition 11: 143157. 1982 [107] Kajii A. and Ui T. Incomplete Information Games with Common Multiple Priors. Unpublished Manuscript. Osaka University and Yokohama National University. 2003 94

[108] Kalai E. and Lehrer E. Private-Beliefs Equilibrium. Discussion Papers #926. Northwestern University. Center for Mathematical Studies in Economics and Management Science. 1991 [109] Kalai E. and Lehrer E. Rational Learning Leads to Nash Equilibrium. Econometrica 61: 1019-1045. 1993 [110] Kalai E. and Lehrer E. Subjective Equilibrium in Repeated Games. Econometrica 61: 1231-1240. 1993 [111] Kalai E. and Lehrer E. Subjective Games and Equilibria. Games and Economic Behavior 8: 123-163. 1995 [112] Kalai E. and Samet D. Persistent Equilibria in Strategic Games. International Journal of Game Theory 13: 129-144. 1984 [113] Kandori M., Mailath G. J. and Rob R. Learning, Mutation, and Long Run Equilibria in Games. Econometrica 61: 29-56. 1993 [114] Kaniovski Y. M. and Young. H. P. Learning Dynamics in Games with Stochastic Perturbations. Games and Economic Behavior 11: 330-363. 1995 [115] Karandikar R., Mookherjee D., Ray D. and Vega-Redondo F. Evolving Aspirations and Cooperation. Journal of Economic Theory 80: 292-331. 1998 [116] Kohlberg E. and Mertens J.-F. On the Strategic Stability of Equilibria. Econometrica 54: 1003-1038. 1986 [117] Kosfeld M., Droste E. and Voorneveld M. A Myopic Adjustment Process Leading to Best-Reply Matching. Games and Economic Behavior 40: 270298. 2002 [118] Kreps D. M. and Wilson R. Sequential Equilibria. Econometrica 50: 863894. 1982 [119] Laslier J.-F., Topol R. and Walliser B. A Behavioral Learning Process in Games. Games and Economic Behavior 37: 340-366. 2001 [120] Lewis D. Conventions: A Philosophical Study. Harvard University Press, Cambridge 1969 [121] Luce R. D. Individual Choice Behavior: A Theoretical Choice Behavior. Wiley. New York 1959 [122] Macho-Stadler I. and Perez-Castrillo J. D. An Introduction to the Economics of Information 2nd ed. Oxford. New York 2001 [123] Mailath G. J. Do People Play Nash Equilibrium? Lessons from Evolutionary Game Theory. Journal of Economic Literature 36: 1347-1374. 1998 [124] Maskin E. and Riley J. Monopoly with Incomplete Information. Rand Journal of Economics 15: 171-196. 1984 [125] Matsui A. Best Response Dynamics and Socially Stable Strategies. Journal of Economic Theory 57: 343-362. 1992 [126] Maynard Smith J. The Theory of Games and the Evolution of Animal Conflicts. Journal of Theoretical Biology 47: 209-221. 1974 [127] Maynard Smith J. Genes, Memes, and Minds. New York Review of Books 42: 46-48. 1995 [128] Maynard Smith J. Evolution and the Theory of Games. Cambridge University Press. 1982 [129] Maynard Smith J. and Price G. R. The Logic of Animal Conflict. Nature 246: 15-18. 1973 [130] McCarthy J. and Hayes P. J. Some Philosophical Problems from the Standpoint of Artificial Intelligence. Machine Learning 4: 463-502. 1969 95

[131] McFadden D. Conditional Logit Analysis of Qualitative Choice Behavior. In Zarembka P. Frontier in Econometrics. Academic Press. New York 1973 [132] McKelvey R. D. and Palfrey T. R. Quantal Response Equilibria for NormalForm Games. Games and Economic Behavior 10: 6-38. 1995 [133] McKelvey R. D. and Palfrey T. R. Quantal Response Equilibria for Extensive-Form Games. Experimental Economics 1: 9-41. 1998 [134] Mertens J.-F. and Zamir S. Formulation of Bayesian Analysis for Games with Incomplete Information. International Journal of Game Theory 14: 129. 1985 [135] Mirrlees J. A. An Exploration in the Theory of Optimum Income Taxation. Review of Economic Studies 38: 175-208. 1971 [136] Mitropoulos A. Little Information, Efficiency, and Learning - An Experimental Study. Game Theory and Information #0110002. Economics Working Paper Archive at WUSTL. 2001 [137] Moulin H. Axioms of Cooperative Decision Making. Cambridge University Press. Cambridge 1988 [138] Mussa M. and Rosen S. Monopoly and Product Quality. Journal of Economic Theory 18: 301-317. 1978 [139] Myerson R. Refinement of the Nash Equilibrium Concept. International Journal of Game Theory 7: 73-80. 1978 [140] Myerson R. B. Mechanism Design by an Informed Principal. Econometrica 51: 1767-1798. 1983 [141] Myerson R. B. Game Theory. Harvard University Press. Cambridge 1997 [142] Myerson R. B. John Nash’s Contribution to Economics. Games and Economic Behavior 14: 287-295. 1996 [143] Myerson R. B. Nash Equilibrium and the History of Economic Theory. Journal of Economic Literature 37: 1067-1082. 1999 [144] Nachbar J. H. Prediction, Optimization, and Learning in Repeated Games. Econometrica 65: 275-309. 1997 [145] Nash J. F. Equilibrium Points in N-person Games. In Proceedings of the National Academy of Sciences 36: 48-49. 1950 [146] Nash J. Non-Cooperative Games. Dissertation. Princeton University. 1950 [147] Nash J. F. The Bargaining Problem. Econometrica 18: 155-162. 1950 [148] Nash J. F. Noncooperative Games. Annals of Mathematics 54: 289-295. 1951 [149] Nash J. Two-Person Cooperative Games. Econometrica 21: 128-140. 1953 [150] Nehring K. Common Priors under Incomplete Information: A Unification. Economic Theory 18: 535-553. 2001 [151] Nyarko Y. and Schotter A. An Experimental Study of Belief Learning Using Elicited Beliefs. Econometrica 70: 971-1005. 2002 [152] Pareto V. Manual of Political Economy. 1906 [153] Pearce D. Rationalizable Strategic Behavior and the Problem of Perfection. Econometrica 52: 1029-1050. 1984 [154] Raiffa H. Risk, Ambiguity, and the Savage Axioms: Comment. The Quarterly Journal of Economics 75: 690-694. 1961 [155] Rapoport A., Stein W. E., Parco J. E. and Nicholas T. E. Equilibrium Play and Adaptive Learning in a Three-person Centipede Game. Games and Economic Behavior 43: 239-265. 2003 96

[156] Rasmusen E. Games and Information 2nd ed. Blackwell. Malden 1994 [157] Rechenberg I. Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. Frommann-Holzboog. Stuttgart 1973 [158] Ritzberger K. and Weibull J. W. Evolutionary Selection in Normal-Form Games. Econometrica 63: 1371-1399. 1995 [159] Robinson J. An Iterative Method of Solving a Game. Annals of Mathematics 54: 296-301. 1951 [160] Robson A. J. The Evolution of Strategic Behavior. The Canadian Journal of Economics 28: 17-41. 1995 [161] Ross S. A. The Economic Theory of Agency: The Principal’s Problem. The American Economic Review 63: 134-139. 1973 [162] Roth A. E. and Erev I. Learning in Extensive-Form Games: Experimental Data and Simple Dynamic Models in the Intermediate Term. Games and Economic Behavior 8: 164-212. 1995 [163] Rustichini A. Optimal Properties of Stimulus-Response Learning Models. Games and Economic Behavior 29: 244-273. 1999 [164] Salanie B. The Economics and Contracts. MIT Press. Cambridge 2002 [165] Salmon T. C. An Evaluation of Econometric Models of Adaptive Learning. Econometrica 69: 1597-1628. 2001 [166] Sandholm W. H. Evolution in Bayesian Games II: Applications. Working Paper. Boston University and University of Wisconsin. 2003 [167] Sandroni A. Does Rational Learning Lead to Nash Equilibrium in Finitely Repeated Games? Journal of Economic Theory 78: 195-218. 1998 [168] Sarin R. Decision Rules with Bounded Memory. Journal of Economic Theory 90: 151-160. 2000 [169] Sarin R. and Vahid F. Payoff Assessments without Probabilities: A Simple Dynamic Model of Choice. Games and Economic Behavior 28: 294-309. 1999 [170] Sarin R. and Vahid F. Predicting How People Play Games: A Simple Dynamic Model of Choice. Games and Economic Behavior 34: 102-122. 2001 [171] Savage L. J. The Foundations of Statistics. Wiley. New York 1954 [172] Schelling T. The Strategy of Conflict. Harvard University Press. Cambridge 1960 [173] Schlag K. H. Why Imitate, and If So, How? A Boundedly Rational Approach to Multi-armed Bandits. Journal of Economic Theory 78: 130-156. 1998 [174] Selten R. Spieltheoretische Behandlung Eines Oligopolmodels mit Nachfragetragheit. Zeitschrift f¨ ur die Gesamte Staatswissenchaft 121: 301-324, 667-689. 1965 [175] Selten R. Re-examination of the Perfectness Concept for Equilibrium Points in Extensive Games. International Journal of Game Theory 4: 25-55. 1975 [176] Selten R. Evolution, Learning, and Economic Behavior. Games and Economic Behavior 3: 3-24. 1991 [177] Shapley L. Some Topics in Two-Person Games. In Drescher M., Shapley L. and Tucker A. W. Advances in Game Theory 52: 1-28. Princeton University Press. Princeton 1964 [178] Simon H. A. A Behavioral Model of Rational Choice. Quarterly Journal of Economics 69: 99-118. 1955 [179] Skinner B. F. The Behavior of Organisms: An Experimental Analysis. 97

Appleton-Century. New York 1938 [180] Skinner B. F. Science and Human Behavior. Macmillan. New York 1953 [181] Sobel J. Economists’ Models of Learning. Journal of Economic Theory 94: 241-261. 2000 [182] Spence M. Nonlinear Prices and Economic Welfare. Journal of Public Economics 8: 1-18. 1977 [183] Spence M. A. Multi-Product Quantity-Dependent Prices and Profitability Constraints. The Review of Economic Studies 47: 821-841. 1980 [184] Stahl D. O. Evidence Based Rules and Learning in Symmetric Normal-Form Games. International Journal of Game Theory 28: 111-130. 1999 [185] Stahl D. O. Rule Learning in Symmetric Normal-Form Games: Theory and Evidence. Games and Economic Behavior 32: 105-138. 2000 [186] Stahl D. O. Action Reinforcement Learning versus Rule Learning. Greek Economic Review. Forthcoming [187] Swinkels J. M. Adjustment Dynamics and Rational Play in Games. Games and Economic Behavior 5: 455-484. 1993 [188] Taylor P. and Jonker L. Evolutionary Stable Strategies and Game Dynamics. Mathematical Biosciences 40: 145-156. 1978 [189] Thorndike E. L. Animal Intelligence: An Experimental Study of the Associative Processes in Animals. Psychological Review. Monograph Supplements 8. Macmillan. New York 1898 [190] Thorndike E.L. Animal Intelligence. Macmillan. New York 1911 [191] Thurstone L. Psychophysical analysis. American Journal of Psychology 28: 368-389. 1927 [192] Thurstone L. L. The Learning Function. Journal of General Psychology 3: 469-493. 1930 [193] Turgot A.-R.-J. R´ eflexions sur la formation et la distribution des richesses. 1766 [194] Tversky A. and Kahneman D. Judgment under Uncertainty: Heuristics and Biases. Science 185: 1124-1131. 1974 [195] Varian H. R. Microeconomic Analysis 3rd ed. W.W. Norton. New York 1992 [196] Van Huyck J. B., Battalio R. C. and Rankin F. W. Selection Dynamics and Adaptive Behavior Without Much Information. Unpublished Manuscript. Texas A&M University. 2001 [197] Von Hayek F. A. Economics and Knowledge. Economica 4: 33-54. 1937 [198] Von Neumann J. Zur Theories der Gesellschaftsspiele. Mathematische Annalen 100: 295-320. 1928 [199] Von Neumann J. and Morgenstern O. Theory of Games and Economic Behavior. Princeton University Press. 1944 [200] Von Stackelberg H. Marktform und Gleichgewicht. Springer. Vienna 1934 [201] Vostroknutov A. Nonprobabilistic Decision Making: A Model of Choice with Finite Memory. Dissertation. University College London. 2002 ´ L. El´ ´ ements d’´ [202] Walras M.-E. economie politique pure, ou th´ eorie de la richesse sociale. 1874 [203] Washburn A. A New Kind of Fictitious Play. Naval Research Logistics 48: 270-280. 2001 [204] Watson J. B. Behaviorism. University of Chicago Press. Chicago 1930 [205] Weibull J. W. Evolutionary Game Theory. MIT Press. 1995 98

[206] Weibull J. W. What have we learned from Evolutionary Game Theory so far? Working Paper #487. IUI Working Paper Series. The Research Institute of Industrial Economics. 1997 [207] Weibull J. W. Evolution, Rationality, and Equilibrium in Games. European Economic Review 42: 641-649. 1998 [208] Young H. P. The Evolution of Conventions. Econometrica 61: 57-84. 1993 [209] Young H. P. Individual Learning and Social Rationality. European Economic Review 42: 651-663. 1998 [210] Young H. P. Individual Strategy and Social Structure: An Evolutionary Theory of Institutions. Princeton University Press. Princeton 2001 ¨ [211] Zermelo E. Uber eine Anwendung der Mengenlehre auf die Theorie des Schachspiels. In Hobson E. W. and Love A. E. H. Proceedings of the Fifth International Congress of Mathematicians 2: 501-504. Cambridge University Press. 1912

