Introduction to Game Theory
7. Repeated Games
Dana Nau, University of Maryland
Repeated Games

Repeated games are used by game theorists, economists, and social and behavioral scientists as highly simplified models of various real-world situations. Examples:
» Iterated Prisoner’s Dilemma
» Iterated Chicken Game
» Roshambo (Rock, Paper, Scissors)
» Iterated Battle of the Sexes
» Repeated Ultimatum Game
» Repeated Stag Hunt
» Repeated Matching Pennies
Finitely Repeated Games

In repeated games, some game G is played multiple times by the same set of agents. G is called the stage game.
• Usually (but not always), G is a normal-form game.
Each occurrence of G is called an iteration or a round.
Usually each agent knows what all the agents did in the previous iterations, but not what they’re doing in the current iteration. Thus a repeated game is an imperfect-information game with perfect recall.
Usually each agent’s payoff function is additive.

Prisoner’s Dilemma (agent 1 chooses the row, agent 2 the column):

              C      D
    C       3, 3   0, 5
    D       5, 0   1, 1

Iterated Prisoner’s Dilemma, with 2 iterations:

              Round 1   Round 2   Total payoff
   Agent 1:      C         D      3 + 5 = 8
   Agent 2:      C         C      3 + 0 = 3
Strategies

The repeated game has a much bigger strategy space than the stage game.
One kind of strategy is a stationary strategy: use the same strategy at every iteration.
More generally, an agent’s play at each stage may depend on the history, i.e., what happened in previous iterations.
Backward Induction

If the number of iterations is finite and known, we can use backward induction to get a subgame-perfect equilibrium.
Example: finitely many repetitions of the Prisoner’s Dilemma.
» In the last round, the dominant strategy is D. That’s common knowledge.
» So in the 2nd-to-last round, D also is the dominant strategy.
» …
» The SPE is (D,D) on every round:

              Round 1   Round 2   Round 3   Round 4   …
   Agent 1:      D         D         D         D      …
   Agent 2:      D         D         D         D      …

As with the Centipede game, this argument is vulnerable to both empirical and theoretical criticisms.
Backward Induction when G is 0-sum

As before, backward induction works much better in zero-sum games.
» In the last round, the equilibrium is the minimax profile.
  • Each agent uses his/her minimax strategy.
» That’s common knowledge. So in the 2nd-to-last round, the equilibrium again is the minimax strategies.
» …
» The SPE is the minimax profile on every round.
Infinitely Repeated Games

An infinitely repeated game in extensive form would be an infinite tree.
» Payoffs can’t be attached to any terminal nodes.
» Payoffs can’t be the sums of the payoffs in the stage games (the sums would generally be infinite).
Two common ways around this problem. Let r_i^(1), r_i^(2), … be an infinite sequence of payoffs for agent i.
» Agent i’s average reward is

    lim_{k→∞} (1/k) ∑_{j=1}^{k} r_i^(j)

» Agent i’s future discounted reward is the discounted sum of the payoffs, i.e.,

    ∑_{j=1}^{∞} β^j r_i^(j)

  where β (with 0 ≤ β ≤ 1) is a constant called the discount factor.
Two ways to interpret the discount factor:
1. The agent cares more about the present than the future.
2. The agent cares about the future, but the game ends at any round with probability 1 − β.
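As a quick illustration, here is a minimal Python sketch (my own, not from the slides) that approximates both criteria for a given payoff sequence; the truncation length and the discount factor β = 0.9 are arbitrary example choices.

    # Approximate the two payoff criteria for an infinitely repeated game,
    # using a long finite prefix of the payoff sequence.

    def average_reward(payoffs):
        """(1/k) * sum of the first k payoffs -- approximates the limit."""
        return sum(payoffs) / len(payoffs)

    def discounted_reward(payoffs, beta):
        """sum_{j=1}^{k} beta^j * r^(j), as in the slides' formula."""
        return sum(beta**j * r for j, r in enumerate(payoffs, start=1))

    # Example: always getting the (C,C) payoff of 3 in the Prisoner's Dilemma.
    payoffs = [3] * 10000
    print(average_reward(payoffs))          # 3.0
    print(discounted_reward(payoffs, 0.9))  # ~ 3 * 0.9/(1-0.9) = 27.0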
Example

Some well-known strategies for the Iterated Prisoner’s Dilemma:
» AllC: always cooperate.
» AllD (the Hawk strategy): always defect.
» Grim: cooperate until the other agent defects, then defect forever.
» Tit-for-Tat (TFT): cooperate on the first move. On the nth move, repeat the other agent’s (n−1)th move.
» Tester: defect on move 1. If the other agent retaliates, play TFT. Otherwise, randomly intersperse cooperation and defection.

[Tables omitted: move sequences for (AllC, AllC), (Grim, Grim), and (TFT, TFT); for TFT or Grim vs. AllD; and for TFT vs. Tester.]

If the discount factor is large enough, each of the following is a Nash equilibrium: (TFT, TFT), (TFT, Grim), and (Grim, Grim).
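For concreteness, here is a minimal Python sketch of these strategies (my own code; the history-as-lists interface, and the exact retaliation test in Tester, are illustrative interpretations, not from the slides).

    import random

    # Each strategy maps (my_history, their_history) -> 'C' or 'D'.

    def all_c(mine, theirs):
        return 'C'

    def all_d(mine, theirs):
        return 'D'

    def grim(mine, theirs):
        # Cooperate until the other agent has ever defected, then defect forever.
        return 'D' if 'D' in theirs else 'C'

    def tft(mine, theirs):
        # Cooperate first; afterwards repeat the other agent's previous move.
        return theirs[-1] if theirs else 'C'

    def tester(mine, theirs):
        if not mine:
            return 'D'                       # defect on move 1
        if len(theirs) < 2:
            return 'C'                       # too early to see a retaliation
        if theirs[1] == 'D':
            return tft(mine, theirs)         # the other agent retaliated: play TFT
        return random.choice(['C', 'D'])     # otherwise intersperse C and D

    PAYOFF = {'CC': (3, 3), 'CD': (0, 5), 'DC': (5, 0), 'DD': (1, 1)}

    def play(strat1, strat2, rounds):
        h1, h2, s1, s2 = [], [], 0, 0
        for _ in range(rounds):
            m1, m2 = strat1(h1, h2), strat2(h2, h1)
            p1, p2 = PAYOFF[m1 + m2]
            h1.append(m1); h2.append(m2)
            s1 += p1; s2 += p2
        return s1, s2

    print(play(tft, all_d, 10))   # (9, 14): TFT loses only the first round to AllD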
Equilibrium Payoffs for Repeated Games

There’s a “folk theorem” that tells what the possible equilibrium payoffs are in repeated games. It says roughly the following:
» In an infinitely repeated game whose stage game is G, there is a Nash equilibrium whose average payoffs are (p1, p2, …, pn) if and only if G has a mixed-strategy profile (s1, s2, …, sn) with the following property:
  • For each i, (s1, …, sn) gives agent i the payoff pi, and pi ≥ agent i’s minimax value, i.e., the best payoff i could guarantee if the other agents used their minimax strategies against i.
Proof and Examples

The proof proceeds in 2 parts:
1. Use the definitions of minimax and best response to show that in every equilibrium, an agent’s average payoff ≥ the agent’s minimax value.
2. Show how to construct an equilibrium that gives each agent i the average payoff pi, given certain constraints on (p1, p2, …, pn).
   • In this equilibrium, the agents cycle in lock-step through a sequence of game outcomes that achieve (p1, p2, …, pn).
   • If any agent i deviates, then the others punish i forever, by playing their minimax strategies against i.
There’s a large family of such theorems, for various conditions on the game.

Example 1: IPD with (p1, p2) = (3, 3) — both agents play C on every round (e.g., Grim vs. Grim).
Example 2: IPD with (p1, p2) = (2.5, 2.5) — the agents alternate between the outcomes (C,D) and (D,C), whose payoffs average to (2.5, 2.5).

[Tables omitted: move sequences for the two examples, with minimax punishment if either agent deviates.]
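A quick sanity check of the examples (a worked computation of my own, not from the slides): cycling through a sequence of stage-game outcomes averages their payoffs.

    # Average payoffs of a repeating cycle of Prisoner's Dilemma outcomes.
    PD = {'CC': (3, 3), 'CD': (0, 5), 'DC': (5, 0), 'DD': (1, 1)}

    def cycle_average(outcomes):
        p1 = sum(PD[o][0] for o in outcomes) / len(outcomes)
        p2 = sum(PD[o][1] for o in outcomes) / len(outcomes)
        return p1, p2

    print(cycle_average(['CC']))        # (3.0, 3.0)  -- Example 1
    print(cycle_average(['CD', 'DC']))  # (2.5, 2.5)  -- Example 2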
Zero-Sum Repeated Games

For two-player zero-sum repeated games, the folk theorem is still true, but it becomes vacuous.
Suppose we iterate a two-player zero-sum game G. Let V be the value of G (from the Minimax Theorem).
» If agent 2 uses a minimax strategy against 1, then 1’s maximum payoff is V.
  • Thus the max value for p1 is V, so the min value for p2 is −V.
» If agent 1 uses a minimax strategy against 2, then 2’s maximum payoff is −V.
  • Thus the max value for p2 is −V, so the min value for p1 is V.
» Thus in the iterated game, the only Nash-equilibrium payoff profile is (V, −V).
The only way to get this is if each agent always plays his/her minimax strategy.
• If agent 1 plays a non-minimax strategy s1 and agent 2 plays his/her best response, 2’s expected payoff will be higher than −V.
Roshambo (Rock, Paper, Scissors)

Payoff matrix (A1 chooses the row, A2 the column):

                 Rock     Paper    Scissors
    Rock         0, 0     −1, 1     1, −1
    Paper        1, −1     0, 0    −1, 1
    Scissors    −1, 1      1, −1    0, 0

Nash equilibrium for the stage game: choose randomly, P = 1/3 for each move.
Nash equilibrium for the repeated game: always choose randomly, P = 1/3 for each move.
» Expected payoff = 0.
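A minimal simulation (my own illustration, not from the slides) of the repeated-game equilibrium, confirming an expected payoff near 0:

    import random

    MOVES = ['Rock', 'Paper', 'Scissors']
    BEATS = {'Rock': 'Scissors', 'Paper': 'Rock', 'Scissors': 'Paper'}

    def payoff1(m1, m2):
        """Payoff to player 1 for one round of roshambo."""
        if m1 == m2:
            return 0
        return 1 if BEATS[m1] == m2 else -1

    total = sum(payoff1(random.choice(MOVES), random.choice(MOVES))
                for _ in range(100_000))
    print(total / 100_000)   # close to 0, the game-theoretic expected payoff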
Roshambo (Rock, Paper, Scissors), continued

Let’s see how that works out in practice …
1999 international roshambo programming competition (www.cs.ualberta.ca/~darse/rsbpc1.html)
Round-robin tournament:
• 55 programs, 1000 iterations for each pair of programs.
• Lowest possible score = −55000, highest possible score = 55000.
Average over 25 tournaments:
• Highest score (Iocaine Powder): 13038.
• Lowest score (Cheesebot): −36006.
Very different from the game-theoretic prediction.
A Nash equilibrium strategy is best for you if the other agents also use their Nash equilibrium strategies.
In many cases, the other agents won’t use Nash equilibrium strategies.
» If you can forecast their actions accurately, you may be able to do much better than the Nash equilibrium strategy.
» Why won’t the other agents use their Nash equilibrium strategies? Because they may be trying to forecast your actions too.
Something analogous can happen in non-zero-sum games.
Iterated Prisoner’s Dilemma

Multiple iterations of the Prisoner’s Dilemma (P1 chooses the row, P2 the column; (Defect, Defect) is the stage game’s Nash equilibrium):

                  Cooperate   Defect
    Cooperate       3, 3       0, 5
    Defect          5, 0       1, 1

Widely used to study the emergence of cooperative behavior among agents,
e.g., Axelrod (1984), The Evolution of Cooperation.
Axelrod ran a famous set of tournaments:
» People contributed strategies encoded as computer programs.
» Axelrod played them against each other.
“If I defect now, he might punish me by defecting next time.”
TFT with Other Agents

In Axelrod’s tournaments, TFT usually did best.
» It could establish and maintain cooperation with many other agents.
» It could prevent malicious agents from taking advantage of it.

[Tables omitted: move sequences for TFT vs. AllC, TFT vs. AllD, TFT vs. Grim, TFT vs. TFT, and TFT vs. Tester.]
Example

A real-world example of the IPD, described in Axelrod’s book: World War I trench warfare.
Incentive to cooperate:
» If I attack the other side, then they’ll retaliate and I’ll get hurt.
» If I don’t attack, maybe they won’t either.
Result: evolution of cooperation. Although the two infantries were supposed to be enemies, they avoided attacking each other.
IPD with Noise

In noisy environments, there’s a nonzero probability (e.g., 10%) that a “noise gremlin” will change some of the actions:
• Cooperate (C) becomes Defect (D), and vice versa.
Noise can be used to model two things:
» Accidents: compute the score using the changed action.
» Misinterpretations (“Did he really intend to do that?”): compute the score using the original action.

[Diagram omitted: a C in one agent’s move sequence changed to D by noise.]
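A minimal sketch (my own, not from the slides) of the two noise models, applied to agent 1’s move; the noise rate of 0.1 is the example value from above.

    import random

    PD = {'CC': (3, 3), 'CD': (0, 5), 'DC': (5, 0), 'DD': (1, 1)}

    def flip(move):
        return 'D' if move == 'C' else 'C'

    def noisy_round(m1, m2, p_noise=0.1, accident=True):
        """Apply the noise gremlin to agent 1's move.
        accident=True : score using the changed action (an accident).
        accident=False: score using the original action (a misinterpretation --
                        only the other agent's perception of the move changes)."""
        observed1 = flip(m1) if random.random() < p_noise else m1
        executed1 = observed1 if accident else m1
        return PD[executed1 + m2], observed1

    print(noisy_round('C', 'C'))   # usually ((3, 3), 'C'); sometimes noise hits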
Example of Noise

Story from a British army officer in World War I:
  “I was having tea with A Company when we heard a lot of shouting and went out to investigate. We found our men and the Germans standing on their respective parapets. Suddenly a salvo arrived but did no damage. Naturally both sides got down and our men started swearing at the Germans, when all at once a brave German got onto his parapet and shouted out: ‘We are very sorry about that; we hope no one was hurt. It is not our fault. It is that damned Prussian artillery.’”
The salvo wasn’t the German infantry’s intention. They didn’t expect it, nor desire it.
Noise Makes it Difficult to Maintain Cooperation

Consider two agents who both use TFT.
One accident or misinterpretation can cause a long string of retaliations: after noise turns one C into a D, each agent keeps retaliating against the other’s previous defection, and the defection echoes back and forth.

[Diagram omitted: two TFT move sequences in which a single noisy move triggers alternating retaliation.]
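A small simulation of the echo effect (my own illustration): flipping a single move between two TFT players produces alternating retaliation from then on.

    def tft(mine, theirs):
        return theirs[-1] if theirs else 'C'

    h1, h2 = [], []
    for t in range(10):
        m1, m2 = tft(h1, h2), tft(h2, h1)
        if t == 3:
            m1 = 'D'          # the noise gremlin flips one of agent 1's moves
        h1.append(m1); h2.append(m2)

    print(''.join(h1))   # CCCDCDCDCD  -- retaliations echo back and forth
    print(''.join(h2))   # CCCCDCDCDC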
Some Strategies for the Noisy IPD

Principle: be more forgiving in the face of defections.
Tit-For-Two-Tats (TFTT)
» Retaliate only if the other agent defects twice in a row.
  • Can tolerate isolated defections, but is susceptible to exploitation of its generosity.
  • Beaten by the Tester strategy described earlier.
Generous Tit-For-Tat (GTFT)
» Forgive randomly: cooperate with some small probability when the other agent defects.
» Better than TFTT at avoiding exploitation, but worse at maintaining cooperation.
Pavlov (Win-Stay, Lose-Shift; see the sketch below)
» Repeat my previous move if I earned 3 or 5 points in the previous iteration.
» Reverse my previous move if I earned 0 or 1 points in the previous iteration.
» Thus if the other agent defects continuously, Pavlov will alternately cooperate and defect.
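A minimal Pavlov sketch (my own implementation of the rule above):

    PD = {'CC': (3, 3), 'CD': (0, 5), 'DC': (5, 0), 'DD': (1, 1)}

    def pavlov(mine, theirs):
        """Win-Stay, Lose-Shift: keep my last move after a payoff of 3 or 5,
        reverse it after a payoff of 0 or 1."""
        if not mine:
            return 'C'
        last_payoff = PD[mine[-1] + theirs[-1]][0]
        if last_payoff in (3, 5):                 # "win": stay
            return mine[-1]
        return 'D' if mine[-1] == 'C' else 'C'    # "lose": shift

    # Against continuous defection, Pavlov alternates C and D:
    mine, theirs = [], []
    for _ in range(6):
        mine.append(pavlov(mine, theirs))
        theirs.append('D')
    print(''.join(mine))   # CDCDCD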
Discussion

The British army officer’s story: a German shouted, “We are very sorry about that; we hope no one was hurt. It is not our fault. It is that damned Prussian artillery.”
» The apology avoided a conflict.
» It was convincing because it was consistent with the German infantry’s past behavior: the British had ample evidence that the German infantry wanted to keep the peace.
If you can tell which actions are affected by noise, you can avoid reacting to the noise.
» IPD agents often behave deterministically; for others to cooperate with you, it helps if you’re predictable.
» This makes it feasible to build a model of an agent from its observed behavior.
The DBS Agent

Work by my recent PhD graduate, Tsz-Chiu Au (now a postdoc at the University of Texas).
» From the other agent’s recent behavior, build a model π of the other agent’s strategy.
» Use the model to filter noise.
» Use the model to help plan our next move.

Au & Nau. Accident or intention: That is the question (in the iterated prisoner’s dilemma). AAMAS, 2006.
Au & Nau. Is it accidental or intentional? A symbolic approach to the noisy iterated prisoner’s dilemma. In G. Kendall (ed.), The Iterated Prisoners Dilemma: 20 Years On. World Scientific, 2007.
Modeling the other agent

The model π is a set of rules of the following form:
» if our last move was m and their last move was m′, then P[their next move will be C] = p
Four rules: one for each of (C,C), (C,D), (D,C), and (D,D).
For example, TFT can be described as
» (C,C) → 1, (C,D) → 1, (D,C) → 0, (D,D) → 0
How to get the probabilities? One way: look at the agent’s behavior in the recent past (see the sketch below).
» During the last k iterations, what fraction of the time did the other agent cooperate at iteration j when the agents’ moves were (x,y) at iteration j−1?
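A minimal sketch of this frequency estimate (my own code; the smoothing-free ratio and the window size k = 10 are illustrative choices):

    def estimate_pi(mine, theirs, k=10):
        """Estimate P[their next move is C | (my move, their move) last iteration]
        from the last k iterations of the two move histories."""
        counts = {('C','C'): [0,0], ('C','D'): [0,0],
                  ('D','C'): [0,0], ('D','D'): [0,0]}
        start = max(1, len(mine) - k)
        for j in range(start, len(mine)):
            key = (mine[j-1], theirs[j-1])       # moves at iteration j-1
            counts[key][1] += 1                  # times this condition occurred
            if theirs[j] == 'C':
                counts[key][0] += 1              # times they then cooperated
        return {key: c / n for key, (c, n) in counts.items() if n > 0}

    # The other agent here plays TFT (it copies `mine`, shifted by one move),
    # so the learned rules match TFT: (C,C) -> 1, (C,D) -> 1, (D,C) -> 0.
    mine   = list('CCDCCDCC')
    theirs = list('CCCDCCDC')
    print(estimate_pi(mine, theirs))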
Modeling the other agent, continued

π can only model a very small set of strategies. It doesn’t even model the Grim strategy correctly:
» If Grim defects, it may be defecting because of something that happened many moves ago.
But we’re not trying to model an agent’s entire strategy, just its recent behavior.
» If an agent’s behavior changes, then the probabilities in π will change.
» E.g., after Grim defects a few times, the rules will give a very low probability of it cooperating again.
Noise Filtering

Suppose the applicable rule is deterministic:
» P[their next move will be C] = 0 or 1
If the other agent’s next move isn’t what the rule predicts, then
» assume the observed action is noise, and
» behave as if the action were what the rule predicted.
“The other agent cooperates when I do, so I won’t retaliate here. I think these defections are actually noise.”
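A minimal sketch of this filter (my own code, assuming π is represented as in the earlier sketch):

    def filter_noise(pi, my_last, their_last, observed):
        """If the applicable rule is deterministic and contradicts the observed
        move, treat the observation as noise and return the predicted move."""
        p_c = pi.get((my_last, their_last))
        if p_c == 1 and observed == 'D':
            return 'C'            # rule says they always cooperate here: noise
        if p_c == 0 and observed == 'C':
            return 'D'            # rule says they always defect here: noise
        return observed           # rule is probabilistic or agrees: keep it

    pi = {('C','C'): 1.0, ('C','D'): 1.0, ('D','C'): 0.0, ('D','D'): 0.0}
    print(filter_noise(pi, 'C', 'C', 'D'))   # 'C' -- the D is treated as noise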
Change of Behavior

Anomalies in observed behavior can be due either to noise or to a genuine change of behavior.
Changes of behavior occur because the other agent can change its strategy at any time.
» E.g., if noise affects one of Agent 1’s actions, this may trigger a change in Agent 2’s behavior (“I am Grim. If you ever betray me, I will never forgive you.”).
  • Agent 1 does not know this has happened.
How to distinguish noise from a real change of behavior?

[Diagram omitted: a move sequence in which the other agent’s continuing defections are a real change of behavior, not noise.]
Detection of a Change of Behavior

Temporary tolerance: when we observe unexpected behavior from the other agent,
» don’t immediately decide whether it’s noise or a real change of behavior;
» instead, defer judgment for a few iterations.
If the anomaly persists, then recompute π based on the other agent’s recent behavior (see the sketch below).
“The other agent cooperates when I do. The defections might be accidents, so I shouldn’t lose my temper too soon. … I think the other agent’s behavior has really changed, so I’ll change mine too.”
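A minimal sketch of temporary tolerance (my own interpretation of the mechanism; the 3-iteration tolerance window is an arbitrary example value):

    class ToleranceFilter:
        """Defer judgment on anomalies; if they persist, accept the new behavior."""
        def __init__(self, window=3):
            self.window = window
            self.anomalies = 0

        def observe(self, predicted, observed):
            if observed == predicted:
                self.anomalies = 0
                return observed
            self.anomalies += 1
            if self.anomalies < self.window:
                return predicted      # still tolerating: treat as noise
            return observed           # anomaly persisted: a real change of
                                      # behavior; the caller should recompute pi

    f = ToleranceFilter()
    for obs in ['D', 'D', 'D']:       # the model predicted 'C' each time
        print(f.observe('C', obs))    # C, C, D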
Move generation

Modified version of game-tree search:
» Use the model π to predict probabilities of the other agent’s moves.
» Compute the expected utility of each move x as

    u1(x) = ∑_{y∈{C,D}} u1(x,y) × P(y | π, previous moves)

  where x = my move and y = the other agent’s move.
» Choose the move with the highest expected utility.

[Diagram omitted: a search tree whose levels are the current iteration, the next iteration, and the iteration after next, branching on the outcomes (C,C), (C,D), (D,C), (D,D).]
Example

Suppose we have the rules
1. (C,C) → 0.7
2. (C,D) → 0.4
3. (D,C) → 0.1
4. (D,D) → 0.1
and the last iteration’s moves were (C,C), so Rule 1 predicts P(C) = 0.7, P(D) = 0.3.
Suppose we search to depth 1:
» u1(C) = 0.7 u1(C,C) + 0.3 u1(C,D) = 2.1 + 0 = 2.1
» u1(D) = 0.7 u1(D,C) + 0.3 u1(D,D) = 3.5 + 0.3 = 3.8
» So D looks better.
Is D really what we should choose?
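A quick check of the depth-1 computation (my own code, using the Prisoner’s Dilemma payoffs from earlier):

    PD = {'CC': (3, 3), 'CD': (0, 5), 'DC': (5, 0), 'DD': (1, 1)}
    p_c = 0.7   # Rule 1: P[they cooperate | last outcome was (C,C)]

    def u1_depth1(x, p_c):
        """Expected depth-1 utility of my move x against P(C) = p_c."""
        return p_c * PD[x + 'C'][0] + (1 - p_c) * PD[x + 'D'][0]

    print(u1_depth1('C', p_c))   # 0.7*3 + 0.3*0 = 2.1
    print(u1_depth1('D', p_c))   # 0.7*5 + 0.3*1 = 3.8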
Example, continued

With the same rules, Rule 1 predicts P(C) = 0.7, P(D) = 0.3 — but it’s not wise to choose D:
» On the move after that, the opponent will retaliate with P = 0.9 (Rules 3 and 4 give only a 0.1 probability of cooperation after I defect).
» The depth-1 search didn’t see this.
But if we search to depth d > 1, we’ll see it; C will look better and we’ll choose it instead.
In general, it’s best to look far ahead,
» e.g., 60 moves.
How to Search Deeper

Game trees grow exponentially with search depth.
» How can we search the tree deeply?
Key assumption: π accurately models the other agent’s future behavior.
Then we can use dynamic programming (see the sketch below):
» makes the search polynomial in the search depth;
» can easily search to depth 60;
» equivalent to solving an acyclic MDP of depth 60.
This generates fairly good moves.

[Diagram omitted: the search tree collapsed into one node per outcome (C,C), (C,D), (D,C), (D,D) at each iteration.]
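A minimal dynamic-programming sketch (my own code, under the stated assumption that π fully determines the other agent’s move probabilities). There is one state per last-iteration outcome, so each depth level costs constant work instead of branching exponentially:

    PD = {'CC': (3, 3), 'CD': (0, 5), 'DC': (5, 0), 'DD': (1, 1)}
    pi = {'CC': 0.7, 'CD': 0.4, 'DC': 0.1, 'DD': 0.1}   # P[they play C | state]

    def plan(pi, depth=60):
        """Finite-horizon value iteration over the 4 outcome-states; returns
        the best first move from each state and each state's value."""
        V = {s: 0.0 for s in PD}
        best = {}
        for _ in range(depth):
            newV = {}
            for s in PD:
                p_c = pi[s]
                vals = {}
                for x in 'CD':          # my candidate move
                    vals[x] = sum(p * (PD[x + y][0] + V[x + y])
                                  for y, p in (('C', p_c), ('D', 1 - p_c)))
                best[s] = max(vals, key=vals.get)
                newV[s] = vals[best[s]]
            V = newV
        return best, V

    best, V = plan(pi)
    print(best['CC'])   # 'C' -- looking far ahead reverses the depth-1 choice D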
20th Anniversary IPD Competition

http://www.prisoners-dilemma.com
Category 2: IPD with noise; 165 programs participated.
» DBS dominated the top 10 places.
» Two agents scored higher than DBS. They both used master-and-slaves strategies.
Master & Slaves Strategy

Each participant could submit up to 20 programs. Some submitted programs that could recognize each other (by communicating pre-arranged sequences of Cs and Ds).
The 20 programs worked as a team:
• 1 master, 19 slaves.
When a slave plays with its master, the slave cooperates and the master defects (“My goons give me all their money …”).
=> maximizes the master’s payoff.
When a slave plays with an agent not in its team, it defects (“… and they beat up everyone else”).
=> minimizes the other agent’s payoff.
Comparison

Analysis: each master-slaves team’s average score was much lower than DBS’s.
» If BWIN and IMM01 had each been restricted to ≤ 10 slaves, DBS would have placed 1st.
» Without any slaves, BWIN and IMM01 would have done badly.
In contrast, DBS had no slaves.
» DBS established cooperation with many other agents.
» DBS did this despite the noise, because it filtered out the noise.
Summary

Finitely repeated games:
» backward induction
Infinitely repeated games:
» average reward, future discounted reward
» equilibrium payoffs (the folk theorem)
Non-equilibrium strategies:
» opponent modeling in roshambo
» iterated prisoner’s dilemma with noise
  • opponent models based on observed behavior
  • detection and removal of noise
  • game-tree search against the opponent model
» 20th anniversary IPD competition