Introduction to Game Theory
7. Repeated Games
Dana Nau, University of Maryland

Repeated Games

•  Used by game theorists, economists, and social and behavioral scientists as highly simplified models of various real-world situations:
   »  Iterated Prisoner's Dilemma
   »  Iterated Chicken Game
   »  Roshambo
   »  Iterated Battle of the Sexes
   »  Repeated Ultimatum Game
   »  Repeated Stag Hunt
   »  Repeated Matching Pennies

Finitely Repeated Games

•  In repeated games, some game G is played multiple times by the same set of agents
•  G is called the stage game
   »  Usually (but not always), G is a normal-form game
•  Each occurrence of G is called an iteration or a round
•  Usually each agent knows what all the agents did in the previous iterations, but not what they're doing in the current iteration
   »  Thus a repeated game is an imperfect-information game with perfect recall
•  Usually each agent's payoff function is additive

Prisoner's Dilemma (the stage game); agent 1 chooses the row, agent 2 chooses the column:

                 C        D
        C      3, 3     0, 5
        D      5, 0     1, 1

Iterated Prisoner's Dilemma, with 2 iterations:

                    Agent 1    Agent 2
   Round 1:            C          C
   Round 2:            D          C
   Total payoff:    3+5 = 8    3+0 = 3
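
To make the additive-payoff convention concrete, here is a minimal Python sketch (helper names are mine, not from the lecture) that scores a finitely repeated game from the agents' move sequences:

```python
# Stage-game payoff matrix (Prisoner's Dilemma), indexed by (agent 1's move, agent 2's move).
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def repeated_payoff(moves1, moves2):
    """Additive payoffs: sum the stage-game payoffs over all rounds."""
    totals = [0, 0]
    for m1, m2 in zip(moves1, moves2):
        p1, p2 = PAYOFF[(m1, m2)]
        totals[0] += p1
        totals[1] += p2
    return totals

# The 2-iteration example above: agent 1 plays C then D, agent 2 plays C twice.
print(repeated_payoff('CD', 'CC'))   # [8, 3]
```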

Strategies

•  The repeated game has a much bigger strategy space than the stage game
•  One kind of strategy is a stationary strategy: use the same stage-game strategy at every iteration
•  More generally, an agent's play at each stage may depend on the history, i.e., what happened in previous iterations

[Figure on slide: game tree of the Iterated Prisoner's Dilemma with 2 iterations]

Backward Induction

•  If the number of iterations is finite and known, we can use backward induction to get a subgame-perfect equilibrium
•  Example: finitely many repetitions of the Prisoner's Dilemma
   »  In the last round, the dominant strategy is D
   »  That's common knowledge
   »  So in the 2nd-to-last round, D also is the dominant strategy
   »  …
   »  The SPE is (D,D) on every round:

                    Agent 1    Agent 2
      Round 1:         D          D
      Round 2:         D          D
      Round 3:         D          D
      Round 4:         D          D
         …             …          …

•  As with the Centipede game, this argument is vulnerable to both empirical and theoretical criticisms
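
Here is a small sketch of the unraveling argument (my own code, and a deliberate shortcut: it uses the fact that in the SPE the continuation payoff is a history-independent constant, rather than searching the full game tree):

```python
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def dominant_action(extra):
    """Agent 1's dominant action in a round whose continuation payoff is the
    fixed constant `extra` (adding a constant never changes dominance)."""
    def u(me, you):
        return PAYOFF[(me, you)][0] + extra
    for a, b in (('D', 'C'), ('C', 'D')):
        if all(u(a, you) >= u(b, you) for you in 'CD'):
            return a
    return None                       # no dominant action exists

def spe_finitely_repeated_pd(num_rounds):
    """Unravel from the last round: play each round's dominant action,
    given the already-determined continuation."""
    plan, extra = [], 0
    for _ in range(num_rounds):       # working backward from the last round
        a = dominant_action(extra)    # D in every round
        plan.append((a, a))           # the game is symmetric, so both agents defect
        extra += PAYOFF[(a, a)][0]
    return list(reversed(plan))

print(spe_finitely_repeated_pd(4))
# [('D', 'D'), ('D', 'D'), ('D', 'D'), ('D', 'D')]
```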

Backward Induction when G is 0-sum

•  As before, backward induction works much better in zero-sum games
•  In the last round, the equilibrium is the minimax profile
   »  Each agent uses his/her minimax strategy
•  That's common knowledge
•  So in the 2nd-to-last round, the equilibrium again is the minimax profile
•  …
•  The SPE is the minimax profile on every round

Infinitely Repeated Games

•  An infinitely repeated game in extensive form would be an infinite tree
   »  Payoffs can't be attached to any terminal nodes
   »  Payoffs can't be the sums of the payoffs in the stage games (generally infinite)
•  Two common ways around this problem. Let r_i^(1), r_i^(2), … be an infinite sequence of payoffs for agent i:
   »  Agent i's average reward is
         lim_{k→∞} (1/k) ∑_{j=1}^{k} r_i^(j)
   »  Agent i's future discounted reward is the discounted sum of the payoffs, i.e.,
         ∑_{j=1}^{∞} β^j r_i^(j)
      where β (with 0 ≤ β ≤ 1) is a constant called the discount factor
•  Two ways to interpret the discount factor:
   1.  The agent cares more about the present than the future
   2.  The agent cares about the future, but the game ends at any round with probability 1 − β
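
A small sketch (my own helper names) of the two criteria, applied to a long but finite prefix of a payoff stream:

```python
def average_reward(payoffs):
    """Approximates lim_{k→∞} (1/k) ∑_{j=1..k} r_i^(j) using a finite prefix."""
    return sum(payoffs) / len(payoffs)

def discounted_reward(payoffs, beta):
    """∑_{j=1..∞} β^j · r_i^(j), truncated to the given prefix."""
    return sum(beta ** j * r for j, r in enumerate(payoffs, start=1))

stream = [3] * 1000                       # e.g., mutual cooperation in the IPD, forever
print(average_reward(stream))             # 3.0
print(discounted_reward(stream, 0.9))     # ≈ 27  (= 0.9·3 / (1 − 0.9))
```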

Example   Some well-known strategies for the Iterated Prisoner’s Dilemma:

»  AllC: always cooperate »  AllD (the Hawk strategy):

TFT or Grim AllD

TFT Tester

C

C

C

D

C

D

C

C

D

D

D

C

C

C

D

D

C

C

C

C

D

D

C

C

C

C

D

D

C

C

C

C

D

D

C

C

C

C

D

D

C

C

...

...









always defect »  Grim: cooperate until the other agent defects, then defect forever »  Tit-for-Tat (TFT): cooperate on the first move. On the nth move, repeat the other agent (n–1)th move »  Tester: defect on move 1. If the other agent retaliates, play TFT. Otherwise, randomly intersperse cooperation and defection

AllC, AllC, Grim, Grim, or TFT or TFT

  If the discount factor is large enough, each of the following is a Nash equilibrium   (TFT, TFT), (TFT,GRIM), and (GRIM,GRIM) Nau: Game Theory 8

Equilibrium Payoffs for Repeated Games

•  There's a "folk theorem" that tells what the possible equilibrium payoffs are in repeated games
•  It says roughly the following:
   »  In an infinitely repeated game whose stage game is G, there is a Nash equilibrium whose average payoffs are (p1, p2, …, pn) if and only if
      •  (p1, p2, …, pn) is feasible: G has a mixed-strategy profile (s1, s2, …, sn), or a cycle of outcomes, whose payoffs are (p1, p2, …, pn), and
      •  (p1, p2, …, pn) is enforceable: for each i, pi ≥ the payoff agent i could get if the other agents played their minimax strategies against i

Proof and Examples

•  The proof proceeds in 2 parts:
   »  Use the definitions of minimax and best response to show that in every equilibrium, an agent's average payoff is ≥ the agent's minimax value
   »  Show how to construct an equilibrium that gives each agent i the average payoff pi, given certain constraints on (p1, p2, …, pn)
      •  In this equilibrium, the agents cycle in lock-step through a sequence of game outcomes that achieve (p1, p2, …, pn)
      •  If any agent i deviates, then the others punish i forever, by playing their minimax strategies against i
•  There's a large family of such theorems, for various conditions on the game

•  Example 1: IPD with (p1, p2) = (3, 3)
   »  The agents cooperate on every round (e.g., both play Grim); if either agent ever defects, the other defects forever
•  Example 2: IPD with (p1, p2) = (2.5, 2.5)
   »  The agents alternate in lock-step between the outcomes (C,D) and (D,C), whose payoffs average to (2.5, 2.5); again, any deviation is punished by defecting forever
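
A quick numerical check (my own code, not from the slides) that the cycle in Example 2 is both feasible and enforceable: its average payoffs are (2.5, 2.5), and both exceed the Prisoner's Dilemma minimax value of 1 (the payoff an agent can guarantee when the other agent defects):

```python
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

cycle = [('C', 'D'), ('D', 'C')]        # outcomes repeated in lock-step
avg = [sum(PAYOFF[o][i] for o in cycle) / len(cycle) for i in (0, 1)]

minimax_value = 1                       # punishers play D; the best reply earns 1 per round
print(avg, all(p >= minimax_value for p in avg))   # [2.5, 2.5] True
```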

Zero-Sum Repeated Games

•  For two-player zero-sum repeated games, the folk theorem is still true, but it becomes vacuous
•  Suppose we iterate a two-player zero-sum game G, and let V be the value of G (from the Minimax Theorem)
   »  If agent 2 uses a minimax strategy against 1, then 1's maximum payoff is V
      •  Thus the max value for p1 is V, so the min value for p2 is –V
   »  If agent 1 uses a minimax strategy against 2, then 2's maximum payoff is –V
      •  Thus the max value for p2 is –V, so the min value for p1 is V
•  Thus in the iterated game, the only Nash-equilibrium payoff profile is (V, –V)
•  The only way to get this is for each agent to always play his/her minimax strategy
   »  If agent 1 plays a non-minimax strategy s1 and agent 2 plays his/her best response, 2's expected payoff will be higher than –V

Roshambo (Rock, Paper, Scissors)

Payoff matrix (A1 chooses the row, A2 chooses the column):

                   Rock      Paper     Scissors
     Rock          0, 0      –1, 1      1, –1
     Paper         1, –1      0, 0     –1, 1
     Scissors     –1, 1       1, –1     0, 0

•  Nash equilibrium for the stage game: choose randomly, P = 1/3 for each move
•  Nash equilibrium for the repeated game: always choose randomly, P = 1/3 for each move
   »  Expected payoff = 0
•  Let's see how that works out in practice …
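
A small sketch verifying the equilibrium claim at the stage-game level: against the uniform mixed strategy (P = 1/3 for each move), every pure move has expected payoff 0, so no deviation helps:

```python
# Agent 1's payoffs in Roshambo, indexed by (agent 1's move, agent 2's move).
PAYOFF_A1 = {('R', 'R'):  0, ('R', 'P'): -1, ('R', 'S'):  1,
             ('P', 'R'):  1, ('P', 'P'):  0, ('P', 'S'): -1,
             ('S', 'R'): -1, ('S', 'P'):  1, ('S', 'S'):  0}

for my_move in 'RPS':
    expected = sum(PAYOFF_A1[(my_move, their_move)] / 3 for their_move in 'RPS')
    print(my_move, expected)   # each move has expected payoff 0.0
```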

Roshambo (Rock, Paper, Scissors)

•  (Same payoff matrix as on the previous slide)
•  1999 international roshambo programming competition
   »  www.cs.ualberta.ca/~darse/rsbpc1.html
•  Round-robin tournament:
   »  55 programs, 1000 iterations for each pair of programs
   »  Lowest possible score = –55000, highest possible score = 55000
•  Average over 25 tournaments:
   »  Highest score (Iocaine Powder): 13038
   »  Lowest score (Cheesebot): –36006
•  Very different from the game-theoretic prediction

•  A Nash equilibrium strategy is best for you if the other agents also use their Nash equilibrium strategies
•  In many cases, the other agents won't use Nash equilibrium strategies
   »  If you can forecast their actions accurately, you may be able to do much better than the Nash equilibrium strategy
•  Why won't the other agents use their Nash equilibrium strategies?
   »  Because they may be trying to forecast your actions too
•  Something analogous can happen in non-zero-sum games

Iterated Prisoner's Dilemma

•  Multiple iterations of the Prisoner's Dilemma (P1 chooses the row, P2 the column; (Defect, Defect) is the stage game's Nash equilibrium):

                    Cooperate    Defect
     Cooperate        3, 3        0, 5
     Defect           5, 0        1, 1

•  Widely used to study the emergence of cooperative behavior among agents
   »  e.g., Axelrod (1984), The Evolution of Cooperation
•  Axelrod ran a famous set of tournaments
   »  People contributed strategies encoded as computer programs
   »  Axelrod played them against each other
•  Intuition for cooperating: "If I defect now, he might punish me by defecting next time"

TFT with Other Agents

•  In Axelrod's tournaments, TFT usually did best
   »  It could establish and maintain cooperation with many other agents
   »  It could prevent malicious agents from taking advantage of it

[Tables on slide: example move sequences for TFT vs. AllC, TFT vs. AllD, TFT vs. Grim, TFT vs. TFT, and TFT vs. Tester]
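
To see this kind of comparison in miniature, here is a self-contained Axelrod-style round-robin sketch (my own code; each distinct pair plays once, and self-play is omitted for simplicity):

```python
from itertools import combinations

PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def all_c(mine, theirs): return 'C'
def all_d(mine, theirs): return 'D'
def grim(mine, theirs):  return 'D' if 'D' in theirs else 'C'
def tft(mine, theirs):   return theirs[-1] if theirs else 'C'

def play(s1, s2, rounds):
    h1, h2, tot = [], [], [0, 0]
    for _ in range(rounds):
        m1, m2 = s1(h1, h2), s2(h2, h1)
        h1.append(m1); h2.append(m2)
        p1, p2 = PAYOFF[(m1, m2)]
        tot[0] += p1; tot[1] += p2
    return tot

def tournament(entrants, rounds=200):
    scores = {name: 0 for name in entrants}
    for (n1, s1), (n2, s2) in combinations(entrants.items(), 2):
        p1, p2 = play(s1, s2, rounds)
        scores[n1] += p1; scores[n2] += p2
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Prints each strategy's total score across all pairings in this tiny field.
print(tournament({'AllC': all_c, 'AllD': all_d, 'Grim': grim, 'TFT': tft}))
```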

Example

•  A real-world example of the IPD, described in Axelrod's book: World War I trench warfare
•  Incentive to cooperate:
   »  If I attack the other side, then they'll retaliate and I'll get hurt
   »  If I don't attack, maybe they won't either
•  Result: evolution of cooperation
   »  Although the two infantries were supposed to be enemies, they avoided attacking each other

IPD with Noise

•  In noisy environments, there's a nonzero probability (e.g., 10%) that a "noise gremlin" will change some of the actions
   »  Cooperate (C) becomes Defect (D), and vice versa
•  Can use this to model accidents
   »  Compute the score using the changed action
•  Can also model misinterpretations
   »  Compute the score using the original action
   »  "Did he really intend to do that?"
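
A minimal sketch of the noise gremlin (function and parameter names are mine): with some probability the transmitted action differs from the intended one. Under the accident model the changed action is also the one that gets scored; under the misinterpretation model the original action is scored, but the other agent still observes the flipped one.

```python
import random

def transmit(intended_move, noise_level=0.10):
    """Return the action the other agent observes; flip C <-> D with probability noise_level."""
    if random.random() < noise_level:
        return 'D' if intended_move == 'C' else 'C'
    return intended_move

print([transmit('C') for _ in range(10)])
# mostly 'C'; each entry is independently flipped to 'D' with probability 0.1
```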

Example of Noise

•  Story from a British army officer in World War I:
   »  "I was having tea with A Company when we heard a lot of shouting and went out to investigate. We found our men and the Germans standing on their respective parapets. Suddenly a salvo arrived but did no damage. Naturally both sides got down and our men started swearing at the Germans, when all at once a brave German got onto his parapet and shouted out: 'We are very sorry about that; we hope no one was hurt. It is not our fault. It is that damned Prussian artillery.'"
•  The salvo wasn't the German infantry's intention
   »  They neither expected nor desired it

Noise Makes it Difficult to Maintain Cooperation

•  Consider two agents who both use TFT
•  One accident or misinterpretation can cause a long string of retaliations:
   »  Noise flips one of agent 1's cooperations into a defection; agent 2 retaliates on the next move; agent 1 then retaliates against that, and the defections echo back and forth indefinitely
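
A small simulation of that echo effect (my own code): two TFT agents, one noise flip in round 3, and the resulting alternating retaliations.

```python
def tit_for_tat(theirs):
    return theirs[-1] if theirs else 'C'

seen_by_2, seen_by_1 = [], []        # what each agent observes the other doing
for rnd in range(10):
    m1 = tit_for_tat(seen_by_1)
    m2 = tit_for_tat(seen_by_2)
    if rnd == 2:                     # noise flips agent 1's cooperation into a defection
        m1 = 'D'
    seen_by_2.append(m1)
    seen_by_1.append(m2)

print(''.join(seen_by_2))   # CCDCDCDCDC  (agent 1's moves, as seen by agent 2)
print(''.join(seen_by_1))   # CCCDCDCDCD  (agent 2's moves, as seen by agent 1)
```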

Some Strategies for the Noisy IPD

•  Principle: be more forgiving in the face of defections
•  Tit-For-Two-Tats (TFTT)
   »  Retaliate only if the other agent defects twice in a row
   »  Can tolerate isolated defections, but susceptible to exploitation of its generosity
   »  Beaten by the Tester strategy described earlier
•  Generous Tit-For-Tat (GTFT)
   »  Forgive randomly: cooperate with a small probability even if the other agent defects
   »  Better than TFTT at avoiding exploitation, but worse at maintaining cooperation
•  Pavlov (Win-Stay, Lose-Shift)
   »  Repeat my previous move if I earned 3 or 5 points in the previous iteration
   »  Reverse my previous move if I earned 0 or 1 points in the previous iteration
   »  Thus if the other agent defects continuously, Pavlov will alternately cooperate and defect
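
Sketches of the three strategies in the same (my_history, their_history) → move style used earlier (the generosity probability and payoff thresholds are illustrative assumptions):

```python
import random

def tftt(mine, theirs):
    """Tit-For-Two-Tats: retaliate only after two defections in a row."""
    return 'D' if theirs[-2:] == ['D', 'D'] else 'C'

def gtft(mine, theirs, generosity=0.1):
    """Generous TFT: like TFT, but forgive a defection with a small probability."""
    if not theirs or theirs[-1] == 'C':
        return 'C'
    return 'C' if random.random() < generosity else 'D'

MY_PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def pavlov(mine, theirs):
    """Win-Stay, Lose-Shift: repeat my last move after earning 3 or 5 points, reverse it after 0 or 1."""
    if not mine:
        return 'C'
    won = MY_PAYOFF[(mine[-1], theirs[-1])] >= 3
    return mine[-1] if won else ('D' if mine[-1] == 'C' else 'C')
```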

Discussion

•  The British army officer's story:
   »  A German shouted, "We are very sorry about that; we hope no one was hurt. It is not our fault. It is that damned Prussian artillery."
   »  The apology avoided a conflict
   »  It was convincing because it was consistent with the German infantry's past behavior
      •  The British had ample evidence that the German infantry wanted to keep the peace
•  If you can tell which actions are affected by noise, you can avoid reacting to the noise
•  IPD agents often behave deterministically
   »  For others to cooperate with you, it helps if you're predictable
   »  This makes it feasible to build a model of an agent from its observed behavior

The DBS Agent

•  Work by my recent PhD graduate, Tsz-Chiu Au
   »  Now a postdoc at the University of Texas
•  From the other agent's recent behavior, build a model π of the other agent's strategy
•  Use the model to filter noise
•  Use the model to help plan our next move

References:
   »  Au & Nau. Accident or intention: That is the question (in the iterated prisoner's dilemma). AAMAS, 2006.
   »  Au & Nau. Is it accidental or intentional? A symbolic approach to the noisy iterated prisoner's dilemma. In G. Kendall (ed.), The Iterated Prisoners Dilemma: 20 Years On. World Scientific, 2007.

Modeling the other agent

•  A set of rules of the following form:
   »  if our last move was m and their last move was m′, then P[their next move will be C] = p
•  Four rules: one for each of (C,C), (C,D), (D,C), and (D,D)
•  For example, TFT can be described as
   »  (C,C) → 1,   (C,D) → 1,   (D,C) → 0,   (D,D) → 0
•  How to get the probabilities?
   »  One way: look at the agent's behavior in the recent past
   »  During the last k iterations, what fraction of the time did the other agent cooperate at iteration j when the agents' moves were (x,y) at iteration j–1?
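
A sketch (my own code) of one way to estimate the four rule probabilities from the last k iterations of observed play:

```python
from collections import defaultdict

def estimate_pi(my_moves, their_moves, k=20):
    """P[they cooperate next | (my last move, their last move)], over a sliding window."""
    counts = defaultdict(lambda: [0, 0])        # state -> [times they then played C, total]
    start = max(1, len(their_moves) - k)
    for j in range(start, len(their_moves)):
        state = (my_moves[j - 1], their_moves[j - 1])
        counts[state][1] += 1
        if their_moves[j] == 'C':
            counts[state][0] += 1
    return {state: c / n for state, (c, n) in counts.items()}

# Example: modeling a TFT opponent from a short history.
print(estimate_pi(list('CCDCC'), list('CCCDC')))
# {('C', 'C'): 1.0, ('D', 'C'): 0.0, ('C', 'D'): 1.0}  -- consistent with TFT
```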

Modeling the other agent (continued)

•  π can only model a very small set of strategies
•  It doesn't even model the Grim strategy correctly
   »  If Grim defects, it may be defecting because of something that happened many moves ago
•  But we're not trying to model an agent's entire strategy, just its recent behavior
•  If an agent's behavior changes, then the probabilities in π will change
   »  e.g., after Grim defects a few times, the rules will give a very low probability of it cooperating again

Noise Filtering

•  Suppose the applicable rule is deterministic
   »  P[their next move will be C] = 0 or 1
•  If the other agent's next move isn't what the rule predicts, then
   »  Assume the observed action is noise
   »  Behave as if the action were what the rule predicted
•  Intuition: "The other agent cooperates when I do, so I think these isolated defections are actually noise; I won't retaliate here."
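
A sketch of the filtering rule (my own function names): when the applicable rule is deterministic and the observation contradicts it, record the predicted move instead of the observed one.

```python
def filter_move(pi, my_last, their_last, observed):
    """pi maps (my last move, their last move) -> P[their next move is C]."""
    p_c = pi.get((my_last, their_last))
    if p_c == 1.0 and observed == 'D':
        return 'C'        # the rule says they always cooperate here: assume noise
    if p_c == 0.0 and observed == 'C':
        return 'D'        # the rule says they always defect here: assume noise
    return observed       # the rule isn't deterministic: take the observation at face value

pi_tft = {('C', 'C'): 1.0, ('C', 'D'): 1.0, ('D', 'C'): 0.0, ('D', 'D'): 0.0}
print(filter_move(pi_tft, 'C', 'C', 'D'))   # 'C' -- the isolated defection is filtered out
```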

Change of Behavior

•  Anomalies in observed behavior can be due either to noise or to a genuine change of behavior
•  Changes of behavior occur because the other agent can change its strategy at any time
   »  E.g., the other agent may be playing Grim: "If you ever betray me, I will never forgive you"
   »  If noise affects one of Agent 1's actions, this may trigger a change in Agent 2's behavior
      •  Agent 1 does not know this, so the resulting defections look anomalous to Agent 1, but they are not noise
•  How can we distinguish noise from a real change of behavior?

Detection of a Change of Behavior

•  Temporary tolerance: when we observe unexpected behavior from the other agent,
   »  Don't immediately decide whether it's noise or a real change of behavior
   »  Instead, defer judgment for a few iterations
   »  If the anomaly persists, then recompute π based on the other agent's recent behavior
•  Intuition: "The defections might be accidents, so I shouldn't lose my temper too soon. But if they keep coming, the other agent's behavior has really changed, so I'll change mine too."
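
A sketch of temporary tolerance (the class name and the threshold of 3 are my own choices, not DBS's actual parameters): contradictions of a deterministic rule are tolerated for a few iterations, and only a persistent anomaly triggers recomputing π.

```python
class ToleranceFilter:
    def __init__(self, tolerance=3):
        self.tolerance = tolerance
        self.anomalies = 0              # consecutive contradictions of the model

    def update(self, predicted, observed):
        """Return (move to act on, whether the model pi should be recomputed)."""
        if predicted is not None and observed != predicted:
            self.anomalies += 1
            if self.anomalies >= self.tolerance:
                self.anomalies = 0
                return observed, True   # persistent anomaly: accept it and recompute pi
            return predicted, False     # defer judgment: treat it as noise for now
        self.anomalies = 0
        return observed, False
```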

Move generation

•  Modified version of game-tree search
•  Use the policy π to predict the probabilities of the other agent's moves
•  Compute the expected utility of move x as
      u1(x) = ∑_{y ∈ {C,D}} u1(x, y) × P(y | π, previous moves)
   where x = my move and y = the other agent's move
•  Choose the move with the highest expected utility

[Figure on slide: a game tree rooted at the current iteration, branching on the joint outcomes (C,C), (C,D), (D,C), (D,D) into the next iteration, the iteration after next, and so on]

Example

•  Suppose we have the rules
   1.  (C,C) → 0.7
   2.  (C,D) → 0.4
   3.  (D,C) → 0.1
   4.  (D,D) → 0.1
•  Suppose the moves in the most recent iteration were (C,C), so rule 1 applies: it predicts P(C) = 0.7, P(D) = 0.3
•  Suppose we search to depth 1:
      u1(C) = 0.7 u1(C,C) + 0.3 u1(C,D) = 2.1 + 0 = 2.1
      u1(D) = 0.7 u1(D,C) + 0.3 u1(D,D) = 3.5 + 0.3 = 3.8
   »  So D looks better
•  Is D really what we should choose?
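
Reproducing the depth-1 numbers (my own helper; u1 is agent 1's stage-game payoff in the Prisoner's Dilemma used throughout these slides):

```python
U1 = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def depth1_value(x, p_c):
    """u1(x) = Σ_y u1(x, y) · P(y), with P(C) = p_c and P(D) = 1 − p_c."""
    return U1[(x, 'C')] * p_c + U1[(x, 'D')] * (1 - p_c)

print(round(depth1_value('C', 0.7), 2))   # 2.1
print(round(depth1_value('D', 0.7), 2))   # 3.8
```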

Example (continued)

•  With the same rules, it's not wise to choose D
   »  On the move after that, the opponent will retaliate (defect) with P = 0.9
   »  The depth-1 search didn't see this
•  But if we search to depth d > 1, we'll see it
   »  C will look better, and we'll choose it instead
•  In general, it's best to look far ahead
   »  e.g., 60 moves

How to Search Deeper

•  Game trees grow exponentially with search depth
   »  How can we search the tree deeply?
•  Key assumption: π accurately models the other agent's future behavior
•  Then we can use dynamic programming
   »  Makes the search polynomial in the search depth
   »  Can easily search to depth 60
   »  Equivalent to solving an acyclic MDP of depth 60
•  This generates fairly good moves
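
A sketch of the dynamic-programming lookahead (all names are mine, not DBS's actual code). The key point: if π is assumed to describe the other agent's future behavior too, the only state that matters is the previous joint move, so a depth-d lookahead costs O(4d) work instead of an exponentially large tree:

```python
U1 = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
STATES = [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]

def lookahead_value(pi, depth):
    """V[state] = agent 1's best expected payoff over the next `depth` rounds,
    where `state` is the joint move of the previous round."""
    V = {s: 0.0 for s in STATES}
    for _ in range(depth):                      # backward over the remaining rounds
        newV = {}
        for s in STATES:
            p_c = pi[s]                         # P[they cooperate | previous joint move]
            newV[s] = max(
                sum(prob * (U1[(x, y)] + V[(x, y)])
                    for y, prob in (('C', p_c), ('D', 1 - p_c)))
                for x in 'CD')
        V = newV
    return V

def best_move(pi, prev_state, depth=60):
    V = lookahead_value(pi, depth - 1)
    p_c = pi[prev_state]
    def q(x):
        return sum(prob * (U1[(x, y)] + V[(x, y)])
                   for y, prob in (('C', p_c), ('D', 1 - p_c)))
    return max('CD', key=q)

pi_example = {('C', 'C'): 0.7, ('C', 'D'): 0.4, ('D', 'C'): 0.1, ('D', 'D'): 0.1}
print(best_move(pi_example, ('C', 'C'), depth=1))    # 'D' -- the shortsighted choice
print(best_move(pi_example, ('C', 'C'), depth=60))   # 'C' -- defecting would trigger retaliation
```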

20th Anniversary IPD Competition

•  http://www.prisoners-dilemma.com
•  Category 2: IPD with noise
   »  165 programs participated
•  DBS dominated the top 10 places
•  Two agents scored higher than DBS
   »  They both used master-and-slaves strategies

Master & Slaves Strategy

•  Each participant could submit up to 20 programs
•  Some submitted programs that could recognize each other
   »  by communicating pre-arranged sequences of Cs and Ds
•  The 20 programs worked as a team: 1 master, 19 slaves
•  When a slave plays with its master
   »  The slave cooperates and the master defects
   »  ⇒ maximizes the master's payoff ("My goons give me all their money …")
•  When a slave plays with an agent not on its team
   »  It defects
   »  ⇒ minimizes the other agent's payoff ("… and they beat up everyone else")

Comparison

•  Analysis:
   »  Each master-and-slaves team's average score was much lower than DBS's
   »  If BWIN and IMM01 had each been restricted to ≤ 10 slaves, DBS would have placed 1st
   »  Without any slaves, BWIN and IMM01 would have done badly
•  In contrast, DBS had no slaves
   »  DBS established cooperation with many other agents
   »  DBS did this despite the noise, because it filtered out the noise

Summary

•  Finitely repeated games: backward induction
•  Infinitely repeated games
   »  average reward, future discounted reward
   »  equilibrium payoffs (the folk theorem)
•  Non-equilibrium strategies
   »  opponent modeling in roshambo
   »  iterated prisoner's dilemma with noise
      •  opponent models based on observed behavior
      •  detection and removal of noise
      •  game-tree search against the opponent model
   »  20th anniversary IPD competition