“May the Best Man Win!” Simulation optimization for match-making in e-sports

Ilya O. Ryzhov¹   Awais Tariq²   Warren B. Powell²

¹ Robert H. Smith School of Business, University of Maryland, College Park, MD 20742
² Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544

INFORMS Annual Meeting November 15, 2011


Outline

1. Introduction
2. TrueSkill™ model for learning skill levels
   - Learning with moment-matching
   - The DrawChance policy
3. Match-making with knowledge gradients
4. Moving on: Targeting and selection
5. Conclusions


Motivation: e-sports

The term “e-sports” refers to competitive multi-player online gaming. Thousands of players simultaneously log on to networks such as Xbox Live or Battle.net.


Motivation: e-sports

[Figure: Revenues of South Korean game company NCSoft, 2000–2004 (Huhh 2008).]

E-sports have become culturally significant and very profitable. Top players from around the world compete professionally. In 2005, Xbox Live had over 2 million subscribers; Battle.net has over 3 million registered players for a single game.

Ranking and competition in e-sports

Game services and outside organizations create rankings of players. Casual players are matched up automatically according to their skill level.

Simulation optimization for match-making

We would like to create fair and challenging games by matching players of similar skill level. The TrueSkill™ system used by Xbox Live views this as a Bayesian learning problem in which we sequentially learn players’ skills. Unlike, e.g., multi-armed bandit problems (Gittins 1989), the goal is to match a target rather than find the most skilled player. We compare a value-of-information procedure to the greedy policy used by Microsoft.



Mathematical model

Player $i = 0, 1, \ldots, M$ has an underlying skill level $s_i$, unknown to the game master. Our uncertainty about $s_i$ is expressed as
$$s_i \sim \mathcal{N}\!\left(\mu_i^0, (\sigma_i^0)^2\right).$$
The performance of player $i$ in a game is expressed as
$$p_i \sim \mathcal{N}\!\left(s_i, \sigma_\varepsilon^2\right).$$
We assume that performances and skill levels are independent.
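As a concrete illustration (not from the talk), here is a minimal Python sketch of this belief model; the noise level and names are illustrative assumptions:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Belief:
    """Normal belief N(mu, sigma^2) about a player's unknown skill s_i."""
    mu: float
    sigma: float

SIGMA_EPS = 1.0  # performance noise std. dev. sigma_eps (illustrative value)

def sample_performance(true_skill: float, rng: np.random.Generator) -> float:
    """Draw one noisy game performance p_i ~ N(s_i, sigma_eps^2)."""
    return rng.normal(true_skill, SIGMA_EPS)
```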


Non-conjugacy of Bayesian model

We say that player $i$ beats player $j$ if $p_i > p_j$ in a game between these players. Unfortunately, we never observe the exact values of $p_i$ or $p_j$, only which player won. Thus, the posterior belief
$$P\left(s_i \in ds \mid p_i > p_j\right) = \frac{P\left(p_i > p_j \mid s_i = s\right)\, P\left(s_i \in ds\right)}{P\left(p_i > p_j\right)}$$
is non-normal. Conjugacy is forced using moment-matching (Minka 2001): plug the mean and variance of the posterior into a normal distribution.


Moment-matching for approximate conjugacy

Given the beliefs at time $n$, and the outcome of game $n+1$ between $i$ and $j$, update (Dangauthier et al. 2007)
$$\mu_i^{n+1} = \begin{cases} \mu_i^n + \dfrac{(\sigma_i^n)^2}{\bar\sigma_{ij}^n}\, v\!\left(\dfrac{\mu_i^n - \mu_j^n}{\bar\sigma_{ij}^n}\right) & \text{if } p_i^{n+1} > p_j^{n+1}, \\[10pt] \mu_i^n - \dfrac{(\sigma_i^n)^2}{\bar\sigma_{ij}^n}\, v\!\left(\dfrac{\mu_j^n - \mu_i^n}{\bar\sigma_{ij}^n}\right) & \text{if } p_i^{n+1} < p_j^{n+1}, \end{cases}$$
$$(\sigma_i^{n+1})^2 = \begin{cases} (\sigma_i^n)^2 \left[1 - \dfrac{(\sigma_i^n)^2}{(\bar\sigma_{ij}^n)^2}\, w\!\left(\dfrac{\mu_i^n - \mu_j^n}{\bar\sigma_{ij}^n}\right)\right] & \text{if } p_i^{n+1} > p_j^{n+1}, \\[10pt] (\sigma_i^n)^2 \left[1 - \dfrac{(\sigma_i^n)^2}{(\bar\sigma_{ij}^n)^2}\, w\!\left(\dfrac{\mu_j^n - \mu_i^n}{\bar\sigma_{ij}^n}\right)\right] & \text{if } p_i^{n+1} < p_j^{n+1}, \end{cases}$$
with $v(x) = \dfrac{\phi(x)}{\Phi(x)}$, $w(x) = v(x)\left(v(x) + x\right)$, and $\bar\sigma_{ij}^n = \sqrt{(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2}$.

Intuitively: increase our skill estimate for the winning player.
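A hedged Python sketch of these win/loss updates (no draws), using the standard forms of $v$ and $w$; the function and variable names are mine, not the authors':

```python
import numpy as np
from scipy.stats import norm

def v(x):
    """v(x) = phi(x) / Phi(x): additive correction applied to the winner's mean."""
    return norm.pdf(x) / norm.cdf(x)

def w(x):
    """w(x) = v(x) * (v(x) + x): multiplicative shrinkage of the variance."""
    return v(x) * (v(x) + x)

def update_after_win(mu_w, sigma_w, mu_l, sigma_l, sigma_eps):
    """Moment-matched update after a game in which the first player beat the second.

    Returns the updated (mu, sigma) pairs for the winner and the loser.
    """
    c = np.sqrt(sigma_w**2 + sigma_l**2 + 2 * sigma_eps**2)   # sigma_bar_ij^n
    t = (mu_w - mu_l) / c
    mu_w_new = mu_w + (sigma_w**2 / c) * v(t)
    mu_l_new = mu_l - (sigma_l**2 / c) * v(t)
    sigma_w_new = np.sqrt(sigma_w**2 * (1 - (sigma_w**2 / c**2) * w(t)))
    sigma_l_new = np.sqrt(sigma_l**2 * (1 - (sigma_l**2 / c**2) * w(t)))
    return (mu_w_new, sigma_w_new), (mu_l_new, sigma_l_new)
```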

Choosing an opponent

In Dangauthier et al. (2007), a game between $i$ and $j$ ends in a draw if $|p_i - p_j| < \delta$ for some small $\delta > 0$.

The draw probability

After $n$ games, our prediction that the $(n+1)$st game will end in a draw is
$$P^n\!\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta\right) = \mathbb{E}^n\!\left[P^n\!\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta \,\middle|\, s_i, s_j\right)\right].$$
For very small $\delta$,
$$P^n\!\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta \,\middle|\, s_i, s_j\right) \approx \frac{1}{\sqrt{2\pi\,(2\sigma_\varepsilon^2)}}\, e^{-\frac{(s_i - s_j)^2}{4\sigma_\varepsilon^2}}\, \delta.$$
We take $\delta \to 0$ and define the draw probability as
$$q_{ij}^n = \mathbb{E}^n\!\left[\frac{1}{\sqrt{2\pi\,(2\sigma_\varepsilon^2)}}\, e^{-\frac{(s_i - s_j)^2}{4\sigma_\varepsilon^2}}\right].$$


Choosing an opponent

Thus, the “probability” of a draw between players $i$ and $j$ is
$$q_{ij}^n = \frac{1}{\sqrt{2\pi\left[(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2\right]}}\, \exp\!\left(-\frac{(\mu_i^n - \mu_j^n)^2}{2\left[(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2\right]}\right).$$
We expect the game to be more competitive when this quantity is higher.
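A small Python sketch of this score, following the same assumed notation as the earlier sketches:

```python
import numpy as np

def draw_probability(mu_i, sigma_i, mu_j, sigma_j, sigma_eps):
    """q_ij^n: Gaussian-density-style score of how competitive a game between i and j should be."""
    var = sigma_i**2 + sigma_j**2 + 2 * sigma_eps**2
    return np.exp(-(mu_i - mu_j)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
```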


Connection to DrawChance

The DrawChance formula given in Herbrich et al. (2006) is
$$\tilde q_{ij}^n = \sqrt{\frac{2\sigma_\varepsilon^2}{(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2}}\, \exp\!\left(-\frac{(\mu_i^n - \mu_j^n)^2}{2\left[(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2\right]}\right),$$
which is identical to $q_{ij}^n$ up to a constant scale factor. The DrawChance policy used by Xbox Live greedily selects the match-up with the highest draw probability.
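With the draw_probability sketch above, a greedy DrawChance-style selection for a fixed player 0 could look like this (the candidate pool and array layout are my own assumptions):

```python
import numpy as np

def drawchance_opponent(mu0, sigma0, mus, sigmas, sigma_eps):
    """Greedy policy: pick the opponent with the highest current draw probability against player 0."""
    scores = [draw_probability(mu0, sigma0, mu_j, sigma_j, sigma_eps)
              for mu_j, sigma_j in zip(mus, sigmas)]
    return int(np.argmax(scores))
```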



Simulation optimization for match-making

We interpret the match-making problem as online simulation optimization with the objective
$$\sup_\pi \sum_{n=0}^{N} q^n_{0,\, X^\pi(\mu^n, \sigma^n)},$$
where $\pi$ is a policy for choosing opponents for a fixed player 0. The concept of value of information (Chick 2006) looks ahead to the outcome of the next decision. This approach can be adapted to many types of objective functions (Frazier et al. 2008; Ryzhov & Powell 2011a).


Prediction of game outcome

Our Bayesian beliefs provide us with an (approximate) estimate of the outcome of a hypothetical game between $i$ and $j$:

Proposition. Under the normality assumption, the probability that player $i$ beats player $j$ in game $n+1$ is given by
$$P^n\!\left(p_i^{n+1} > p_j^{n+1}\right) = \Phi\!\left(\frac{\mu_i^n - \mu_j^n}{\sqrt{(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2}}\right).$$
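A one-line Python version of this proposition, with names following the earlier sketches:

```python
import numpy as np
from scipy.stats import norm

def win_probability(mu_i, sigma_i, mu_j, sigma_j, sigma_eps):
    """P^n(i beats j in game n+1) under the normality assumption."""
    return norm.cdf((mu_i - mu_j) / np.sqrt(sigma_i**2 + sigma_j**2 + 2 * sigma_eps**2))
```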


Prediction of game outcome

Proof. We compute
$$P^n\!\left(p_i^{n+1} > p_j^{n+1}\right) = \mathbb{E}^n\!\left[\Phi\!\left(\frac{s_i - s_j}{\sqrt{2\sigma_\varepsilon^2}}\right)\right] = \int_{-\infty}^{\infty} \Phi\!\left(\frac{x}{\sqrt{2\sigma_\varepsilon^2}}\right) \frac{1}{\sqrt{2\pi\left[(\sigma_i^n)^2 + (\sigma_j^n)^2\right]}}\, e^{-\frac{\left(x - (\mu_i - \mu_j)\right)^2}{2\left[(\sigma_i^n)^2 + (\sigma_j^n)^2\right]}}\, dx,$$
and recast the last line as $P(X \le Y)$, where $X \sim \mathcal{N}\!\left(0, 2\sigma_\varepsilon^2\right)$ and $Y \sim \mathcal{N}\!\left(\mu_i - \mu_j, (\sigma_i^n)^2 + (\sigma_j^n)^2\right)$ are independent.

Value of information in match-making

Let
$$k^{w,n+1} = \left(\mu^{w,n+1}, \sigma^{w,n+1}\right), \qquad k^{l,n+1} = \left(\mu^{l,n+1}, \sigma^{l,n+1}\right)$$
be the beliefs that we would have at time $n+1$ if player 0 wins (or loses) against $j$. Similarly, let $q_{0i}^{w,n+1}$ (or $q_{0i}^{l,n+1}$) be the draw probabilities if player 0 wins (or loses).

The greedy policy would arrange the next game by computing
$$F^{w,n+1} = \max_i q_{0i}^{w,n+1} \qquad \text{or} \qquad F^{l,n+1} = \max_i q_{0i}^{l,n+1},$$
depending on what happens now.


Value of information in match-making

If we stop learning after the next game, the optimal match-up is
$$X^n = \arg\max_j\, q_{0j}^n + (N - n)\, F_j^n,$$
where $F_j^n = P^n(0 \text{ beats } j)\, F^{w,n+1} + P^n(j \text{ beats } 0)\, F^{l,n+1}$ is the expected value (pre-game) of the highest draw probability (post-game). If the total number $N$ of games is unknown, use
$$X^n = \arg\max_j\, q_{0j}^n + \frac{\gamma}{1 - \gamma}\, F_j^n,$$
where $\gamma$ is a tunable discount factor.
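A sketch of this one-step look-ahead policy in Python, reusing draw_probability, win_probability, and update_after_win from the earlier sketches; the candidate pool and looping structure are illustrative assumptions, not the authors' code:

```python
import numpy as np

def lookahead_opponent(mu0, sigma0, mus, sigmas, sigma_eps, games_left):
    """One-step value-of-information match-making policy for player 0.

    Each candidate j is scored by its immediate draw probability plus
    games_left times the expected best draw probability after the game,
    averaging over whether player 0 wins or loses against j.
    """
    best_j, best_score = None, -np.inf
    for j, (mu_j, sigma_j) in enumerate(zip(mus, sigmas)):
        p_win = win_probability(mu0, sigma0, mu_j, sigma_j, sigma_eps)

        # Hypothetical posterior beliefs if player 0 wins / loses against j.
        (mu0_w, sig0_w), (muj_w, sigj_w) = update_after_win(mu0, sigma0, mu_j, sigma_j, sigma_eps)
        (muj_l, sigj_l), (mu0_l, sig0_l) = update_after_win(mu_j, sigma_j, mu0, sigma0, sigma_eps)

        def best_next_q(mu0n, sig0n, muj_n, sigj_n):
            # Highest draw probability available to player 0 in the following game,
            # using the updated belief about j and unchanged beliefs about everyone else.
            return max(draw_probability(mu0n, sig0n,
                                        muj_n if i == j else mus[i],
                                        sigj_n if i == j else sigmas[i],
                                        sigma_eps)
                       for i in range(len(mus)))

        F_j = p_win * best_next_q(mu0_w, sig0_w, muj_w, sigj_w) \
            + (1 - p_win) * best_next_q(mu0_l, sig0_l, muj_l, sigj_l)

        score = draw_probability(mu0, sigma0, mu_j, sigma_j, sigma_eps) + games_left * F_j
        if score > best_score:
            best_j, best_score = j, score
    return best_j
```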

Experimental results: draw probabilities

In simulations, our method behaved more aggressively than DrawChance...

Experimental results: difference in true skills

...pursued tougher opponents early on, but found better matches later...

Experimental results: errors of estimates

...produced better estimates of player 0’s true skill...

Experimental results: win/loss ratios

...and came closer to a 0.5 win/loss ratio.


...but that’s not the end!

In simulation optimization, we might tune a simulator to see how the performance of a system could be improved. But before the simulator can be optimized, we need to make sure that it is a good model of reality. Targeting and selection: which simulation model most closely matches data from the field?

Targeting and selection

Let $c$ be a deterministic target (e.g. average historical performance) and consider $M$ simulation models. The mean output $s_i$ of model $i$ matches the target if $|s_i - c| < \delta$. We can simulate system $i$ to obtain a noisy observation $p_i \sim \mathcal{N}\!\left(s_i, \sigma_\varepsilon^2\right)$ and apply Bayesian updating, with no moment-matching required. The “draw probability” in this context is given by
$$q_i^n = \frac{1}{\sqrt{2\pi\left[(\sigma_i^n)^2 + \sigma_\varepsilon^2\right]}}\, \exp\!\left(-\frac{(\mu_i^n - c)^2}{2\left[(\sigma_i^n)^2 + \sigma_\varepsilon^2\right]}\right).$$
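The analogous score in Python (a sketch; the target c and noise level are inputs, names are mine):

```python
import numpy as np

def target_match_probability(mu_i, sigma_i, c, sigma_eps):
    """q_i^n: density-style score of how closely model i's mean output matches the target c."""
    var = sigma_i**2 + sigma_eps**2
    return np.exp(-(mu_i - c)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
```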


The value of information

Our goal is to maximize
$$\sup_\pi\, \mathbb{E}\!\left[\max_i q_i^N\right],$$
or its online equivalent if we are refining an existing simulator. Bayesian analysis tells us that, conditional on our beliefs at time $n$,
$$\mu_i^{n+1} \sim \mathcal{N}\!\left(\mu_i^n, (\tilde\sigma_i^n)^2\right), \qquad \text{where } (\tilde\sigma_i^n)^2 = (\sigma_i^n)^2 - (\sigma_i^{n+1})^2.$$
The knowledge gradient approach simulates the system
$$X^n = \arg\max_i\, \mathbb{E}_i^n\!\left[\max_j q_j^{n+1}\right],$$
which is expected to yield the best result after the simulation.
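The next slide notes that $\mathbb{E}_i^n[\max_j q_j^{n+1}]$ can be computed in closed form; the sketch below instead approximates it by Monte Carlo over the predictive distribution of $\mu_i^{n+1}$, reusing target_match_probability from above. This is only an illustrative stand-in, not the authors' computation:

```python
import numpy as np

def kg_targeting_choice(mus, sigmas, c, sigma_eps, n_samples=10_000, seed=0):
    """Knowledge-gradient-style choice of which simulation model to run next.

    For each model i, E^n_i[max_j q_j^{n+1}] is estimated by sampling
    mu_i^{n+1} ~ N(mu_i^n, sigma_tilde_i^2) while the other beliefs stay fixed.
    """
    rng = np.random.default_rng(seed)
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    best_i, best_val = None, -np.inf
    for i in range(len(mus)):
        # Conjugate normal update: posterior variance after one more noisy run of model i.
        post_var = 1.0 / (1.0 / sigmas[i]**2 + 1.0 / sigma_eps**2)
        sigma_tilde = np.sqrt(sigmas[i]**2 - post_var)

        # Sampled next-period means for model i; beliefs about the other models do not change.
        mu_i_next = rng.normal(mus[i], sigma_tilde, size=n_samples)
        q_i_next = target_match_probability(mu_i_next, np.sqrt(post_var), c, sigma_eps)
        q_fixed = max((target_match_probability(mus[j], sigmas[j], c, sigma_eps)
                       for j in range(len(mus)) if j != i), default=-np.inf)

        value = np.mean(np.maximum(q_i_next, q_fixed))
        if value > best_val:
            best_i, best_val = i, value
    return best_i
```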

Issues for further work

The quantity $\mathbb{E}_i^n\!\left[\max_j q_j^{n+1}\right]$ can be computed in closed form. If $i$ is believed to be suboptimal with high precision, one simulation may yield no information (see also Ryzhov & Powell 2011b).


Conclusions

We have studied online match-making through the framework of online optimal learning. In simulations, a look-ahead policy offers some improvement over a greedy policy. The formulation of the problem has interesting implications for future work in simulation optimization (Ryzhov 2011).


References

Chick, S.E. (2006) “Subjective probability and Bayesian methodology.” In Handbooks of Operations Research and Management Science 13, 225–258.

Dangauthier, P., Herbrich, R., Minka, T. & Graepel, T. (2007) “TrueSkill through time: revisiting the history of chess.” In Advances in Neural Information Processing Systems 20, 337–344.

Frazier, P.I., Powell, W.B. & Dayanik, S. (2008) “A knowledge-gradient policy for sequential information collection.” SIAM Journal on Control and Optimization 47:5, 2410–2439.

Gittins, J. (1989) Multi-armed Bandit Allocation Indices. John Wiley and Sons.

Herbrich, R., Minka, T. & Graepel, T. (2006) “TrueSkill™: a Bayesian skill rating system.” In Advances in Neural Information Processing Systems 19, 569–576.

Huhh, J. (2008) “Culture and business of PC bangs in Korea.” Games and Culture 3:1, 26–37.

Minka, T. (2001) “A family of algorithms for approximate Bayesian inference.” Ph.D. thesis, MIT.

Ryzhov, I.O. (2011) “Targeting and selection: a new approach to simulation validation.” In preparation.

Ryzhov, I.O. & Powell, W.B. (2011a) “Information collection on a graph.” Operations Research 59:1, 188–201.

Ryzhov, I.O. & Powell, W.B. (2011b) “The value of information in multi-armed bandits with exponentially distributed rewards.” Proceedings of the 2011 International Conference on Computational Science, 1363–1372.