“May the Best Man Win!” Simulation optimization for match-making in e-sports
Ilya O. Ryzhov (Robert H. Smith School of Business, University of Maryland, College Park, MD 20742)
Awais Tariq (Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544)
Warren B. Powell (Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544)
INFORMS Annual Meeting November 15, 2011
Outline

1. Introduction
2. TrueSkill™ model for learning skill levels
   - Learning with moment-matching
   - The DrawChance policy
3. Match-making with knowledge gradients
4. Moving on: Targeting and selection
5. Conclusions
Motivation: e-sports
The term “e-sports” refers to competitive multi-player online gaming. Thousands of players simultaneously log on to networks such as Xbox Live or Battle.net.
Motivation: e-sports
Revenues of South Korean game company NCSoft, 2000-2004 (Huhh 2008).
E-sports have become culturally significant and very profitable. Top players from around the world compete professionally. In 2005, Xbox Live had over 2 million subscribers; Battle.net has over 3 million registered players for a single game.
Ranking and competition in e-sports
Game services and outside organizations create rankings of players. Casual players are matched up automatically according to their skill level.
Simulation optimization for match-making
We would like to create fair and challenging games by matching players of similar skill level. The TrueSkill™ system used by Xbox Live views this as a Bayesian learning problem in which we sequentially learn players’ skills. Unlike e.g. multi-armed bandit problems (Gittins 1989), the goal is to match a target rather than to find the most skilled player. We compare a value-of-information procedure to the greedy policy used by Microsoft.
Mathematical model
Player \(i = 0, 1, \ldots, M\) has an underlying skill level \(s_i\), unknown to the game master. Our uncertainty about \(s_i\) is expressed as \(s_i \sim N\left(\mu_i^0, (\sigma_i^0)^2\right)\). The performance of player \(i\) in a game is expressed as \(p_i \sim N\left(s_i, \sigma_\varepsilon^2\right)\). We assume that performances and skill levels are independent.
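As a toy illustration of this generative model, the priors and performances can be simulated directly. This is a sketch only: the numerical values and names (`prior_var`, `var_eps`, `play`) are ours, not from the talk.

```python
import random

random.seed(42)
M = 5                     # opponents 1..M, plus the fixed player 0
prior_var = 25.0          # (sigma_i^0)^2, illustrative value
var_eps = 4.0             # performance noise sigma_eps^2, illustrative value

# true skills s_i, drawn once from the prior
skills = [random.gauss(0.0, prior_var ** 0.5) for _ in range(M + 1)]

def play(i, j):
    """Simulate one game: return True if player i beats player j."""
    p_i = random.gauss(skills[i], var_eps ** 0.5)   # performance of i
    p_j = random.gauss(skills[j], var_eps ** 0.5)   # performance of j
    return p_i > p_j
```

The game master observes only the boolean outcome of `play`, never the performances themselves, which is what makes the learning problem non-conjugate.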
Non-conjugacy of Bayesian model
We say that player \(i\) beats player \(j\) if \(p_i > p_j\) in a game between these players. Unfortunately, we never observe the exact values of \(p_i\) or \(p_j\), only which player won. Thus, the posterior belief
\[ P\left(s_i \in ds \mid p_i > p_j\right) = \frac{P\left(p_i > p_j \mid s_i = s\right) P\left(s_i \in ds\right)}{P\left(p_i > p_j\right)} \]
is non-normal. Conjugacy is forced using moment-matching (Minka 2001): plug the mean and variance of the posterior into a normal distribution.
Moment-matching for approximate conjugacy

Given the beliefs at time \(n\), and the outcome of game \(n+1\) between \(i\) and \(j\), update (Dangauthier et al. 2007):
\[
\mu_i^{n+1} = \begin{cases}
\mu_i^n + \dfrac{(\sigma_i^n)^2}{\bar\sigma_{ij}^n}\, v\!\left(\dfrac{\mu_i^n - \mu_j^n}{\bar\sigma_{ij}^n}\right) & \text{if } p_i^{n+1} > p_j^{n+1}, \\[2ex]
\mu_i^n - \dfrac{(\sigma_i^n)^2}{\bar\sigma_{ij}^n}\, v\!\left(\dfrac{\mu_j^n - \mu_i^n}{\bar\sigma_{ij}^n}\right) & \text{if } p_i^{n+1} < p_j^{n+1},
\end{cases}
\]
\[
\left(\sigma_i^{n+1}\right)^2 = \begin{cases}
(\sigma_i^n)^2 \left[1 - \dfrac{(\sigma_i^n)^2}{(\bar\sigma_{ij}^n)^2}\, w\!\left(\dfrac{\mu_i^n - \mu_j^n}{\bar\sigma_{ij}^n}\right)\right] & \text{if } p_i^{n+1} > p_j^{n+1}, \\[2ex]
(\sigma_i^n)^2 \left[1 - \dfrac{(\sigma_i^n)^2}{(\bar\sigma_{ij}^n)^2}\, w\!\left(\dfrac{\mu_j^n - \mu_i^n}{\bar\sigma_{ij}^n}\right)\right] & \text{if } p_i^{n+1} < p_j^{n+1},
\end{cases}
\]
with \(v(x) = \frac{\phi(x)}{\Phi(x)}\), \(w(x) = v(x)\left(v(x) + x\right)\), and \(\bar\sigma_{ij}^n = \sqrt{(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2}\).

Intuitively: increase our skill estimate for the winning player.
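The update equations above translate almost line for line into code. The sketch below is ours (function and variable names are invented), written from the winner's and loser's point of view via a single sign flip.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def v(x):
    """v(x) = phi(x) / Phi(x), the additive correction factor."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) / Phi(x)

def w(x):
    """w(x) = v(x) * (v(x) + x), the multiplicative variance factor."""
    return v(x) * (v(x) + x)

def moment_match(mu_i, var_i, mu_j, var_j, var_eps, i_wins):
    """Return player i's updated (mean, variance) after a game against j."""
    c = math.sqrt(var_i + var_j + 2.0 * var_eps)   # sigma-bar_ij^n
    sign = 1.0 if i_wins else -1.0
    t = sign * (mu_i - mu_j) / c                   # argument of v and w
    mu_new = mu_i + sign * (var_i / c) * v(t)
    var_new = var_i * (1.0 - (var_i / c ** 2) * w(t))
    return mu_new, var_new
```

As the slide says: a win raises the skill estimate, a loss lowers it, and either outcome shrinks the variance.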
Choosing an opponent
In Dangauthier et al. (2007), a game between \(i\) and \(j\) ends in a draw if \(|p_i - p_j| < \delta\) for some small \(\delta > 0\).
The draw probability

After \(n\) games, our prediction that the \((n+1)\)st game will end in a draw is
\[ P^n\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta\right) = \mathbb{E}^n\left[ P^n\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta \,\middle|\, s_i, s_j\right) \right]. \]
For very small \(\delta\),
\[ P^n\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta \,\middle|\, s_i, s_j\right) \approx \frac{1}{\sqrt{2\pi\left(2\sigma_\varepsilon^2\right)}}\, e^{-\frac{(s_i - s_j)^2}{4\sigma_\varepsilon^2}}\, \delta. \]
We take \(\delta \to 0\) and define the draw probability as
\[ q_{ij}^n = \mathbb{E}^n\left[ \frac{1}{\sqrt{2\pi\left(2\sigma_\varepsilon^2\right)}}\, e^{-\frac{(s_i - s_j)^2}{4\sigma_\varepsilon^2}} \right]. \]
Choosing an opponent
Thus, the “probability” of a draw between players \(i\) and \(j\) is
\[ q_{ij}^n = \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{\sqrt{(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2}}\; e^{-\frac{\left(\mu_i^n - \mu_j^n\right)^2}{2\left[(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2\right]}}. \]
We expect the game to be more competitive when this quantity is higher.
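The closed form translates directly into a one-line function (a sketch with our own naming):

```python
import math

def draw_prob(mu_i, var_i, mu_j, var_j, var_eps):
    """q_ij^n: posterior 'draw probability' between players i and j."""
    s2 = var_i + var_j + 2.0 * var_eps
    return math.exp(-(mu_i - mu_j) ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)
```

For fixed variances, this is maximized when the estimated skills coincide, which is exactly the "most competitive game" intuition.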
Connection to DrawChance
The DrawChance formula given in Herbrich et al. (2006) is
\[ \tilde q_{ij}^n = \sqrt{\frac{2\sigma_\varepsilon^2}{(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2}}\; e^{-\frac{\left(\mu_i^n - \mu_j^n\right)^2}{2\left[(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2\right]}}, \]
which is identical to \(q_{ij}^n\) up to a constant scale factor. The DrawChance policy used by Xbox Live greedily selects the match-up with the highest draw probability.
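The constant-scale-factor claim is easy to check numerically; a short algebra exercise shows the ratio \(\tilde q_{ij}^n / q_{ij}^n\) is \(2\sigma_\varepsilon\sqrt{\pi}\), independent of the beliefs. A sketch (names are ours):

```python
import math

def q(mu_i, var_i, mu_j, var_j, var_eps):
    """Draw probability q_ij^n from the previous slide."""
    s2 = var_i + var_j + 2.0 * var_eps
    return math.exp(-(mu_i - mu_j) ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)

def q_tilde(mu_i, var_i, mu_j, var_j, var_eps):
    """DrawChance formula of Herbrich et al. (2006)."""
    s2 = var_i + var_j + 2.0 * var_eps
    return math.sqrt(2.0 * var_eps / s2) * math.exp(-(mu_i - mu_j) ** 2 / (2.0 * s2))

# the ratio is the same constant for any pair of beliefs
var_eps = 4.0
ratio_1 = q_tilde(0.0, 1.0, 2.0, 3.0, var_eps) / q(0.0, 1.0, 2.0, 3.0, var_eps)
ratio_2 = q_tilde(5.0, 0.5, -1.0, 2.0, var_eps) / q(5.0, 0.5, -1.0, 2.0, var_eps)
```

Since a greedy policy only compares draw probabilities across opponents, the two formulas induce exactly the same match-ups.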
Simulation optimization for match-making
We interpret the match-making problem as online simulation optimization with the objective
\[ \sup_\pi \; \sum_{n=0}^{N} q^n_{0,\, X^\pi(\mu^n, \sigma^n)}, \]
where \(\pi\) is a policy for choosing opponents for a fixed player 0. The concept of value of information (Chick 2006) looks ahead to the outcome of the next decision. This approach can be adapted to many types of objective functions (Frazier et al. 2008; Ryzhov & Powell 2011a).
Prediction of game outcome
Our Bayesian beliefs provide us with an (approximate) estimate of the outcome of a hypothetical game between \(i\) and \(j\):

Proposition. Under the normality assumption, the probability that player \(i\) beats player \(j\) in game \(n+1\) is given by
\[ P^n\left(p_i^{n+1} > p_j^{n+1}\right) = \Phi\!\left( \frac{\mu_i^n - \mu_j^n}{\sqrt{(\sigma_i^n)^2 + (\sigma_j^n)^2 + 2\sigma_\varepsilon^2}} \right). \]
Prediction of game outcome

Proof. We compute
\[ P^n\left(p_i^{n+1} > p_j^{n+1}\right) = \mathbb{E}^n\left[ \Phi\!\left( \frac{s_i - s_j}{\sqrt{2\sigma_\varepsilon^2}} \right) \right] = \int_{-\infty}^{\infty} \Phi\!\left( \frac{x}{\sqrt{2\sigma_\varepsilon^2}} \right) \frac{1}{\sqrt{2\pi\left[(\sigma_i^n)^2 + (\sigma_j^n)^2\right]}}\; e^{-\frac{\left(x - (\mu_i^n - \mu_j^n)\right)^2}{2\left[(\sigma_i^n)^2 + (\sigma_j^n)^2\right]}}\, dx \]
and recast the last line as \(P(X \leq Y)\), where \(X \sim N\left(0, 2\sigma_\varepsilon^2\right)\) and \(Y \sim N\left(\mu_i^n - \mu_j^n, (\sigma_i^n)^2 + (\sigma_j^n)^2\right)\) are independent.
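The proposition is easy to spot-check by Monte Carlo: sample skills from the beliefs, then performances, and count wins. This is a sketch; sample sizes, seeds, and names are arbitrary choices of ours.

```python
import math
import random

def win_prob(mu_i, var_i, mu_j, var_j, var_eps):
    """Closed-form P^n(player i beats player j) from the proposition."""
    z = (mu_i - mu_j) / math.sqrt(var_i + var_j + 2.0 * var_eps)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # Phi(z)

def mc_win_prob(mu_i, var_i, mu_j, var_j, var_eps, trials=100_000, seed=0):
    """Monte Carlo estimate: draw skills from the beliefs, then performances."""
    random.seed(seed)
    wins = 0
    for _ in range(trials):
        s_i = random.gauss(mu_i, var_i ** 0.5)
        s_j = random.gauss(mu_j, var_j ** 0.5)
        p_i = random.gauss(s_i, var_eps ** 0.5)
        p_j = random.gauss(s_j, var_eps ** 0.5)
        wins += p_i > p_j
    return wins / trials
```

With 100,000 trials the two estimates should agree to roughly two decimal places.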
Value of information in match-making

Let
\[ k^{w,n+1} = \left(\mu^{w,n+1}, \sigma^{w,n+1}\right), \qquad k^{l,n+1} = \left(\mu^{l,n+1}, \sigma^{l,n+1}\right) \]
be the beliefs that we would have at time \(n+1\) if player 0 wins (or loses) against \(j\). Similarly, let \(q_{0i}^{w,n+1}\) (or \(q_{0i}^{l,n+1}\)) be the draw probabilities if player 0 wins (or loses).

The greedy policy would arrange the next game by computing
\[ F^{w,n+1} = \max_i\, q_{0i}^{w,n+1}, \qquad F^{l,n+1} = \max_i\, q_{0i}^{l,n+1}, \]
depending on what happens now.
Value of information in match-making

If we stop learning after the next game, the optimal match-up is
\[ X^n = \arg\max_j \; q_{0j}^n + (N - n)\, F_j^n, \]
where
\[ F_j^n = P^n\left(0 \text{ beats } j\right) F^{w,n+1} + P^n\left(j \text{ beats } 0\right) F^{l,n+1} \]
is the expected value (pre-game) of the highest draw probability (post-game). If the total number \(N\) of games is unknown, use
\[ X^n = \arg\max_j \; q_{0j}^n + \frac{\gamma}{1-\gamma}\, F_j^n, \]
where \(\gamma\) is a tunable discount factor.
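Putting the pieces together, the one-step lookahead rule can be sketched end to end. This is our own illustrative implementation, not the authors' code: it recomputes the moment-matching update and draw probability from the earlier slides, then scores each candidate opponent \(j\) by \(q_{0j}^n + (N-n)F_j^n\).

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def v(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) / Phi(x)

def w(x):
    return v(x) * (v(x) + x)

def update(mu_i, var_i, mu_j, var_j, var_eps, i_wins):
    """Moment-matching update of player i's beliefs (earlier slide)."""
    c = math.sqrt(var_i + var_j + 2.0 * var_eps)
    sign = 1.0 if i_wins else -1.0
    t = sign * (mu_i - mu_j) / c
    return mu_i + sign * (var_i / c) * v(t), var_i * (1.0 - (var_i / c ** 2) * w(t))

def q(mu_i, var_i, mu_j, var_j, var_eps):
    """Draw probability q_ij^n."""
    s2 = var_i + var_j + 2.0 * var_eps
    return math.exp(-(mu_i - mu_j) ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)

def kg_opponent(mu, var, var_eps, n, N):
    """Pick opponent j for player 0 maximizing q_0j^n + (N - n) * F_j^n."""
    best_j, best_val = None, -math.inf
    for j in range(1, len(mu)):
        p_win = Phi((mu[0] - mu[j]) / math.sqrt(var[0] + var[j] + 2.0 * var_eps))
        F_j = 0.0
        for zero_wins, prob in ((True, p_win), (False, 1.0 - p_win)):
            # post-game beliefs about player 0 and about opponent j
            mu0, var0 = update(mu[0], var[0], mu[j], var[j], var_eps, zero_wins)
            muj, varj = update(mu[j], var[j], mu[0], var[0], var_eps, not zero_wins)
            # F^{w,n+1} or F^{l,n+1}: best post-game draw probability
            best_q = max(
                q(mu0, var0,
                  muj if i == j else mu[i],
                  varj if i == j else var[i],
                  var_eps)
                for i in range(1, len(mu)))
            F_j += prob * best_q
        val = q(mu[0], var[0], mu[j], var[j], var_eps) + (N - n) * F_j
        if val > best_val:
            best_j, best_val = j, val
    return best_j
```

With one closely matched and one badly mismatched candidate, the policy prefers the close match, as expected.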
Experimental results: draw probabilities

In simulations, our method behaved more aggressively than DrawChance...

Experimental results: difference in true skills

...pursued tougher opponents early on, but found better matches later...

Experimental results: errors of estimates

...produced better estimates of player 0’s true skill...

Experimental results: win/loss ratios

...and came closer to a 0.5 win/loss ratio.
...but that’s not the end!
In simulation optimization, we might tune a simulator to see how the performance of a system could be improved. But before the simulator can be optimized, we need to make sure that it is a good model of reality. Targeting and selection: which simulation model most closely matches data from the field?
Targeting and selection

Let \(c\) be a deterministic target (e.g. average historical performance) and consider \(M\) simulation models. The mean output \(s_i\) of model \(i\) matches the target if \(|s_i - c| < \delta\). We can simulate system \(i\) to obtain a noisy observation \(p_i \sim N\left(s_i, \sigma_\varepsilon^2\right)\), and apply Bayesian updating, with no moment-matching required. The “draw probability” in this context is given by
\[ q_i^n = \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{\sqrt{(\sigma_i^n)^2 + \sigma_\varepsilon^2}}\; e^{-\frac{\left(\mu_i^n - c\right)^2}{2\left[(\sigma_i^n)^2 + \sigma_\varepsilon^2\right]}}. \]
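This version of the draw probability is again a single line of code (a sketch; the name `match_prob` is ours):

```python
import math

def match_prob(mu_i, var_i, var_eps, c):
    """q_i^n: posterior 'draw probability' that model i's output hits target c."""
    s2 = var_i + var_eps
    return math.exp(-(mu_i - c) ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)
```

Models whose estimated mean output is closer to the target score higher, all else equal.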
The value of information

Our goal is to maximize
\[ \sup_\pi \; \mathbb{E}\left[ \max_i q_i^N \right], \]
or its online equivalent, if we are refining an existing simulator. Bayesian analysis tells us that, conditional on our beliefs at time \(n\),
\[ \mu_i^{n+1} \sim N\left( \mu_i^n, (\tilde\sigma_i^n)^2 \right), \qquad \text{where } (\tilde\sigma_i^n)^2 = (\sigma_i^n)^2 - \left(\sigma_i^{n+1}\right)^2. \]
The knowledge gradient approach simulates the system
\[ X^n = \arg\max_i \; \mathbb{E}_i^n\left[ \max_j q_j^{n+1} \right], \]
which is expected to yield the best result after the simulation.
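Although the expectation can be computed in closed form, a Monte Carlo version makes the mechanics transparent: sample \(\mu_i^{n+1}\) from its predictive distribution and average the resulting \(\max_j q_j^{n+1}\). This is a sketch of ours, assuming the standard conjugate normal update for the posterior variance.

```python
import math
import random

def match_prob(mu_i, var_i, var_eps, c):
    s2 = var_i + var_eps
    return math.exp(-(mu_i - c) ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)

def kg_model(mu, var, var_eps, c, samples=5000, seed=1):
    """Monte Carlo knowledge gradient: which model i maximizes E_i^n[max_j q_j^{n+1}]?"""
    random.seed(seed)
    best_i, best_val = None, -math.inf
    for i in range(len(mu)):
        var_next = 1.0 / (1.0 / var[i] + 1.0 / var_eps)    # conjugate normal update
        sigma_tilde = math.sqrt(var[i] - var_next)         # predictive std of mu_i^{n+1}
        # q_j^{n+1} = q_j^n for every model j that is not simulated
        others = max(match_prob(mu[j], var[j], var_eps, c)
                     for j in range(len(mu)) if j != i)
        total = 0.0
        for _ in range(samples):
            mu_next = random.gauss(mu[i], sigma_tilde)     # mu_i^{n+1} ~ N(mu_i^n, sigma_tilde^2)
            total += max(others, match_prob(mu_next, var_next, var_eps, c))
        val = total / samples
        if val > best_val:
            best_i, best_val = i, val
    return best_i
```

The policy favors simulating an uncertain model whose posterior could plausibly overtake the current best, rather than a model that is already known precisely.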
Issues for further work
The quantity \(\mathbb{E}_i^n\left[\max_j q_j^{n+1}\right]\) can be computed in closed form. If \(i\) is believed to be suboptimal with high precision, one simulation may yield no information (see also Ryzhov & Powell 2011b).
Conclusions
We have studied online match-making through the framework of online optimal learning. In simulations, a look-ahead policy offers some improvement over a greedy policy. The formulation of the problem has interesting implications for future work in simulation optimization (Ryzhov 2011).
References

Chick, S.E. (2006) “Subjective probability and Bayesian methodology.” In Handbooks of Operations Research and Management Science 13, 225–258.
Dangauthier, P., Herbrich, R., Minka, T. & Graepel, T. (2007) “TrueSkill through time: revisiting the history of chess.” In Advances in Neural Information Processing Systems 20, 337–344.
Frazier, P.I., Powell, W. & Dayanik, S. (2008) “A knowledge-gradient policy for sequential information collection.” SIAM Journal on Control and Optimization 47:5, 2410–2439.
Gittins, J. (1989) Multi-armed Bandit Allocation Indices. John Wiley and Sons.
Herbrich, R., Minka, T. & Graepel, T. (2006) “TrueSkill™: a Bayesian skill rating system.” In Advances in Neural Information Processing Systems 19, 569–576.
Huhh, J. (2008) “Culture and business of PC bangs in Korea.” Games and Culture 3:1, 26–37.
Minka, T. (2001) “A family of algorithms for approximate Bayesian inference.” Ph.D. thesis, MIT.
Ryzhov, I.O. (2011) “Targeting and selection: a new approach to simulation validation.” In preparation.
Ryzhov, I.O. & Powell, W.B. (2011a) “Information collection on a graph.” Operations Research 59:1, 188–201.
Ryzhov, I.O. & Powell, W.B. (2011b) “The value of information in multi-armed bandits with exponentially distributed rewards.” Proceedings of the 2011 International Conference on Computational Science, 1363–1372.