Prediction of Tennis Results

Simon Mathis
Supervisor: Sandor Szedmak
Intelligent and Interactive Systems, Institut für Informatik, University of Innsbruck
May 24, 2012


Abstract

In machine learning the prediction of sports results is very popular because of the huge amount of data collected by most sports associations, like the NBA (basketball), MLB (baseball), ATP (tennis), and many more. Most people are familiar with these sports and can imagine the influence of different factors on the result of the prediction; there are also betting offices which make a "prediction" themselves by setting odds. The real task in sports betting is not only to predict who, or which team, will win, but with what chance this event will occur. As the Support Vector Machine (SVM) can only predict binary results ({+1, -1}), we use the so-called maximum margin regression (MMR) [12] to predict a probability of winning, and try to win against the betting companies. I give a brief description of Support Vector Machines [5] and kernel based methods, as well as a description of the MMR in relation to the SVM. There is also a description of the features used in the prediction and how they are computed. The data source, and how betting companies compute their odds and make their profit, are included as well.


Contents

1 Introduction
  1.1 Applications
    1.1.1 Matchmaking in online gaming
    1.1.2 Financial game
2 Data
  2.1 Data source
  2.2 Interpretation
3 Betting and Odds
4 Methods
  4.1 Kernel Based Methods
    4.1.1 Kernel
    4.1.2 Characterisation of Kernels
    4.1.3 Support Vector Machines
    4.1.4 Maximum margin classification
    4.1.5 Soft margin classification
5 Reinterpretation of the normal vector w
  5.1 Dual problem
6 Prediction
7 Feature Vectors
  7.1 Relation features
8 The base problem, formal description
9 Optimization problem
  9.1 The base problem
  9.2 The general schema of conditional gradient method
    9.2.1 Casting the concrete problem into the general schema
  9.3 Solution of the linear subproblem
10 Implementation
  10.1 Parameter validation and testing procedure
  10.2 Programme flow
11 Results
  11.1 Measuring the accuracy of the binary prediction
  11.2 Measuring the prediction of the odds
12 Conclusion and Discussion
13 Appendix

List of Figures

1 Sample of the provided data
2 Computation of the betting fee
3 Normal distribution (image source: http://en.wikipedia.org/wiki/File:Standard_deviation_diagram.svg)
4 Parameter validation and testing procedure
5 Programme flow
6 Accuracy of the prediction with 3 month calculation period
7 Accuracy of the prediction with 6 month calculation period
8 Money earned in betting without considering the fee of the betting vendors for 3 month calculation period
9 Money earned in betting with considering the fee of the betting vendors for 3 month calculation period
10 Money earned in betting without considering the fee of the betting vendors for 6 month calculation period
11 Money earned in betting with considering the fee of the betting vendors for 6 month calculation period
12 Evaluation of betting odds
13 Table showing how betting companies would gain against each other

List of Tables

1 Receiver operator characteristics
2 Notation used in the paper

1 Introduction

In machine learning the prediction of sports results is very popular because of the huge amount of data collected by most sports associations, like the NBA (basketball), MLB (baseball), ATP (tennis), and many more. Most people are familiar with these sports and can imagine the influence of different factors on the result of the prediction; there are also betting offices which make a "prediction" themselves by setting odds. The real task in sports betting is not only to predict who, or which team, will win, but with what chance this event will occur. As the Support Vector Machine (SVM) can only predict binary results ({+1, -1}), we use the so-called maximum margin regression (MMR) [12] to predict a probability of winning, and try to win against the betting companies. I give a brief description of Support Vector Machines [5] and kernel based methods, as well as a description of the MMR in relation to the SVM. There is also a description of the features used in the prediction and how they are computed. The data source, and how betting companies compute their odds and make their profit, are included as well.

At the start there is the Data section (2), providing background information on the type of data the computations are made with, followed in Section 3 by an explanation of betting odds themselves, how they are computed, and what the secret of the betting companies is to make an assured profit. The following part of the Thesis is about the classical methods for solving such prediction problems, starting with a description of kernels and some kernel based methods like maximum margin classification in Section 4. Section 5 describes the difference between an SVM and the method used in this Thesis, the MMR, and the differences in the computation. After this, Section 7 describes the feature vectors used in the context of tennis prediction. Section 8 describes the mathematical formulation of the problem and which methods exactly have been used. Section 9 describes the solution of the optimization problems derived in Section 8 and the general scheme of the conditional gradient method. The implementation Section (10) gives a short, schematic description of how the computation is actually done in Matlab. Finally, Section 11 gives the results of the Thesis and shows how good the prediction itself is, considering raw {+1, -1} prediction on the one side and playing against betting companies on the other.

The general description of the learning method and the optimization technique used in this Thesis can be found in [12], and a similar framework used for large scale recommender systems can be found in [8] and [7]. The task can be described as a so-called matrix completion problem.


1.1 Applications

Some short sketches are given here of where this method could possibly be used.

1.1.1 Matchmaking in online gaming

The system discussed in this Thesis probably has more applications than those already mentioned. For example, many different online games have the problem of finding good opponents that match the skill level of the player. The Elo system is used in many games, e.g. League of Legends [2]. But the number of people complaining about the so-called "Elo Hell" [1] is very large; they feel that Elo points can be gained or lost somewhat at random, because League of Legends is a 5v5 game and you may repeatedly get weak players on your team. One issue is the start of the ranking: everybody begins with an Elo of 1200, and there are only 10 placement matches in which you can gain or lose more Elo than usual. With the MMR it would probably be possible to make better matches, with equal opponents, and to use not only the match result in the calculation of the ranks but also other factors like the Kill-Death-Assist ratio (KDA) or Towers Killed. Another alternative for skill rating can be found in [6].

1.1.2 Financial game

Another possible application is the stock market. For example, you can interpret the gain or loss between two currencies over one day as a match between them and in this way obtain the data for the model. And as there is a large number of currency rates, you can obtain the feature vectors by looking at the "matches" between them. For this we could probably drop results whose changes are highly correlated. It could be a very pleasant setting to predict because of the completeness of the data.


Figure 1: Sample of the provided data

2 Data

2.1 Data source

The data used for this Thesis is from http://tennis-data.co.uk. There is data for every year in a separate CSV file, containing all the matches played in that year. Besides the names of the two players facing each other, there is not only the winner and the number of sets he won but also the outcome of every single set, along with information about the surface on which the match was played and the tournament level. Odds for each match from five different betting vendors (Bet365, Expekt, Ladbrokes, Pinnacle, Stan James) are provided too. An example of the data can be seen in Figure 1. There you can see the player names (columns Winner, Loser), their respective ranks (WRa, LRa) and ranking points (WPts, LPts), the games won by the first and the second player in each set (W1, L1, ..., W3, L3), the sets won by each player (WSets, LSets), and the odds of the first company for the respective match (B365W, B365L). The site itself provides information about tennis matches all over the world played on the WTA and ATP circuits and gives direct links to betting companies.

2.2 Interpretation

We are given a set of players P and a set of matches M ⊆ P × P × T, where T is a set of time labels marking the time when the match happened. In this way every match is characterized by a tuple (p_i, p_j, t) saying that p_i played a match against p_j at time t. We can assign further attributes to the matches:
• the concrete outcome, e.g. in tennis the results of the games within a match,
• the ranks of the players at the time when the match was played,
• forecasts of the outcome of the matches provided by experts.

The task is to predict a possible result of a match before that match is played, based on the data given by the matches for which the results and other attributes are available. Some notation should be introduced to simplify the description of the prediction procedure: n_P = card(P) is the number of players. In the first case suppose that the result of a match is given, and we are going to predict this value for a new match. To this end let us create a matrix R of size n_P × n_P which contains real numbers r_ij reflecting the results of those matches which have been played between players p_i and p_j. These real numbers are computed by a function acting on the set of results of the common matches of players p_i and p_j, namely
$$r_{ij} = \operatorname{sgn}(s_i - s_j)\,\frac{\sum |s_i - s_j|^{d_1}}{n^{d_2}} \qquad (1)$$
where s_i is the number of sets won by player p_i, s_j is the number of sets won by player p_j, the sum runs over the common matches, d_1 and d_2 are tunable parameters, and n is the number of matches between player i and player j. Other approaches to compute a baseline could be
$$\log\!\left(\frac{s_i + o}{s_j + o}\right),$$
where o is an offset to avoid division by zero, or simply the set difference s_i - s_j. To avoid differences between matches of different length (e.g. best-of-three and best-of-five), the set counts are normalized as
$$s_i \leftarrow \frac{s_i}{\max(s_i, s_j)}.$$
Some properties of this type of approach:
• (1) allows us to estimate the probability that player p_i can win, based on the past,
• if there were no common matches between players p_i and p_j then r_ij = 0, thus the probability of winning is the same for both players,
• r_ij + r_ji = 0, i.e. the probabilities are complementary.
These probabilities are based only on the matches played between the two players; no possible interdependence with matches played against other players is considered.

The prediction of the result of a match between players p_i and p_j now boils down to the prediction of the element $\tilde{r}_{ij}$ of the matrix $\tilde{R}$, and if the predicted value is positive then p_i, otherwise p_j, is the winner. To write up the concrete learning task predicting those values we need to define the feature representation of the entities, players and match outcomes; this will be detailed in Section 7.

Remark 1. It was a problem that the original data was ordered: the player who won the match was always in the first position and the one who lost in the second, so the prediction tended to favour the first player rather than the one who is more likely to win. Using a pseudorandom generator we swap the players so that the winner of the match can be the first or the second player with the same chance.
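To make equation (1) concrete, the following Matlab sketch computes r_ij for one pair of players. It is not the thesis code; the function and variable names are hypothetical, and it reads the sign in (1) as that of the aggregate normalized set difference (one of several possible readings).

```matlab
% Sketch of equation (1): pairwise result value for players i and j.
% setsI and setsJ are column vectors with the sets won by each player in
% their n common matches; d1 and d2 are the tuning parameters of the text.
function rij = pair_result(setsI, setsJ, d1, d2)
    n = numel(setsI);                    % number of common matches
    if n == 0
        rij = 0;                         % no common matches: both players equal
        return;
    end
    m  = max(max(setsI, setsJ), 1);      % normalize away the match-length difference
    si = setsI ./ m;
    sj = setsJ ./ m;
    d  = si - sj;                        % normalized set differences per match
    rij = sign(sum(d)) * sum(abs(d).^d1) / n^d2;
end
```

Note that r_ij + r_ji = 0 holds by construction, matching the properties listed above.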

3 Betting and Odds

As soon as you start predicting sports results, no matter what sport and for what events, betting and odds have to be mentioned. The theory in this Section is only an estimate of the way companies compute their odds, because it is very hard to find an exact description from any betting company of how they do it. In Figure 2 you can see the value of $1/o_1 + 1/o_2$ for one agency; with this value the following can be assumed. An odd is the inverse of the probability p of an event. The probability for and against the event follows a Bernoulli distribution with parameters p, q where p + q = 1. The odds can be computed as
$$o_1 = \frac{1}{p} \qquad \text{and} \qquad o_2 = \frac{1}{1-p}.$$

Example 2. Assume p = 0.5, therefore q = 1 - p = 0.5, and the same amount of money m is placed on the event and against the event. Then the odds are o_1 = 2 and o_2 = 2, and we see that the betting company would have to pay back the same amount of money as it received. To ensure their profit, betting companies usually compute their odds with
$$p + q = \frac{1}{o_1} + \frac{1}{o_2} = 1 + f,$$
where f is the additional profit of the company itself. As mentioned in Section 2, the data used contains betting odds from multiple betting companies. By computing f from the data, values of 0.06 to 0.08 were observed, which means that the betting companies add an additional 6 to 8 percent to the odds to ensure their own profit. So the challenge in predicting sports results is not only to play against probability, but against probability plus the betting companies' fee: even if you manage to win against the randomness, you still have to overcome the roughly 8% gap built into the companies' odds.

Figure 2: Computation of the betting fee
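A minimal Matlab sketch of the fee computation described above; the example quotes and column names are purely illustrative, not taken from the data.

```matlab
% Estimate the bookmaker fee f (the "overround") from a pair of decimal odds,
% e.g. the B365W/B365L columns of the data set.
oW = 1.45;  oL = 2.90;                 % example quotes for one match
f  = 1/oW + 1/oL - 1;                  % 0 for fair odds; roughly 0.06-0.08 in the data
% implied (fee-free) winning probabilities after removing the margin
pW = (1/oW) / (1 + f);
pL = (1/oL) / (1 + f);
fprintf('fee f = %.3f, implied P(win) = %.3f / %.3f\n', f, pW, pL);
```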

4 Methods

4.1 Kernel Based Methods

4.1.1 Kernel

To understand kernel based methods it is essential to know what a kernel is [11]. The following section shortly describes what a kernel is and what its characteristics are. This section gives a summary of Sections 3 and 5 of the book "An introduction to Support Vector Machines and other kernel-based learning methods" by Nello Cristianini and John Shawe-Taylor [5].

A kernel function is a function K such that for all x, z ∈ X
$$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle, \qquad (2)$$
where φ is a mapping from X to an (inner product) feature space F [5]. One important consequence is that the dimension of the feature space does not affect the computation. This makes it possible to map the data into the feature space and to train the machine there. The only information needed about the training examples is their kernel matrix K. Once this matrix is available, the decision rule, based on the theory of reproducing kernel Hilbert spaces, can be evaluated in roughly ℓ steps, where ℓ is the number of items:
$$f(x) = \sum_{i=1}^{\ell} \alpha_i\, y_i\, K(x_i, x) + b. \qquad (3)$$
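The following didactic Matlab sketch (not part of the thesis code) illustrates equation (2): the degree-2 polynomial kernel on R^2 equals the inner product of an explicit feature map φ, so the kernel value can be computed without ever forming φ.

```matlab
% Kernel trick in miniature: (x'z)^2 equals <phi(x), phi(z)> with
% phi(v) = [v1^2, sqrt(2)*v1*v2, v2^2].
x = [1.0; 2.0];
z = [0.5; -1.5];
phi = @(v) [v(1)^2; sqrt(2)*v(1)*v(2); v(2)^2];   % explicit feature map
k_explicit = phi(x).' * phi(z);                    % inner product in F
k_kernel   = (x.' * z)^2;                          % kernel evaluation, eq. (2)
fprintf('explicit: %.4f   kernel: %.4f\n', k_explicit, k_kernel);  % identical
```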

4.1.2 Characterisation of Kernels

In this section some of the most important characterisations of kernels are pointed out. The focus is on Mercer's theorem, see [5].

Theorem 3 (Mercer). Let X be a compact subset of $\mathbb{R}^n$. Suppose K is a continuous symmetric function such that the integral operator $T_K : L_2(X) \to L_2(X)$,
$$(T_K f)(\cdot) = \int_X K(\cdot, x)\, f(x)\, dx, \qquad (4)$$
is positive, that is
$$\int_{X \times X} K(x, z)\, f(x)\, f(z)\, dx\, dz \ge 0 \qquad (5)$$
for all $f \in L_2(X)$. Then we can expand K(x, z) in a uniformly convergent series on X × X in terms of $T_K$'s eigenfunctions $\phi_j \in L_2(X)$, normalised so that $\|\phi_j\|_{L_2} = 1$, with positive associated eigenvalues $\lambda_j \ge 0$:
$$K(x, z) = \sum_{j=1}^{\infty} \lambda_j\, \phi_j(x)\, \phi_j(z). \qquad (6)$$

Proposition 4. There are some rules that are applicable to kernels. Let $K_1$ and $K_2$ be kernels over X × X with $X \subseteq \mathbb{R}^n$, $a \in \mathbb{R}^+$, f(·) a real-valued function on X, $\phi : X \to \mathbb{R}^m$ (7), $K_3$ a kernel over $\mathbb{R}^m \times \mathbb{R}^m$, and B a symmetric positive semi-definite n × n matrix. Then the following functions are kernels:
• K(x, z) = K_1(x, z) + K_2(x, z)
• K(x, z) = a K_1(x, z)
• K(x, z) = K_1(x, z) K_2(x, z)
• K(x, z) = f(x) f(z)
• K(x, z) = K_3(φ(x), φ(z))
• K(x, z) = x'Bz
This is helpful, as we can construct more complex kernels out of simple ones. For details and proofs see [5].
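The combination rules of Proposition 4 can be checked numerically: a valid kernel matrix must be positive semi-definite. The sketch below (illustrative only; the data and kernel choices are arbitrary) builds two base kernels and verifies that their sum, a scaled version, and their element-wise product have no significantly negative eigenvalues.

```matlab
% Numerical check of Proposition 4 on a small random sample.
rng(0);
X  = randn(20, 3);                        % 20 points in R^3
sq = sum(X.^2, 2);
D2 = bsxfun(@plus, sq, sq.') - 2*(X*X.'); % squared Euclidean distances
K1 = X * X.';                             % linear kernel
K2 = exp(-D2 / 2);                        % Gaussian kernel, sigma = 1
combos = {K1 + K2, 3*K1, K1 .* K2};       % sum, scaling, product rules
for c = 1:numel(combos)
    Ks = (combos{c} + combos{c}.') / 2;   % symmetrize against round-off
    fprintf('min eigenvalue of combination %d: %+.2e\n', c, min(eig(Ks)));
end
```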

4.1.3 Support Vector Machines

Further on in this section the Support Vector Machine (SVM) is described, because the method used in this Thesis is derived from it. The aim of an SVM is to find the best possible separating hyperplane in a high dimensional feature space. We can, for example, optimize the maximal margin, the margin distribution, or the number of support vectors.

4.1.4 Maximum margin classification

As probably the simplest example of an SVM, the maximal margin classifier is briefly described; it was also the first one introduced. Its disadvantage is that it only works with linearly separable data in a feature space, which does not occur often in the real world.

Proposition 5. Given a linearly separable training sample $S = ((x_1, y_1), \ldots, (x_\ell, y_\ell))$, the hyperplane (w, b) that solves the optimization problem
$$\min_{w,b}\ \langle w \cdot w\rangle \quad \text{subject to} \quad y_i\left(\langle w \cdot x_i\rangle + b\right) \ge 1,\ i = 1, \ldots, \ell, \qquad (8)$$
realises the maximal margin hyperplane with geometric margin $\gamma = \frac{1}{\|w\|_2}$.

We can derive from the Lagrangian of the primal [4],
$$L(w, b, \alpha) = \frac{1}{2}\langle w \cdot w\rangle - \sum_{i=1}^{\ell} \alpha_i \left[ y_i\left(\langle w \cdot x_i\rangle + b\right) - 1 \right],$$
the corresponding dual form by differentiating with respect to w and b and substituting the relations obtained,
$$w = \sum_{i=1}^{\ell} y_i \alpha_i x_i, \qquad 0 = \sum_{i=1}^{\ell} y_i \alpha_i,$$
back into the primal to obtain
$$L(w, b, \alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j\rangle.$$

Proposition 6. Consider the linearly separable training sample $S = ((x_1, y_1), \ldots, (x_\ell, y_\ell))$ and suppose the parameters $\alpha^*$ solve the following quadratic optimization problem:
$$\begin{aligned} \text{maximise} \quad & W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j\rangle, \\ \text{subject to} \quad & \sum_{i=1}^{\ell} y_i \alpha_i = 0, \quad \alpha_i \ge 0,\ i = 1, \ldots, \ell. \end{aligned} \qquad (9)$$
Then the weight vector $w^* = \sum_{i=1}^{\ell} y_i \alpha_i^* x_i$ realizes the maximal margin hyperplane with geometric margin $\gamma = \frac{1}{\|w^*\|_2}$.

4.1.5 Soft margin classification

In this Section there is a short description of the more practical soft margin classification method. It allows classification of data that is not linearly separable by introducing slack variables $\xi_i$, which measure the degree of misclassification; the hyperplane then splits the data as cleanly as possible. The optimization problem from Section 4.1.4 now becomes
$$\min_{w,\xi,b}\ \left\{ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i \right\}$$
subject to
$$y_i\,(w \cdot x_i - b) \ge 1 - \xi_i, \qquad \xi_i \ge 0.$$
Minimizing $\|w\|$ under these constraints leads to the saddle point problem
$$\min_{w,\xi,b}\ \max_{\alpha,\beta}\ \left\{ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y_i(w \cdot x_i - b) - 1 + \xi_i\right] - \sum_{i=1}^{n}\beta_i \xi_i \right\}$$
with $\alpha_i, \beta_i \ge 0$, which leads us to the corresponding dual form
$$\tilde{L}(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, k(x_i, x_j)$$
subject to (for any i = 1, ..., n)
$$0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n}\alpha_i y_i = 0.$$
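The soft margin dual above is a standard quadratic programme, so it can be solved with a generic QP solver. The Matlab sketch below (an illustration, not the thesis' MMR solver; it needs the Optimization Toolbox and uses a linear kernel for brevity) assembles the dual and calls quadprog.

```matlab
% Solve the soft-margin SVM dual with quadprog.
% X: m-by-d data matrix, y: m-by-1 vector of +/-1 labels, C: box constraint.
function [alpha, b] = svm_dual(X, y, C)
    m = size(X, 1);
    K = X * X.';                          % linear kernel; any PSD kernel works
    H = (y * y.') .* K;                   % quadratic term of the dual
    H = (H + H.') / 2 + 1e-10 * eye(m);   % symmetrize / regularize for the solver
    f = -ones(m, 1);                      % quadprog minimizes 0.5*a'*H*a + f'*a
    opts  = optimoptions('quadprog', 'Display', 'off');
    alpha = quadprog(H, f, [], [], y.', 0, zeros(m,1), C*ones(m,1), [], opts);
    % bias for the decision function f(x) = w.x - b used in the text,
    % recovered from margin support vectors (0 < alpha_i < C)
    sv = alpha > 1e-6 & alpha < C - 1e-6;
    b  = mean(K(sv,:) * (alpha .* y) - y(sv));
end
```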

5 Reinterpretation of the normal vector w

As already mentioned, the method used for the prediction in this Thesis is derived from the SVM; in this section the differences are shown. The normal vector w formally behaves as a linear transformation acting on the feature vectors, which gives rise to the idea of extending the capability of the original schema. This reinterpretation can be characterized briefly in the following way.

SVM:
• w is the normal vector of the separating hyperplane.
• $y_i \in \{-1, +1\}$ are binary outputs.
• The labels are equal to the binary outputs.

Extended view:
• W is a linear operator projecting the feature space into the label space.
• $y_i \in Y$ are arbitrary outputs.
• $\psi(y_i) \in H_\psi$ are the labels, the outputs embedded in a linear vector space.

If we apply a one-dimensional normalized label space invoking binary labels {-1, +1} in the general framework, one can restore the original scenario of the SVM, and the normal vector is a projection into the one-dimensional label space. The geometry of the extended learning problem can be demonstrated by decomposing the linear transformation W into elementary transformations of the Euclidean space. Furthermore we can include the translation operator by including a vector valued bias, thus the full transformation is an affine one, being equal to a linear transformation plus a translation. The components of the linear transformation can be derived by the singular value decomposition $W = UDV'$, and they play the following roles in carrying the input configuration over to the output configuration: V acts as a rotation of the input configuration, D as a scaling and projection, U as a rotation into the output configuration, and the bias b as a translation. Thus, the extended form of the SVM tries to find an affine transformation which maps the configuration of the input items so as to gain the highest similarity between the image of the inputs and the outputs.

In summarizing the learning task we end up with the following optimization problem, presented in parallel with the original primal form of the SVM to emphasize the similarities and dissimilarities between the original and the extended form.

Primal problems for maximum margin learning (10):

Binary class learning, Support Vector Machine (SVM):
$$\begin{aligned} \min \quad & \tfrac{1}{2}\, w'w + C\,\mathbf{1}'\xi, \qquad \tfrac{1}{2}w'w = \tfrac{1}{2}\|w\|_2^2, \\ \text{w.r.t.} \quad & w : H_\phi \to \mathbb{R} \ \text{(normal vector)},\ b \in \mathbb{R} \ \text{(bias)},\ \xi \in \mathbb{R}^m \ \text{(error vector)}, \\ \text{s.t.} \quad & y_i\left(w'\phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi \ge 0,\ i = 1, \ldots, m. \end{aligned}$$

Vector label learning, Maximum Margin Robot (MMR):
$$\begin{aligned} \min \quad & \tfrac{1}{2}\,\mathrm{tr}(W'W) + C\,\mathbf{1}'\xi, \qquad \tfrac{1}{2}\mathrm{tr}(W'W) = \tfrac{1}{2}\|W\|_F^2, \\ \text{w.r.t.} \quad & W : H_\phi \to H_\psi \ \text{(linear operator)},\ b \in H_\psi \ \text{(translation, bias)},\ \xi \in \mathbb{R}^m \ \text{(error vector)}, \\ \text{s.t.} \quad & \langle \psi(y_i),\, W\phi(x_i) + b\rangle_{H_\psi} \ge 1 - \xi_i, \quad \xi \ge 0,\ i = 1, \ldots, m. \end{aligned} \qquad (10)$$

In the extended formulation we exploit the fact that the Frobenius norm and the Frobenius inner product correspond to the linear vector space of matrices whose dimension is equal to the number of elements of the matrices; hence there is an isomorphism between the space spanned by the normal vector of the hyperplane occurring in the SVM and the space spanned by the linear transformations. One can recognize that if no bias term is included in the MMR problem, then we have a completely symmetric relationship between the label and the feature space via the representations of the input and the output items, namely
$$\langle \psi(y_i),\, W\phi(x_i)\rangle_{H_\psi} = \langle W^{*}\psi(y_i),\, \phi(x_i)\rangle_{H_\phi} = \langle \phi(x_i),\, W^{*}\psi(y_i)\rangle_{H_\phi}.$$
Thus, in predicting the input items as the image of a linear function defined on the outputs, the adjoint of W, $W^{*}$, is involved. This adjoint is equal to the transpose of the matrix representation of W when both the label space and the feature space are finite dimensional.

5.1 Dual problem

The dual problem of the MMR presented in the right column of (10) is given by
$$\begin{aligned} \min \quad & \sum_{i,j=1}^{m} \alpha_i\alpha_j \underbrace{\langle\phi(x_i),\phi(x_j)\rangle}_{\kappa^\phi_{ij}}\, \underbrace{\langle\psi(y_i),\psi(y_j)\rangle}_{\kappa^\psi_{ij}} \;-\; \sum_{i=1}^{m}\alpha_i, \\ \text{w.r.t.} \quad & \alpha \in \mathbb{R}^m, \\ \text{s.t.} \quad & \sum_{i=1}^{m}(\psi(y_i))_t\,\alpha_i = 0, \quad t = 1,\ldots,\dim(H_\psi), \\ & 0 \le \alpha_i \le C, \quad i = 1,\ldots,m, \end{aligned}$$
where $\kappa^\phi_{ij}$ are the kernel items corresponding to the feature vectors and $\kappa^\psi_{ij}$ are the kernel items corresponding to the label vectors. The symmetry of the objective function is clearly recognizable, showing that the underlying problem without bias is completely reversible. The explicit occurrences of the label vectors can be transformed into implicit ones by exploiting that the feasibility domain covered by the constraints
$$\sum_{i=1}^{m}(\psi(y_i))_t\,\alpha_i = 0, \quad t = 1,\ldots,\dim(H_\psi),$$
coincides with the domain
$$\sum_{i=1}^{m}\kappa^\psi_{ij}\,\alpha_i = 0, \quad j = 1,\ldots,m,$$
referring only to inner products of the label vectors.

6 Prediction

After solving the dual problem, with the help of the optimum dual variables we can write up the optimal linear operator
$$W = \sum_{i=1}^{m} \alpha_i\, \psi(y_i)\,\phi(x_i)'.$$
Comparing this expression with the corresponding formula which gives the optimal solution of the SVM, i.e.
$$w = \sum_{i=1}^{m} \alpha_i\, y_i\, \phi(x_i),$$
we can see that the new part includes the vectors representing the output items, which in the SVM were only scalar values; we could say that in the new interpretation they are one-dimensional vectors. With the expression of the linear operator W at hand, the prediction for a new input item x can be written up as
$$\psi(y) = W\phi(x) = \sum_{i=1}^{m} \alpha_i\, \psi(y_i)\, \underbrace{\langle\phi(x_i), \phi(x)\rangle}_{\kappa^\phi(x_i,\, x)}.$$
It involves only the input kernel $\kappa^\phi$ and provides the implicit representation $\psi(y)$ of the corresponding output y. Because only the implicit image of the output is given, we need to invert the function ψ to obtain y. This inversion problem is called the pre-image problem. Unfortunately there is no general procedure to do that. We mention here a schema that can be applied when the set of all possible outputs is finite with a reasonably small cardinality. The meaning of "reasonably small" depends on the given problem, e.g. on how expensive it is to compute the inner products between the output items in the label space where they are represented. Under the conditions mentioned we can follow this scenario: let $\tilde{Y} = \{y_1, \ldots, y_K\}$, $K \ll \infty$, be the set of the possible outputs, and take
$$y^* = \arg\max_{y\in\tilde{Y}}\ \psi(y)'\,W\phi(x) = \arg\max_{y\in\tilde{Y}}\ \sum_{i=1}^{m} \alpha_i\, \underbrace{\langle\psi(y),\psi(y_i)\rangle}_{\kappa^\psi(y,\, y_i)}\, \underbrace{\langle\phi(x_i),\phi(x)\rangle}_{\kappa^\phi(x_i,\, x)}.$$

The main advantage of this approach is that it requires only the inner products in the label space; hence it is independent of the representation of the output items and can be applied in any complex structural learning problem, e.g. on graphs. Probably the best candidate for $\tilde{Y}$ is the training set.
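The pre-image step above reduces to scoring each candidate output and keeping the best one. A Matlab sketch of this (illustrative names; the kernel evaluations are assumed to be precomputed, which is not shown here):

```matlab
% Score every candidate output by sum_i alpha_i * Kpsi(y, y_i) * Kphi(x_i, x)
% and return the best candidate.
% alpha:    m-by-1 optimal dual variables
% KphiX:    m-by-1 input kernel values between the training items and the new x
% KpsiCand: K-by-m label kernel between the candidate outputs and training labels
function [ystar, scores] = mmr_predict(alpha, KphiX, KpsiCand, candidates)
    scores = KpsiCand * (alpha .* KphiX);   % one score per candidate output
    [~, k] = max(scores);
    ystar  = candidates(k);
end
```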

7 Feature Vectors

We can create feature vectors which characterize a player $p_i$ based on his or her earlier performance by choosing row i or column i of $\tilde{R}$ (see the details in Section 2), since these vectors contain all the results of the matches relating to player $p_i$. We denote the corresponding features by $\phi^r_i$ for the row vector and $\phi^c_i$ for the column vector.

Some notation: let the general knowledge we have about player i be denoted by $x_i$. The collection of the knowledge about all players is denoted by X, thus $x_i \in X$. We do not assume any special structure on this space; we might say it consists of all the information that we have about the players. Now the feature vector of a player is a function $\phi : X \to H_\phi$ which assigns to the knowledge $x_i$ about a player a vector $\phi(x_i)$ of a Hilbert space $H_\phi$. As we see, the space of the feature vectors has a special structure, namely it is a linear vector space equipped with an inner product, thus this representation can be used in a computational procedure. In some parts of the text the feature vectors are referred to as the representation of the knowledge about the players. In this notational context we can state that $\phi(x_i)$ is equal to row i of matrix $\tilde{R}$.

Example 7. Matches (the entry in row i, column j is $r_{ij}$; empty cells mean the two players never met):

    Players |    1      2      3      4      5
       1    |         1.5    1.0           1.8
       2    |  -1.5          3.0          -1.7
       3    |  -1.0   -3.0          1.34   3.2
       4    |               -1.34
       5    |  -1.8    1.7   -3.2                   (11)

7.1 Relation features

Feature vectors representing the relationship between two players, i.e. who might be the stronger and who the weaker, are based on the $\tilde{r}_{ij}$ values. These values can be interpreted as estimates of underlying random variables, thus we may use an estimate of the distribution of these variables as features. This distribution can express both our prior knowledge and, at the same time, our uncertainty about these values. Let the feature vector representing $\tilde{r}_{ij}$ be the density function of a normal distribution (see Figure 3), defined by
$$\psi(\tilde{r}_{ij}) = f(\cdot \mid \tilde{r}_{ij}, \sigma) \sim \mathcal{N}(\tilde{r}_{ij}, \sigma^2), \qquad (12)$$
thus we have a normal density with expected value equal to $\tilde{r}_{ij}$, the prior knowledge, and standard deviation σ reflecting our uncertainty. Some remarks about this type of feature vector:
• We assume that all feature vectors share the same standard deviation σ.
• σ should be found, for example, by cross-validation.
Since these functions are square integrable, we can find a Hilbert space of square integrable functions which contains our feature vectors. Let us denote this Hilbert space by $H_\psi$. In the space $H_\psi$ we can compute inner products, and in this way distances and angles between any two feature vectors. These feature vectors, as functions, have infinite dimension, but their inner products can be computed straightforwardly: they are the elements of a Gaussian kernel matrix with width $2\sigma^2$,
$$\langle\psi(\tilde{r}_{ij}),\, \psi(\tilde{r}_{kl})\rangle = C_{\mathrm{Gaussian}}\,\exp\!\left(-\frac{\|\tilde{r}_{ij} - \tilde{r}_{kl}\|_2^2}{2\sigma^2}\right), \qquad (13)$$
where $C_{\mathrm{Gaussian}}$ is a constant that we can drop since it has no effect on the outcome of the learning procedure.
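A short Matlab sketch of equation (13); the example values of rvals and sigma are purely illustrative.

```matlab
% Label kernel between relation features: a Gaussian kernel on the rtilde
% values, with the constant C_Gaussian dropped.
rvals = [0.8; -0.3; 1.2; 0.0];                 % rtilde entries of the training matches
sigma = 0.5;                                   % shared width, chosen by cross-validation
D2    = bsxfun(@minus, rvals, rvals.').^2;     % squared differences
Kpsi  = exp(-D2 / (2*sigma^2));                % equation (13)
```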

Figure 3: Normal distribution

8 The base problem, formal description

We are given a set of players $\mathcal{P} = \{p_1, \ldots, p_M\}$ and a rating function $r : D \subseteq \mathcal{P}\times\mathcal{P} \to \mathbb{R}$, whose value r(p, q) represents how a player $p \in \mathcal{P}$ played against a player $q \in \mathcal{P}$. We use the notation $r_{pq}$ instead of r(p, q) in the sequel. Note that not every pair of players necessarily has a value $r_{pq}$. Let $D_p$ be the set of players who played against player p, and $D_q$ the set of players who played against player q. Based on the match outcomes we define the residual ratings by an additive model,
$$\hat{r}_{pq} = r_{pq} - \bar{r}_p - \bar{r}_q + \bar{r}, \qquad (14)$$
where
$$\bar{r}_p = \frac{1}{|D_p|}\sum_{q\in D_p} r_{pq}, \qquad \bar{r}_q = \frac{1}{|D_q|}\sum_{p\in D_q} r_{pq}, \qquad \bar{r} = \frac{1}{|D|}\sum_{(p,q)\in D} r_{pq}. \qquad (15)$$
Note that the expected value of the residual rating is equal to zero. The knowledge about a player p is denoted by $x_p$, and it is represented in a Hilbert space $H_\phi$ by the map φ; thus the representation is denoted by $\phi(x_p)$ for all $p \in \mathcal{P}$. Similarly, the residual rating assigned to the pair (p, q) is represented in another Hilbert space $H_\psi$ by the map ψ, and the value of this map is denoted by $\psi(\hat{r}_{pq})$ for all $(p,q) \in D$. We assume about these representations that there are kernel functions $\kappa^\phi$ and $\kappa^\psi$ in $H_\phi$ and $H_\psi$ which provide the values of the inner products, namely
$$\kappa^\phi(q_i, q_j) = \langle\phi(q_i), \phi(q_j)\rangle, \qquad \kappa^\psi(\hat{r}_{pq}, \hat{r}_{jv}) = \langle\psi(\hat{r}_{pq}), \psi(\hat{r}_{jv})\rangle. \qquad (16)$$
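A Matlab sketch of the additive model (14)-(15). The NaN convention for pairs that never played and the variable names are assumptions of this sketch; the numbers are the matrix of Example 7 as reconstructed above.

```matlab
% Residual ratings from a partially observed result matrix R.
R = [NaN  1.5  1.0  NaN  1.8;
    -1.5  NaN  3.0  NaN -1.7;
    -1.0 -3.0  NaN  1.34 3.2;
     NaN  NaN -1.34 NaN  NaN;
    -1.8  1.7 -3.2  NaN  NaN];
played = ~isnan(R);
Rz = R;  Rz(~played) = 0;                        % zero out missing pairs for the sums
nrow = max(sum(played, 2), 1);                   % |D_p|, guarded against empty rows
ncol = max(sum(played, 1), 1);                   % |D_q|
rbar_p = sum(Rz, 2) ./ nrow;                     % row means over played matches
rbar_q = (sum(Rz, 1) ./ ncol).';                 % column means over played matches
rbar   = sum(Rz(:)) / max(nnz(played), 1);       % overall mean over D
Rhat = R - bsxfun(@plus, rbar_p, rbar_q.') + rbar;   % eq. (14); stays NaN where unplayed
```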

To learn the relationship between the residual ratings and the pairs of players, the following optimization problem can be formulated:
$$\begin{aligned} \min \quad & \frac{1}{2}\|W\|^2 + C\sum_{p\in\mathcal{P}}\zeta_p \\ \text{w.r.t.} \quad & W, \zeta \\ \text{s.t.} \quad & \langle\psi(\hat{r}_{pq}),\, W_q\phi(x_p)\rangle \ge 1 - \zeta_p, \quad (p,q)\in D, \\ & \zeta_p \ge 0, \quad p\in\mathcal{P}. \end{aligned} \qquad (17)$$
Note that the minimization will be achieved when the vectors $W_q\phi(x_p)$ are as uniformly aligned as possible with the vectors $\psi(\hat{r}_{pq})$. To solve this constrained optimization problem, we define the Lagrangian
$$L = \frac{1}{2}\sum_{q\in\mathcal{P}}\|W_q\|^2 + C\sum_{p\in\mathcal{P}}\zeta_p - \sum_{(p,q)\in D}\alpha_{pq}\left(\langle\psi(\hat{r}_{pq}),\, W_q\phi(x_p)\rangle - 1 + \zeta_p\right) - \sum_{p\in\mathcal{P}}\lambda_p\zeta_p,$$
where $\alpha_{pq} \ge 0$ and $\lambda_p \ge 0$ are Lagrange multipliers enforcing the constraints $\langle\psi(\hat{r}_{pq}), W_q\phi(x_p)\rangle \ge 1 - \zeta_p$ and $\zeta_p \ge 0$ respectively. The optimum mapping is found by solving
$$\min_{\{W_q\},\{\zeta_p\}}\ \max_{\{\alpha_{pq}\},\{\lambda_p\}}\ L$$
subject to the constraints that $\alpha_{pq} \ge 0$ for all $(p,q)\in D$ and $\lambda_p \ge 0$ for all $p\in\mathcal{P}$. For a general linear mapping $W_q$ we have that
$$\frac{\partial}{\partial W_q}\,\langle\psi(\hat{r}_{pq}),\, W_q\phi(x_p)\rangle = \psi(\hat{r}_{pq}) \otimes \phi(x_p),$$
where ⊗ is the tensor product of the two vectors. This is clearly the case when the Hilbert spaces have finite dimensions, so that the mapping $W_q$ can be represented by a matrix, but it can be extended to more general linear mappings between Hilbert spaces under certain conditions. Thus,
$$\frac{\partial L}{\partial W_q} = W_q - \sum_{p\in D_q}\alpha_{pq}\,\psi(\hat{r}_{pq}) \otimes \phi(x_p).$$
The Lagrangian is minimized with respect to $W_q$ when $W_q = \sum_{p\in D_q}\alpha_{pq}\,\psi(\hat{r}_{pq}) \otimes \phi(x_p)$. Taking derivatives with respect to $\zeta_p$ we find
$$\frac{\partial L}{\partial \zeta_p} = C - \sum_{q\in D_p}\alpha_{pq} - \lambda_p.$$
Setting these derivatives to 0 we find that the Lagrangian is minimized with respect to $\zeta_p$ when
$$\sum_{q\in D_p}\alpha_{pq} = C - \lambda_p \le C,$$
where the inequality arises because $\lambda_p \ge 0$. After substituting the expressions containing only the Lagrange multipliers back into the Lagrangian, we receive the dual problem of (17), which is a maximization problem with respect to the variables $\alpha_{pq}$:
$$f(\alpha) = -\frac{1}{2}\sum_{q\in\mathcal{P}}\sum_{p,p'\in D_q}\alpha_{pq}\alpha_{p'q}\,\langle\psi(\hat{r}_{pq}),\psi(\hat{r}_{p'q})\rangle\,\langle\phi(x_p),\phi(x_{p'})\rangle + \sum_{(p,q)\in D}\alpha_{pq},$$
subject to the constraint that $\alpha \in Z(\alpha)$ where
$$Z(\alpha) = \left\{\alpha \;\middle|\; \forall p\in\mathcal{P}: \sum_{q\in D_p}\alpha_{pq} \le C \ \wedge\ \forall(p,q)\in D: \alpha_{pq} \ge 0\right\}.$$
We are now in a position to apply the usual kernel trick. The kernel functions can be defined by
$$K_{\hat{r}}(\hat{r}_{pq}, \hat{r}_{p'q}) = \langle\psi(\hat{r}_{pq}),\psi(\hat{r}_{p'q})\rangle, \qquad K_q(x_p, x_{p'}) = \langle\phi(x_p),\phi(x_{p'})\rangle,$$
and then we can write the dual problem as
$$\begin{aligned} \max \quad & f(\alpha) = -\frac{1}{2}\sum_{q\in\mathcal{P}}\sum_{p,p'\in D_q}\alpha_{pq}\alpha_{p'q}\,K_{\hat{r}}(\hat{r}_{pq},\hat{r}_{p'q})\,K_q(x_p, x_{p'}) + \sum_{(p,q)\in D}\alpha_{pq} \\ \text{s.t.} \quad & \sum_{q\in D_p}\alpha_{pq} \le C, \ \forall p\in\mathcal{P}, \qquad \alpha_{pq} \ge 0, \ (p,q)\in D, \end{aligned} \qquad (18)$$
where we are free to choose any pair of positive definite kernel functions.

9 Optimization problem

We are going to solve the dual problem given by (18). Since the number of dual variables is equal to the cardinality of D, the complete optimization problem can be very large. To overcome the complexity caused by the number of dual variables we can exploit the special structure of the problem. To realize an efficient approach, a conditional gradient based decomposition algorithm has been developed. This algorithm has its roots in the one proposed in [10] and [9] to solve structural learning problems. The theoretical background of the conditional gradient algorithm, e.g. the analysis of its convergence, can be found for example in [3] and [4]. Here we focus on the version of this algorithm which can solve our concrete problem.


9.1 The base problem

To simplify the description of the algorithm which solves the problem given by (18), some notation ought to be introduced, namely
$$\alpha_q = (\alpha_{pq}),\ p\in D_q, \qquad \alpha_p = (\alpha_{pq}),\ q\in D_p, \qquad K_u = \left[K_{\hat{r}}(\hat{r}_{pq},\hat{r}_{p'q})\,K_q(x_p, x_{p'})\right],\ p, p'\in D_q, \qquad (19)$$
where the two sets of vectors $\{\alpha_q\}$ and $\{\alpha_p\}$ represent two independent partitions of the set of all dual variables given by α. Based on this notation the dual problem (18) can be written as
$$\begin{aligned} \min \quad & \frac{1}{2}\sum_{q\in\mathcal{P}}\langle\alpha_q, K_q\alpha_q\rangle - \sum_{q\in\mathcal{P}}\langle\mathbf{1},\alpha_q\rangle \\ \text{w.r.t.} \quad & \alpha \\ \text{s.t.} \quad & 0 \le \langle\mathbf{1},\alpha_p\rangle \le C_{\mathcal{P}},\ p\in\mathcal{P}, \qquad \alpha \ge 0, \end{aligned} \qquad (20)$$
where the vector 1 has all components equal to 1 and its dimension fits the corresponding α. The first observation we can make is that there is no coupling constraint between the components of α: each of them occurs in one and only one constraint. The second observation is that since the objective function is quadratic, its gradient is linear in α; furthermore the objective is a sum of terms which share no common components of α. These observations lead us to an algorithm which can exploit the independence of the constraints and of the terms of the objective. Because the partitions of the dual variables are distinct in the objective and in the constraints, they provide the interactions among the parts of the entire optimization problem.

9.2 The general schema of conditional gradient method

The conditional gradient method is a simple gradient descent method suited to constrained problems. Let us consider the problem with a quadratic objective function, in a general form, and assume the constraint is a linear, polyhedral one:
$$\begin{aligned} \min \quad & f(z) \stackrel{\text{def}}{=} \tfrac{1}{2}z'Qz - q'z \\ \text{w.r.t.} \quad & z \\ \text{s.t.} \quad & z\in Z, \end{aligned} \qquad (21)$$
where Q is a positive definite matrix, q is an arbitrary vector and Z covers the linear constraint. Based on the first order optimality condition, and assuming $z^*$ is an optimum solution of problem (21), the inequality
$$\nabla f(z^*)'(z - z^*) \ge 0 \qquad (22)$$
has to hold for any z satisfying $z\in Z$. The basic idea exploited in this algorithm is to find a feasible solution which minimizes (22) at an approximation of the optimum solution and hence get closer to the real optimum. The advantage of this simple approach is that if Z covers a polyhedral constraint then, independently of the structure of f, we only need to solve a linear optimization problem in every approximation step. In our case the gradient is $\nabla f(z) = Qz - q$. The schema of the algorithm is given by the following steps.

Step 0. Let $z_0$ be an initial solution, $\epsilon_z > 0$ the expected accuracy, and t = 0.

Step 1. Compute the gradient $\nabla_z f(z)|_{z=z_t}$, for which we use the short notation $\nabla f(z_t)$.

Step 2. Solve the linear programming problem
$$\begin{aligned} \min \quad & \nabla f(z_t)'(z - z_t) \\ \text{w.r.t.} \quad & z \\ \text{s.t.} \quad & z\in Z. \end{aligned} \qquad (23)$$
Let the optimum solution be denoted by $z^*$.

Step 3. Compute the next approximation of the solution using
$$z_{t+1} = z_t + \tau(z^* - z_t), \qquad (24)$$
where the step size τ is derived by line search; the details of this computation are presented below. This approximation can be expressed as a convex combination of the previous and the new solution,
$$z_{t+1} = (1-\tau)z_t + \tau z^*, \qquad (25)$$
where the step size satisfies the constraint $\tau\in(0,1)$.

Step 4. If $\nabla f(z_t)'(z^* - z_t) \ge -\epsilon_z$ holds, then stop; else set t = t + 1 and go to Step 1.

The line search for τ can be carried out by solving a univariate unconstrained optimization problem,
$$\min_{\tau}\ \tfrac{1}{2}\left[z_t + \tau(z^* - z_t)\right]'Q\left[z_t + \tau(z^* - z_t)\right] - q'\left[z_t + \tau(z^* - z_t)\right]. \qquad (26)$$
Let $\Delta z_t = z^* - z_t$; then the optimum for τ can be computed by
$$\tau = -\frac{\nabla f(z_t)'\Delta z_t}{\Delta z_t'Q\,\Delta z_t} = -\frac{[Qz_t - q]'\Delta z_t}{\Delta z_t'Q\,\Delta z_t}. \qquad (27)$$
To solve our problem (20) efficiently, we need a simple scheme to evaluate the optimum solution of the linear subproblem given by (23), to compute the gradient, and to evaluate formula (27) expressing the optimal τ for the line search. Fast computation of the latter two relies on an efficient evaluation of the general matrix-vector product.

9.2.1 Casting the concrete problem into the general schema

In the solution of the dual problem (20) we have the following setting:
• The variable optimized is given by α.
• The matrix Q is equal to a block diagonal matrix K where the diagonal sub-matrices are equal to $K_u$ for each $u\in U$.
• The vector q is equal to the vector 1 whose components are all equal to 1 and which lives in the space $\mathbb{R}^{\mathrm{card}(D)}$.
• The feasibility domain is given by
$$Z = \{\alpha \mid 0 \le \langle\mathbf{1},\alpha_i\rangle \le C_I,\ i\in I,\ \text{and}\ \alpha \ge 0\}. \qquad (28)$$

9.3 Solution of the linear subproblem

In solving the linear subproblem (23) one can observe that the objective $\nabla f(z_t)'(z - z_t)$ can be written as $\nabla f(z_t)'z$, since the term $\nabla f(z_t)'z_t$ is constant at a fixed t. Now we can write up the linear subproblem (23) of the dual problem (20) based on the gradient
$$\nabla_\alpha\left(\frac{1}{2}\sum_{u\in U}\langle\alpha_u, K_u\alpha_u\rangle - \sum_{u\in U}\langle\mathbf{1},\alpha_u\rangle\right) = \left(K_u\alpha_u - \mathbf{1}\ \middle|\ u\in U\right), \qquad (29)$$
which is a concatenation of the edge-wise gradients, and then the subproblem reads as
$$\begin{aligned} \min \quad & \sum_{u\in U}\left(K_u\alpha^{(t)}_u - \mathbf{1}\right)'\alpha_u \\ \text{w.r.t.} \quad & \alpha \\ \text{s.t.} \quad & 0 \le \langle\mathbf{1},\alpha_i\rangle \le C_I,\ i\in I, \qquad \alpha \ge 0, \end{aligned} \qquad (30)$$
where the superscript (t) counts the steps of the conditional gradient iteration. Recall that the constraints are independent, and because of the linearity of the objective function the problem in (30) can be cut into further subproblems belonging to each index i. Let the vector of the components of the gradient corresponding to the dual variables $\alpha_i$ be denoted by $\nabla f^{(t)}_i$. Now we need to solve, for all i, the following problems:
$$\begin{aligned} \min \quad & (\nabla f^{(t)}_i)'\alpha_i \\ \text{w.r.t.} \quad & \alpha_i \\ \text{s.t.} \quad & 0 \le \langle\mathbf{1},\alpha_i\rangle \le C_I. \end{aligned} \qquad (31)$$
The optimum solution $\alpha^*_i$ of this problem can be computed in a very simple way with linear complexity.

Proposition 8. The optimum solution of (31) is equal to
$$(\alpha^*_i)_j = \begin{cases} C_I & \text{if } j = \arg\min_k (\nabla f^{(t)}_i)_k \text{ and } (\nabla f^{(t)}_i)_j < 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (32)$$

Proof. Since we have a minimization problem and all components of $\alpha_i$ have to be nonnegative, if all components of the corresponding part of the gradient are nonnegative then the optimum solution is the one containing only 0 components. Now assume that
$$\min_k\, (\nabla f^{(t)}_i)_k < 0, \qquad (33)$$
and let j be an index such that
$$j \in \arg\min_k\, (\nabla f^{(t)}_i)_k. \qquad (34)$$
Note that the gradient $\nabla f^{(t)}_i$ might have several components containing its minimum value, but we choose only one of them. The correctness of the optimum solution can be proved in the following way. First, $\alpha^*_i$ is a feasible solution since it is nonnegative and the sum of its components is equal to $C_I$. It achieves the minimum because
$$\sum_k (\nabla f^{(t)}_i)_k\,(\alpha_i)_k \;\ge\; \min_k (\nabla f^{(t)}_i)_k \underbrace{\sum_k (\alpha_i)_k}_{\le\, C_I} \;\ge\; \min_k (\nabla f^{(t)}_i)_k\, C_I, \qquad (35)$$
where in the last step the negativity of the minimum is exploited. The critical part of the computation is to find the minimum of the components of the gradient, which can be done in linear time in the number of components of the gradient.
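The following Matlab sketch combines the conditional gradient loop of Section 9.2 with the closed-form linear subproblem of Proposition 8. It is an illustrative reimplementation restricted to a single block (one index i) under assumed names, not the thesis' solver.

```matlab
% Conditional gradient for min 0.5*a'*K*a - 1'*a subject to 0 <= sum(a) <= CI, a >= 0.
% K is the PSD kernel block of this index, CI the box constant.
function alpha = conditional_gradient_block(K, CI, maxIter, tol)
    m = size(K, 1);
    alpha = zeros(m, 1);                       % feasible starting point
    for t = 1:maxIter
        g = K * alpha - 1;                     % gradient, as in (29)
        [gmin, j] = min(g);
        s = zeros(m, 1);                       % linear subproblem solution, eq. (32)
        if gmin < 0
            s(j) = CI;
        end
        d = s - alpha;                         % search direction
        gap = -g.' * d;                        % linearized improvement, used as stopping test
        if gap < tol, break; end
        denom = d.' * K * d;
        tau = min(1, max(0, (-g.' * d) / max(denom, eps)));  % line search (27), clipped to (0,1]
        alpha = alpha + tau * d;               % update (24)
    end
end
```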

10 Implementation

This Section gives a short description of the implementation of the theory and of how the prediction results are computed. Matlab was used for the implementation because of its advantages in handling huge matrices and vectors. There are no further technical details given here; the Matlab code is open source and available.


Figure 4: Parameter validation and testing procedure

10.1 Parameter validation and testing procedure

At the start of the procedure the data is split into a future part and a past part, as shown in Figure 4. The past part is split again into a validation training part and a validation test part. With these two parts the parameter selection for the future (test) part is done: a parameter is chosen and tested on the validation test part, after which the parameter is changed and the same procedure is run again.

10.2 Programme flow

In this Section the procedure itself is described: how the different program parts interact and what their purpose is. At the start the data is loaded as described in Section 10.1: the preload function loads the raw data, and the data select function splits it as shown in Figure 4 to enable easier handling of the test and training sets. After the data splitting, the first validation is done with the start parameters, and then the test is run on the test part of the validation data. The results for the best parameters of the validation are saved and the validation function is called again with new parameters. If the results for the new parameters are better than before, the best parameters are replaced by the new ones, and so on. After the computation of the best parameters, the results are computed with the evaluation function.

Figure 5: Programme flow
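A minimal sketch of this parameter search loop. The parameter names (C and sigma) and the scoring function are assumptions of the sketch; run_validation is replaced by a dummy handle here so that the snippet runs standalone, whereas in the real programme it would call the validation function described above.

```matlab
% Grid search over assumed MMR parameters on the validation split.
run_validation = @(C, sigma) 0.6 + 0.1*exp(-(log10(C)-1)^2 - (log2(sigma))^2); % dummy accuracy
Cs     = [0.1 1 10 100];
sigmas = [0.25 0.5 1 2];
bestAcc = -inf;
for C = Cs
    for sigma = sigmas
        acc = run_validation(C, sigma);        % accuracy on the validation test part
        if acc > bestAcc
            bestAcc = acc;  bestC = C;  bestSigma = sigma;
        end
    end
end
fprintf('best C = %g, best sigma = %g (validation accuracy %.3f)\n', bestC, bestSigma, bestAcc);
```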

11 Results

In this section the results of this Thesis are presented in two different ways: for the simple win-or-lose prediction on the one hand, and for playing against the betting companies' odds on the other.

11.1 Measuring the accuracy of the binary prediction

ROC stands for receiver operating characteristic and is a graphical plot of the true positive rate against the false positive rate of a binary classifier. It is used here to show the results of the Thesis for the {+1, -1} prediction. Table 1 defines the outcomes true positive (TP), true negative (TN), false positive (FP) and false negative (FN), and Example 9 shows the corresponding rates for 100 real positives and 100 real negatives.


                               actual value
                            p                n          total
    prediction   p'   True Positive    False Positive    P'
    outcome      n'   False Negative   True Negative     N'
    total              P                N

Table 1: Receiver operator characteristics

The true positive rate, or recall, is computed as
$$TPR = \frac{TP}{TP + FN},$$
the false positive rate as
$$FPR = \frac{FP}{FP + TN},$$
the accuracy as
$$ACC = \frac{TP + TN}{P + N},$$
the precision as
$$\frac{TP}{TP + FP},$$
and the F1 measure as
$$\frac{2\cdot\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$

Example 9 shows, based on the framework presented by Table 1, a case with a set of 100 positives and 100 negatives.

Example 9.
    TP = 63   FP = 28   (91)
    FN = 37   TN = 72   (109)
     P = 100   N = 100
    TPR = 0.63, FPR = 0.28, ACC = 0.68.

For the particular results the accuracy measure will be used. As can be seen in Figure 6, the accuracy of the prediction was between 61% and 72% and tended to go up the longer we were predicting.
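The numbers of Example 9 can be reproduced directly from the confusion counts; a purely illustrative Matlab sketch:

```matlab
% Metrics of Table 1 computed for the counts of Example 9.
TP = 63;  FP = 28;  FN = 37;  TN = 72;
TPR = TP / (TP + FN);                          % recall
FPR = FP / (FP + TN);
ACC = (TP + TN) / (TP + FP + FN + TN);
precision = TP / (TP + FP);
F1 = 2 * precision * TPR / (precision + TPR);
fprintf('TPR=%.2f FPR=%.2f ACC=%.3f F1=%.3f\n', TPR, FPR, ACC, F1);
```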


Figure 6: Accuracy of the prediction with 3 month calculation period

Figure 7: Accuracy of the prediction with 6 month calculation period


Figure 8: Money earned in betting without considering the fee of the betting vendors for 3 month calculation period

11.2 Measuring the prediction of the odds

When doing sports prediction, the goal is normally to win money by placing bets on the matches. In this part the results of the prediction are presented with respect to betting, as described in Section 3. The simplest way to measure the prediction of the odds is to use the money earned per match, so the result tables here show the money you can get per 1 unit staked (e.g. one Euro). The results can be seen in Figures 8 to 11. There are two different calculation methods, one shifting training and test 3 months into the future after every run and one shifting 6 months into the future; for each method there is a calculation with the betting fee (see Section 3) and one without it. If the set result is missing or the information about the outcome cannot be retrieved, no win or loss is calculated and the earning is therefore zero. The evaluation procedure for the earnings can be seen in Figure 12. To compare our results with the results betting companies could achieve, we created a feature that shows how betting companies are able to bet against each other. Figure 13 shows the companies playing against each other with fee, without fee, and the accuracy they achieve.
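A Matlab sketch of one plausible reading of this evaluation (not the thesis code): bet one unit on the player the model favours, accumulate the return at the bookmaker's odds, and optionally rescale the odds to remove the fee of Section 3. All names are illustrative.

```matlab
% pred: +1 when the model picks the listed winner, -1 otherwise;
% oddsW/oddsL: decimal odds on the two players; one entry per match.
function earnings = evaluate_betting(pred, oddsW, oddsL, removeFee)
    if removeFee
        f = 1 ./ oddsW + 1 ./ oddsL - 1;        % per-match overround
        oddsW = oddsW .* (1 + f);               % approximately fee-free odds
        oddsL = oddsL .* (1 + f);
    end
    win = (pred == 1);
    % one unit staked per match: winning bets return (odds - 1), losing bets cost 1
    earnings = sum(win .* (oddsW - 1) - (~win) .* 1);
end
```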


Figure 9: Money earned in betting with considering the fee of the betting vendors for 3 month calculation period

Figure 10: Money earned in betting without considering the fee of the betting vendors for 6 month calculation period


Figure 11: Money earned in betting with considering the fee of the betting vendors for 6 month calculation period

Figure 12: Evaluation of betting odds


Figure 13: Table showing how betting companies would gain against each other

12 Conclusion and Discussion

We introduced a method to predict the outcome of tennis matches: the winner and the corresponding odds, i.e. the chance of each player to win. To this end we apply the maximum margin based regression approach. It is possible to apply the introduced method to other kinds of sport events, like soccer, basketball and many others. The method is not only applicable to sports, but could probably also do well in other fields like matchmaking in online gaming or in financial problems, because these can also be formulated as games. We were able to provide a competitive result in predicting the odds of the games against the betting vendors, see Figure 13. For future work it could be interesting to look at the stock market itself, focusing on currencies, because they depend strongly on each other (Section 1.1).


Symbol: Explanation

A, a, K: matrix A, vector a, constant K.
X: space of the input objects.
Y: space of the output objects.
Hφ: Hilbert space comprising the feature vectors, the images of the input vectors with respect to the embedding φ().
Hψ: Hilbert space comprising the images of the label vectors with respect to the embedding ψ().
W: matrix represented linear operator projecting the feature space Hφ into Hψ.
⟨·,·⟩_Hz, ‖·‖_Hz: inner product and norm defined in the Hilbert space Hz.
tr(W): trace of the matrix W.
dim(H): dimension of the space H.
x1 ⊗ x2: tensor product of the vectors x1 ∈ H1 and x2 ∈ H2; it represents a linear operator A : H2 → H1 which acts on a vector z ∈ H2 as (x1 ⊗ x2)z = (x1 x2')z = x1⟨x2, z⟩_H2 by definition.
⟨A, B⟩_F: Frobenius inner product of the matrix represented linear operators A and B, defined by tr(A'B) = Σ_i Σ_j A_ij B_ij.
‖A‖_F: Frobenius norm of a matrix represented linear operator A, defined by √⟨A, A⟩_F.
A · B: element-wise (Schur) product of the matrices A and B.
A', a': transpose of any matrix A or any vector a.

Table 2: Notation used in the paper

13 Appendix

Table 2 contains most of the symbols used in this Thesis and their meaning.


References

[1] Elo rating system. http://leagueoflegends.wikia.com/wiki/Elo_rating_system.
[2] Matchmaking. http://na.leagueoflegends.com/learn/gameplay/matchmaking, 2010.
[3] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.
[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[5] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[6] R. Herbrich and T. Graepel. TrueSkill(TM): A Bayesian skill rating system, 2006.
[7] M.A. Ghazanfar, A. Prugel-Bennett, and S. Szedmak. Kernel mapping recommender system algorithms. Information Sciences Journal, 2011. Accepted.
[8] M.A. Ghazanfar, S. Szedmak, and A. Prugel-Bennett. Incremental kernel mapping algorithms for scalable recommender systems. In IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Special Session on Recommender Systems in e-Commerce (RSEC), 2011.
[9] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Efficient algorithms for max-margin structured classification. In Predicting Structured Data, pages 105-129, 2007.
[10] J. Rousu, C.J. Saunders, S. Szedmak, and J. Shawe-Taylor. Kernel-based learning of hierarchical multilabel classification models. Journal of Machine Learning Research, Special issue on Machine Learning and Large Scale Optimization, 2006.
[11] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[12] S. Szedmak, Y. Ni, and S.R. Gunn. Maximum margin learning with incomplete data: Learning networks instead of tables. Journal of Machine Learning Research, Proceedings, 11 (Workshop on Applications of Pattern Analysis):96-102, 2010. jmlr.csail.mit.edu/proceedings/papers/v11/szedmak10a/szedmak10a.pdf.
