DEPARTMENT OF ENGINEERING MATHEMATICS · The Squawka dataset is extremely rich and separates the...

UNIVERSITY OF BRISTOL DEPARTMENT OF ENGINEERING MATHEMATICS SUCCESSFUL FOOTBALL PLAYERS Liam White (Engineering Mathematics) Project thesis submitted in support of the degree of Bachelor of Engineering Supervisor: Dr Filippo Simini, Engineering Mathematics April 2015


UNIVERSITY OF BRISTOL

DEPARTMENT OF ENGINEERING MATHEMATICS

SUCCESSFUL FOOTBALL PLAYERS

Liam White (Engineering Mathematics)

Project thesis submitted in support of the degree of Bachelor of Engineering

Supervisor: Dr Filippo Simini, Engineering Mathematics April 2015


EMAT33800

SUCCESSFUL FOOTBALL PLAYERS

April 27, 2015

Project Report submitted in support of

the degree of Bachelor of Engineering

Liam White

University of Bristol

Engineering Mathematics

[email protected]


Abstract

This paper makes use of statistical analysis, data mining and machine

learning tools to quantitatively determine the performance of football

players in the English Premier League (EPL).

Using data from Squawka, originally sourced from football analysis

company Opta, as well as player ratings from Sky Sports, we collate the events

performed by each player and the ratings for each player over the 2013/14

EPL season. The relationship between the events performed by each player

and the rating can be examined using machine learning algorithms, which

can be used to predict a player’s performance rating, based on events

performed by that player. This approach differs from the traditional

points-based system of ranking players.

Examining the relationship between a team’s final points score in the

EPL, versus the average rating of each player in the team over the season,

highlights teams that have underperformed and overperformed relative

to the average performance of their players. We hypothesise that this

overperformance could indicate either the presence of a ‘star player’ in the

team, or more simply, tactical astuteness.

The findings could be useful in the development of quantitative

methods for accurately determining the performance of football players, which

would have wide-ranging implications.


Contents

1 Introduction

2 Data Description

3 Statistics

3.1 Differentiating Top Performers in Each Position

3.2 Best Players in Each Position

3.3 Link between Individual and Team Performance

4 Results

4.1 Feature Analysis

4.2 Unsupervised Learning

4.2.1 Clustering by Events

4.2.2 Clustering by Spherical Coordinates

4.3 Supervised Learning

4.3.1 Individual Matches

5 Discussion and Conclusions

Appendix A Feature Selection

Appendix B Relationships Between Features

Appendix C Differentiating Top Performing Players


1 Introduction

Despite the wealth of data that exists on the subject, the evaluation of a

football player’s performance remains largely qualitative. Our opinions on the

effectiveness of a player are largely derived from those of pundits and journalists,

with little weighting given to quantitative methods. The statistics that are

examined rarely extend beyond goals, assists and passes, with few analysts

seeking to rank players based on multiple characteristics to develop a quantitative

player rating.

That said, some quantitative ranking systems do exist, with ‘fantasy football’ [1]

perhaps being the most famous. However, while the game is highly popular, this

ranking system is rarely referred to outside of the context of the fantasy football

game.

Another data-driven ranking system is that of Squawka [2]. The company

specialises in football statistics and their player rating system, like fantasy

football, is a points-based system where each event in a match is assigned a

points score. However, their algorithm has far greater complexity; every event

has a different score depending on the nature of the event and other parameters

such as its outcome and its location on the pitch, which is split into thirteen

separate zones. Despite its complexity and depth of analysis, however, this is

not something that is used widely in mainstream football analysis and punditry.

The use of statistics is far more prevalent in sports in the United States, where

teams and fans alike are more interested in deeper statistical analysis. Football

also appears to lag behind the more structured, play-by-play sports, namely

American football, cricket, tennis and baseball in its use of data; for the latter,

this was widely publicised in Michael Lewis’ critically acclaimed book Moneyball

[3], which was later adapted into a film. These sports consist of discrete

events that can be analysed in isolation, whereas football, as a continuous game,

has more complex interactions, subject to more variation, which are therefore

more difficult to classify and categorise. Simply put, in football, no two sequences


of play are the same.

The aim here is to develop a method of quantifying a player’s performance

through analysis of a variety of statistics, known as ‘events’, in each game of the

2013/14 EPL season. However, this study differs from Squawka and other

football data analysis projects in that we examine machine learning

methods, as opposed to a points system, for quantifying the

performance of football players.

We focus initially on feature selection, aiming to find the features with high

variance that best differentiate between players and as such provide the strongest

grounding for classifying the performance of one player versus another.

Initially, we try an unsupervised approach, k-means clustering. As no ground

truth exists for a player’s performance rating, we first attempt to identify the

positions of each player, based on a player’s events over the season, in order

to test the method. We use four clusters, one for each position (goalkeepers,

defenders, midfielders and forwards). Following this, we attempt to identify a

performance rating based on a player’s events using k-means.

We also look at a supervised approach, using the k-nearest neighbours algorithm.

This involves training a classifier with a labelled training dataset and using

this classifier to predict the label of each player in the test dataset. As with

the unsupervised approach, this can be used to identify a player’s position or

performance rating.

Analysing the performance of teams relative to that of individual players within

the team can yield some interesting results. The performance of some teams

is highly correlated to the performance of individuals, whereas other teams

significantly outperform or underperform the average rating of their individuals:

we hypothesise that outperformance could be either due to the presence of a key

player in the team, or tactical astuteness.

More accurate, data-driven classification of players’ performance would have

deep implications in football. Most obviously, players and teams can review


performance in more depth, assessing areas for improvement. The experience for

fans can also be improved, with more statistics and analysis giving rise to more

topics for discussion and better informed debate.

There are applications in analysing the style of play of both players and teams,

where successful and unsuccessful styles can be determined. There are also highly

significant implications in the transfer market. With the amount of money in the

sport continually rising and expensive transfer targets often not performing as

predicted, the opportunity to better analyse transfer targets and identify players

that fit into a certain style of play is something that all clubs would benefit from.


2 Data Description

Two datasets are analysed in this paper. The first is from Squawka [2] and

contains event data for every player in every match in the 2013/14 EPL season.

The second dataset is Sky Sports player ratings data, a control which assists in

determining key events that best quantify a player’s performance and enables

the training of a performance classifier.

The Squawka dataset is extremely rich and separates the events that occur in

each match into a variety of different categories. Table 1 shows the categories of

events and sub-events in the data. Many permutations and different combinations

of sub-events exist. For instance, a pass could be an assist and a long ball or a

headed through ball.

Event Type      Action Type  Event Subtype
All passes      Possession   Assist, long ball, through ball, headed
Cards           -            Red, yellow
Clearances      Defence      Headed
Corners         -            Swerve inward, swerve outward, assist
Crosses         -            Assist
Fouls           -            -
Goalkeeping     Defence      Catch, clearance, save
Goal attempts   Attack       Headed, shot, blocked, goal, swerve left, swerve right, off-target, saved
Headed duels    Possession   -
Interceptions   Defence      -
Tackles         Defence      -
Takeons         Attack       -

Table 1: All events and sub-events in the Squawka dataset

This dataset is organised into individual matches, within which there is a player

list and an event list, as well as the date of the match and the teams involved


(home and away). Each individual event has a type and specifies the player

involved, his team, the match and the time at which the event occurred. The

data is incredibly rich, taking into account 592,734 events from 620 different

players in all 380 matches.

The Sky Sports player ratings dataset also contains data from each match in

the 2013/14 EPL season. Within each individual match, a ‘player rating’ and

‘user rating’ is given for each player. The ‘player rating’ is determined by the

Sky Sports pundits and the ‘user rating’ voted upon by fans and users of the

Sky Sports website [4].

Figure 1 shows the distribution of the average Sky Sports player rating for all

players over all games in the season. The mean Sky Sports player rating is

6.31, with a standard deviation of 0.98. Approximating to a normal distribution

suggests that 95% of players have a rating between 4.35 and 8.27.

Figure 1: Histogram showing the distribution of Sky Sports player ratings

Similarly, Figure 2 shows the distribution of the average Sky Sports user rating

for all players over all games in the season. The mean Sky Sports user rating is

slightly lower at 5.80, with a larger standard deviation of 1.25. This higher variance

is to be expected, given that the user ratings are determined by a public vote.

Again, approximating to a normal distribution suggests that 95% of players have


a rating between 3.30 and 8.30. This is a broader range, but note that this range

is shifted to the left, towards the lower ratings, suggesting that fans are harsher

critics than pundits. This could also be partly due to the fact that users’

votes can be biased towards their favourite team.
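The quoted 95% ranges can be reproduced with a short calculation. The sketch below assumes the ranges were obtained as the mean plus or minus two standard deviations (rather than 1.96), since that reproduces the figures in the text; the function name is illustrative only.

```python
def two_sigma_range(mean, std):
    """Approximate 95% range of a normal distribution: mean +/- 2 standard deviations."""
    return round(mean - 2 * std, 2), round(mean + 2 * std, 2)

# Means and standard deviations quoted in the text
player_range = two_sigma_range(6.31, 0.98)  # Sky Sports player (pundit) ratings
user_range = two_sigma_range(5.80, 1.25)    # Sky Sports user (fan) ratings

print(player_range)
print(user_range)
```

Both ranges have almost the same upper bound, so the extra width of the user ratings lies almost entirely at the lower end.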

Figure 2: Histogram showing the distribution of Sky Sports user ratings


3 Statistics

3.1 Differentiating Top Performers in Each Position

In order to effectively classify the performance of players, it is useful to examine

the events that differentiate the top players from the rest. Here, we compare the

top 10% of players in each position to the rest of the players in the same position.

The rankings of players, for this purpose, are determined by the Sky Sports

player ratings and averaged over all games in the season, for each player. We

examine the distributions of individual events for all players in each position,

compared to the top 10% of players in each position.
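As a minimal sketch of this comparison, the snippet below ranks the players in one position by season-average rating, keeps the top 10%, and compares the mean of one event between the two groups. The player records, field names and `top_fraction` helper are hypothetical and are not taken from the Squawka or Sky Sports data.

```python
from statistics import mean

def top_fraction(players, fraction=0.10):
    """Return the highest-rated `fraction` of players (at least one player)."""
    ranked = sorted(players, key=lambda p: p["rating"], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    return ranked[:n_top]

# Hypothetical midfielders: season-average rating and goals scored
midfielders = [
    {"name": f"Player {i}", "rating": 6.0 + 0.1 * i, "goals": i}
    for i in range(10)
]

top = top_fraction(midfielders)
mean_goals_all = mean(p["goals"] for p in midfielders)
mean_goals_top = mean(p["goals"] for p in top)

print(mean_goals_all, mean_goals_top)
```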

Figure 3 compares goals scored by all midfielders, with goals scored by the top

10% of midfielders. While there is not a perfect correlation, there is certainly a

relationship between goals scored and performance rating for midfielders. For

all midfielders, the mean goals scored over the season is 2.24; for the top 10%,

it is 5.81. In the top 10%, there are also only two players who have scored fewer

than three goals, with 50% of players scoring 6 goals or more.

Figure 3: Distribution comparing goals scored by all midfielders and the top

10% of midfielders

Figure 4 shows goals scored by all forwards versus goals scored by the top 10%

of forwards. There is a clear link between goals scored and performance rating

for forwards. The mean number of goals scored, over the course of the season,


for all forwards, is 5.53. No forward in the top 10% scored fewer than 9 goals,

more than one and a half times the mean for all forwards.

Figure 4: Distribution comparing goals scored by all forwards and the top 10%

of forwards

Figure 5 compares clearances made by all defenders with the top 10% of defenders.

The link between clearances and performance rating for defenders is less obvious

than the previous relationships between goals scored and performance rating.

While the mean number of clearances is higher for the top 10% of defenders, 162 versus 113

for all defenders, 50% of top performing defenders made fewer than 120 clearances.

Figure 5: Distribution comparing clearances made by all defenders and the top

10% of defenders

The relationships between two further features and performance rating, for


defenders, midfielders and forwards are represented in Appendix C. This analysis

does not extend to goalkeepers, since there are fewer relevant features for

analysing the performance of goalkeepers.

3.2 Best Players in Each Position

Figure 6 shows a radar graph, comparing the top midfielder and forward to the

average in each position respectively. The top midfielder last season, again by

Sky Sports player ratings, was David Silva. The top forward was Luis Suarez.

The radar graph gives an indication of the areas in which these top players excel

compared to the average in each position. The features are normalised between

zero and one, enabling the radar graph to give an indication of the relative

magnitudes of features. In midfield, the key areas in which top player David

Silva excels are passes and chances created, where he significantly exceeds the

average, followed by goals and shots. In the forward position, Suarez scored the

most goals and took the most shots and created significantly more chances than

the average forward.

Figure 6: Radar graph comparing the highest rated forward and midfielder to

the average in each position

Figure 7 shows the radar graph comparing last season’s top defender, Luke Shaw,

with the average defender. Shaw made the most tackles of all defenders last


season and made significantly more passes than the average defender. Besides

these key areas, he slightly exceeded the average defender in chances created,

clearances and interceptions.

Figure 7: Radar graph comparing the highest rated defender to the average

defender

3.3 Link between Individual and Team Performance

Figure 8 shows the relationship between Sky Sports player rating, averaged

over each player in every team for all matches in the season, and final team

points at the end of the season. This can be seen as a proxy for whether a team

underperformed or outperformed over the course of the season. The trend line

on Figure 8 shows the expected final team points, based on average Sky Sports

player rating. A team below the trend line is deemed to have underperformed

relative to their average player performance, whereas a team above the trend

line is deemed to have scored more points than would be expected, given the

team’s average Sky Sports player rating.
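The trend-line comparison can be sketched as a least-squares fit followed by a residual for each team. The team names and figures below are hypothetical, and the sketch assumes that the distance from the trend line is measured vertically (in points); the thesis does not state how its trend line was fitted.

```python
def fit_trend_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical teams: (average player rating, final points)
teams = {
    "Team A": (6.0, 40),
    "Team B": (6.2, 50),
    "Team C": (6.4, 75),
    "Team D": (6.6, 70),
}

ratings = [rating for rating, _ in teams.values()]
points = [pts for _, pts in teams.values()]
slope, intercept = fit_trend_line(ratings, points)

# Positive residual: more points than the average rating predicts (above the line)
residuals = {
    name: pts - (slope * rating + intercept)
    for name, (rating, pts) in teams.items()
}
best = max(residuals, key=residuals.get)
print(best, round(residuals[best], 2))
```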


Figure 8: Graph showing the link between average player rating and total team

points in the 2013/14 EPL

We hypothesise that two possible explanations may exist for a team significantly

outperforming relative to average player rating. The first explanation is the

presence of a ‘star player’ who significantly outperforms all other players in the

team. The second, quite simply, is tactical astuteness, enabling a team to perform

beyond the sum of its parts.

The three teams with the largest Euclidean distance above the trend line are

Tottenham, Manchester United and Liverpool. Looking at the standard deviation

of Sky Sports player ratings within each team gives an indication as to the spread

of player performance within the team; the presence of a key player could increase

standard deviation. The mean of the standard deviations across all teams is

0.97. The standard deviation for Tottenham is also 0.97; however, the figures for

Manchester United and Liverpool are 1.03 and 1.07 respectively, considerably

exceeding the mean. Ranking teams by standard deviation puts Liverpool 3rd,

Manchester United 5th and Tottenham 8th out of the twenty teams in the league.

We take a closer look at the distribution of average ratings for the teams in

question. For Liverpool, Luis Suarez has an average player rating of 7.60, which

significantly exceeds that of second ranked player Gerrard, whose player rating

is 7.21. This difference of 0.39 rating points is the largest difference between any

two player ratings at the club.
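One way to read this comparison is as the largest gap between consecutively ranked players within a squad. The sketch below follows that reading, which is an assumption; only the Suarez and Gerrard ratings come from the text, and the other two entries are hypothetical placeholders.

```python
def largest_gap(ratings):
    """Largest rating gap between consecutively ranked players.

    Returns (gap, higher-ranked player, lower-ranked player).
    """
    ranked = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
    gaps = [
        (round(upper[1] - lower[1], 2), upper[0], lower[0])
        for upper, lower in zip(ranked, ranked[1:])
    ]
    return max(gaps)

squad = {
    "Suarez": 7.60,    # from the text
    "Gerrard": 7.21,   # from the text
    "Player C": 7.05,  # hypothetical
    "Player D": 6.90,  # hypothetical
}

gap, upper, lower = largest_gap(squad)
print(gap, upper, lower)
```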


Moving on to Manchester United, Wayne Rooney is the highest ranked player,

with an average player rating of 7.31, which significantly exceeds that of second

ranked player Adnan Januzaj, who has an average player rating of 6.96. This

difference of 0.35 points is the second largest difference between any two player

ratings at the club.

Finally, we look at Tottenham. Joint highest ranked players are Christian Eriksen

and Emmanuel Adebayor, with ratings of 7.0. These ratings are significantly

higher than that of third ranked player Paulinho, who averages 6.61. This

difference of 0.39 points is the largest difference in rating between any two

players at the club.

These results, for the three clubs with the highest outperformance relative to Sky

Sports average player ratings, provide strong evidence to suggest that significant

outperformance is explained by the presence of a ‘star player’ in the team,

or in the case of Tottenham, two ‘star players’ in the team, who significantly

outperform their team mates.


4 Results

4.1 Feature Analysis

With a huge variety of events present in the data, as shown previously in Table

1, some initial intuition is necessary to broadly decide which events to examine.

Then, from this initial subset of events, redundant features can be eliminated and

the best features, which give the most information, retained. Feature analysis

involves choosing the most appropriate subset of features in order to build a

model to make predictions; in this case, on a player’s performance.

Whilst the aim is to classify performance, there exists no ground truth player

performance rating. Therefore, it is necessary to test methods on another

characteristic, for which a ground truth does exist, in order to compare the

classifier’s prediction to known values and determine the effectiveness of a method.

For this purpose, we use a player’s position.

Before selecting features, however, it is necessary to pre-process the data. Figure

9 shows the distribution of events per player, for all players. Clearly, there are a

large number of players that have performed relatively few events. These are

the players that appear largely as substitutes, rarely starting games. There is

not enough data to classify these players and it is necessary to remove them, as

their presence could increase inaccuracies in the classifier.

Figure 9: Histogram showing the distribution of events per player


On average, 27 players play in each match. This corresponds to 22

starters (11 per team) and five substitutes; each team uses on average 2.5 out

of a maximum of three substitutes. The bin highlighted in red in Figure 9 is

the tail that has been removed, and this corresponds roughly to 5/27 of the data:

the substitutes. Before eliminating the substitutes, the mean number of events

per player is 1958; after elimination, the mean number is 2404, illustrating the

manner in which substitutes skew the data.
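The pre-processing step amounts to dropping players below an event-count cut-off and recomputing the mean. In the sketch below the cut-off `MIN_EVENTS`, the player names and the counts are all hypothetical; the thesis removes the lowest histogram bin rather than stating a numeric threshold.

```python
from statistics import mean

MIN_EVENTS = 500  # hypothetical cut-off standing in for the removed histogram bin

# Hypothetical season event counts per player
events_per_player = {
    "Player A": 2500,
    "Player B": 2300,
    "Player C": 150,   # mostly a substitute
    "Player D": 2400,
    "Player E": 90,    # mostly a substitute
}

def drop_low_activity(counts, min_events):
    """Keep only players with at least `min_events` recorded events."""
    return {name: n for name, n in counts.items() if n >= min_events}

kept = drop_low_activity(events_per_player, MIN_EVENTS)
mean_before = mean(events_per_player.values())
mean_after = mean(kept.values())
print(mean_before, mean_after)
```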

We choose a broad group of features, before using feature selection to find an

appropriate subset. This group of features consists of goals, shots, chances

created, successful passes, successful tackles, clearances, interceptions and saves.

In order to determine the features which best characterise a player’s position,

we look at each event individually and take a histogram of the frequency of

occurrence of each event for the players in each position (goalkeepers, defenders,

midfielders and forwards), separately.

Joining the peaks of each bin to create a line graph and overlaying the plots for

each position gives a visual representation of how each feature varies between

players in each position. Better features have less overlap between positions and

can distinguish between groups of players effectively.

Since the distributions of events in each position vary considerably, it is difficult to

approximate to a particular probability distribution in order to use a quantitative

rule, such as a distance threshold between distributions, for feature selection.

For this reason, combined with the fact that the events distributions can easily

be analysed visually, we use qualitative feature analysis.

Figure 10 shows the distribution of clearances for players in each position and

this is an example of a good feature. The lines for each position, representing the

peaks of each bin in the histogram, are distinct with no overlap, indicating that

the feature differentiates between players effectively. Similarly, saves (Figure 11) is an

essential feature, since it distinguishes goalkeepers from all other positions.


Figure 11: Distribution of saves for players in each position

Figure 10: Distribution of clearances for players in each position

Figure 12, the distribution of successful tackles, is an example of a feature which

gives less information. There are overlaps between all outfield players and as

such, the feature is less likely to assist in differentiating between players in

different positions.


Figure 12: Distribution of successful tackles for players in each position

The graphs for some of the remaining events are shown in Appendix A. Based

on these, we decide that the best features are passes, goals, saves, clearances,

chances created and interceptions. Here, we look at the results of the clustering

using the selected features.

4.2 Unsupervised Learning

4.2.1 Clustering by Events

Initially, we look at using an unsupervised learning approach to clustering. This

involves using the features to find clusters of similar players in the data, without

using any training labels.

It is necessary to normalise all the features, as the magnitudes vary significantly

for each feature. For example, passes has the largest magnitude by some distance

for most players, ranging from 64 to 1870. Goals, however, varies from 0 to 31.

Rescaling ensures that an approximately equal weight is placed on each feature.
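A minimal sketch of the rescaling, assuming min-max normalisation to the range zero to one (the thesis does not name the scheme used for clustering, so this choice is an assumption). The end points of each list are the ranges quoted above; the middle values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

passes = [64, 967, 1870]  # 64 and 1870 are the quoted extremes
goals = [0, 10, 31]       # 0 and 31 are the quoted extremes

scaled_passes = min_max_scale(passes)
scaled_goals = min_max_scale(goals)
print(scaled_passes)
print(scaled_goals)
```

After rescaling, a player at the top of the goals range and a player at the top of the passes range both sit at 1.0, so neither feature dominates the squared distances used by k-means.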

We use k-means for the unsupervised cluster analysis. K-means aims to split

n data points into k clusters, by minimising the within-cluster sum of squared

distances between each data point and the mean of the cluster. The k means are

initialised, often to random values, and then the algorithm proceeds iteratively

in two steps. In the first step, each data point is assigned to the cluster whose

mean is nearest, which minimises the within-cluster sum of squares; in the second

step, the means are updated according to the new cluster membership. When the

cluster assignment ceases to change, the algorithm terminates, having reached a

locally optimal configuration [5].
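The two-step iteration can be sketched in plain Python as follows. This is an illustrative implementation, not the code used in the thesis; the function names, the optional `initial_means` argument and the toy data are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, initial_means=None, max_iter=100, seed=0):
    """Cluster `points` into k groups; returns (assignment, means)."""
    if initial_means is None:
        # Random initialisation: pick k distinct data points as starting means
        means = random.Random(seed).sample(points, k)
    else:
        means = list(initial_means)
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign every point to the cluster with the nearest mean
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing: local optimum reached
        assignment = new_assignment
        # Step 2: recompute each mean from its current members
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # leave the mean of an empty cluster unchanged
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, means

# Toy data: two well-separated groups in a two-feature space
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
assignment, means = k_means(points, k=2, initial_means=[(0.0, 0.0), (1.0, 1.0)])
print(assignment)  # -> [0, 0, 0, 1, 1, 1]
```

Because the result depends on the starting means, random initialisation can settle in a poorer local optimum; running the algorithm from several initialisations and keeping the solution with the lowest within-cluster sum of squares is a common safeguard.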

K-means takes the number of clusters as a parameter and here we use four

clusters, since there are four possible positions. Analysing the membership of

the k-means predicted clusters with the known positions shows that players are

not clustered distinctly by position; however, analysing the membership of each

cluster yields some interesting results.

Figure 13 shows the distribution of player positions in each cluster (1,2,3,4).

Cluster 3 is made up entirely of goalkeepers, as would be expected, since one of

the features is saves and this is zero for any non-goalkeepers. This cluster has

23 members. Cluster 4 contains a heterogeneous mixture of all positions and

has a membership of 199 players. However, clusters 1 and 2 are the most interesting,

since cluster 1 contains predominantly forwards and midfielders and cluster 2

contains predominantly defenders and midfielders. Cluster 1 has a membership

of 93 players and cluster 2 has a membership of 111.


Figure 13: Distribution of player positions within each cluster: (a) Cluster 1, (b) Cluster 2, (c) Cluster 3, (d) Cluster 4

Whereas the midfield used to be a fairly distinct position, with most

teams playing a 4-4-2 formation, this is no longer the case in the modern game.

Now, midfielders are largely classified as defensive midfielders who sit in front

of the defence, or attacking midfielders who interact closely with the strikers,


with many teams adopting a 4-2-3-1 formation, with two defensive and three

attacking midfielders. In fact, some analysts group ‘traditional strikers’ and

attacking midfielders together as forwards, indicating the increasing fluidity in

the midfielder’s role.

To illustrate this, Table 2 splits the midfielders in clusters 1 and 2 into defensive

and attacking midfielders. This is determined by assessing the formations and

starting positions from Sky Sports [4] for each player in every team over the

course of the season. A player who starts the majority of matches in a holding

midfield role is classified as a defensive midfielder, whereas an attacking midfielder

is one who starts the majority of matches in an attacking midfield role.

As expected, cluster 2 contains 20 defensive midfielders, with just 3 attacking

midfielders. Cluster 1 contains 17 attacking midfielders and 6 defensive

midfielders. This confirms the dynamic nature of the role of the midfielder in the

modern game.


Cluster 2 (Sample)        Cluster 1 (Sample)
Midfielder    Emphasis    Midfielder      Emphasis
Arteta        Defensive   Barkley         Attacking
Bacuna        Attacking   Bolasie         Attacking
Barry         Defensive   Willian         Attacking
Britton       Attacking   Brunt           Attacking
Carrick       Defensive   Chadli          Attacking
Delph         Defensive   Cleverley       Defensive
Elmohamady    Attacking   Cork            Defensive
Gerrard       Defensive   Dyer            Attacking
Henderson     Defensive   Howson          Attacking
Huddlestone   Defensive   Kagawa          Attacking
Jedinak       Defensive   Kim Bo-Kyung    Attacking
Leiva         Defensive   Lallana         Attacking
Fernandinho   Defensive   Meyler          Attacking
McCarthy      Defensive   James Morrison  Attacking
Medel         Defensive   N’Zonzi         Defensive
Mulumbu       Defensive   Nolan           Attacking
Noble         Attacking   Osman           Attacking
Parker        Defensive   Ramires         Defensive
Richardson    Defensive   Silva           Attacking
Schneiderlin  Defensive   Townsend        Attacking
Sidwell       Defensive   Whittingham     Attacking
Tiote         Defensive   Wilshere        Defensive
Whelan        Defensive   Yacob           Defensive

Table 2: Playing style of midfielders in clusters 1 and 2

4.2.2 Clustering by Spherical Coordinates

In addition to this method of clustering by using a set of features for each player, another option is to cluster according to spherical coordinates: the ‘angles’ between events. This corresponds to a change of coordinates in the feature space, from Cartesian to spherical coordinates, so that each player is characterised by the angles between the player’s feature vector and the axes. Appendix B shows some of the relationships between different features (absolute and angular).

Converting from the Cartesian system involves calculating the radius, r, which is the Euclidean distance from the origin to each point (one for each player) in the feature space, as shown in (1), where x1 ... xn are the Cartesian coordinates, one per feature. The angular coordinates, φ, are defined in (2)...(5); for n Cartesian coordinates, there exist n − 1 angles φ.

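\[ r = \sqrt{x_n^2 + x_{n-1}^2 + \dots + x_2^2 + x_1^2} \tag{1} \]

\[ \phi_1 = \arccos \frac{x_1}{\sqrt{x_n^2 + x_{n-1}^2 + \dots + x_1^2}} \tag{2} \]

\[ \phi_2 = \arccos \frac{x_2}{\sqrt{x_n^2 + x_{n-1}^2 + \dots + x_2^2}} \tag{3} \]

\[ \vdots \]

\[ \phi_{n-2} = \operatorname{arccot} \frac{x_{n-2}}{\sqrt{x_n^2 + x_{n-1}^2}} \tag{4} \]

\[ \phi_{n-1} = 2 \operatorname{arccot} \frac{x_{n-1} + \sqrt{x_n^2 + x_{n-1}^2}}{x_n} \tag{5} \]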
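As an illustration of the change of coordinates in (1) to (5), the following is a minimal sketch rather than the code used in this study. It makes two assumptions that should be stated plainly: every angle except the last is computed with the arccos form of (2) and (3), which agrees with the arccot form of (4) wherever both are defined, and the last angle is computed with `atan2`, which stands in for the 2 arccot form of (5) and also handles a zero final coordinate. An angle whose denominator is zero is undefined; it is set to 0 here purely by convention.

```python
import math


def to_spherical(x):
    """Convert a Cartesian vector (n >= 2 features) to (r, [phi_1, ..., phi_{n-1}])."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two Cartesian coordinates")

    # tail[i] = sqrt(x[i]^2 + x[i+1]^2 + ... + x[n-1]^2)
    tail = [0.0] * n
    running = 0.0
    for i in range(n - 1, -1, -1):
        running += x[i] * x[i]
        tail[i] = math.sqrt(running)

    angles = []
    for i in range(n - 2):
        if tail[i] == 0.0:
            angles.append(0.0)  # assumption: undefined angle, use 0 by convention
        else:
            ratio = max(-1.0, min(1.0, x[i] / tail[i]))  # guard against rounding
            angles.append(math.acos(ratio))

    # Final angle in [0, 2*pi); atan2 stands in for the 2*arccot form of (5).
    angles.append(math.atan2(x[n - 1], x[n - 2]) % (2.0 * math.pi))
    return tail[0], angles


r, phi = to_spherical([0.0, 3.0, 4.0])
```

Because the angles do not depend on r, two players whose event counts differ only by a common scale factor receive the same angular coordinates.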

Running k-means on the spherical coordinates derived from all the selected

features, for each player, produces similar clustering to the previous method

(using the feature set), but the results are more pronounced, as shown in Fig.

14.

As before, there is a cluster that contains solely goalkeepers (cluster 3), a cluster that contains predominantly defenders and midfielders (cluster 4) and a cluster that contains predominantly midfielders and forwards (cluster 1). However, with this method, the clusters are purer: there is only 1 defender in cluster 1, compared to 75 forwards and 64 midfielders, and just 2 forwards in cluster 4, compared to 44 defenders and 108 midfielders. This result suggests that the relative frequency of events (as measured by the spherical angles), rather than their absolute number, is a better quantity to predict a player’s position.

Whilst the results are largely positive, the clusters are not pure enough to

suggest that this method would be accurate for rating players. As such, we use

a supervised learning algorithm to classify ratings.

Figure 14: Distribution of player positions within each cluster (angular coordinates). Panels: (a) Cluster 1, (b) Cluster 2, (c) Cluster 3, (d) Cluster 4.

4.3 Supervised Learning

Supervised learning involves separating labelled data into a training and test

set, using the training set to train a classifier, which can be used to predict a

label for the test set. In this case, we again look at classifying the positions of

players before classifying performance.

Here, we use the k-nearest neighbours algorithm for supervised learning. In k-nearest neighbours classification, the distance between each test data point and every training data point is calculated, and the k closest training points are selected. The class label of the test data point is determined by a majority vote among these k nearest neighbours [6].
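The voting rule can be made concrete with a short sketch. This is an illustrative implementation under stated assumptions, not the code used in this study: Euclidean distance is assumed as the metric, the function name and the toy feature vectors are hypothetical, and a tied vote is resolved in favour of the label encountered first among the nearest neighbours.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Rank every training point by Euclidean distance to the query.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up scaled feature vectors with position labels.
train_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["Defender", "Defender", "Forward", "Forward"]
print(knn_predict(train_X, train_y, (0.2, 0.1), k=3))  # Defender
```

The same function applies when the labels are ratings rather than positions, provided the ratings are treated as discrete classes.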

The number of nearest neighbours is required as a parameter, and we also vary

the size of the training set in order to find the optimum, which best classifies

the test set. In the first case, for classifying player position, the training data

consists of a scaled feature vector for each player, as well as the player position

as a label. Then, once the classifier is trained, given the feature set for the test

data, the classifier can predict the player’s position.

This works similarly for classifying performance, except rather than giving the

player’s position as a training label, the average Sky Sports player rating for

that player, over the course of the EPL 2013/14 season is given as a training

label. For the test set, the classifier predicts the player’s rating based on the

feature set given.

By using supervised learning, we can measure the performance of the classifier, since we can compare the predicted output against the known labels of the test data. In the case of player position, we examine whether the position predicted by the classifier is the same as the actual position. For performance, a predicted rating within 0.5 points (ratings are given out of 10) of the Sky Sports player rating is considered a success; a prediction outside this range is considered a failure.
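The success criterion for ratings amounts to a tolerance-based accuracy. The sketch below shows one way to compute it; the function name is hypothetical, and treating a difference of exactly 0.5 as a success is an assumption, since the text does not say whether the boundary is inclusive.

```python
def rating_accuracy(predicted, actual, tolerance=0.5):
    """Fraction of predicted ratings that lie within `tolerance` of the actual rating."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and of equal length")
    # Assumption: the boundary itself (a difference of exactly `tolerance`) is a success.
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(actual)


# Made-up ratings: the differences are 0.4 (success), 0.7 (failure) and 0.5 (success).
accuracy = rating_accuracy([6.0, 7.2, 5.0], [6.4, 6.5, 5.5])
```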

In order to truly test the accuracy of the classifier and to give an idea of whether

it will work on an independent data set, we use cross validation. Here, we

use a loop to shuffle the data and then split the data into training and test

datasets. This is performed 1000 times in order to create different training and

test datasets. The model also varies the number of nearest neighbours from 2 to

4 for positions and 1 to 3 for ratings and the size of the training set from 10%

to 50%, in order to optimise the classifier. The results of each permutation are

averaged and displayed graphically.

Error bars are displayed on each of the classification accuracy graphs. Each

time the data set is shuffled, the accuracy is stored; the mean is shown on the

accuracy graphs and the error bar shows one standard deviation each side of the

mean. A large standard deviation suggests high sensitivity to the training data.
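The shuffle-and-split loop and the error bars can be sketched as follows. The procedure described here is often called repeated random sub-sampling (or Monte Carlo cross-validation). This is a minimal illustration, not the code used in this study: the function name, the `evaluate` callback and the fixed seed are hypothetical, and the population standard deviation is assumed for the error bar.

```python
import random
import statistics


def repeated_holdout(samples, evaluate, train_fraction=0.5, repeats=1000, seed=0):
    """Shuffle, split and score `repeats` times; return (mean, standard deviation).

    `evaluate(train, test)` is a hypothetical callback that trains on `train`
    and returns the accuracy achieved on `test`.
    """
    data = list(samples)
    if len(data) < 2:
        raise ValueError("need at least two samples to form a split")
    rng = random.Random(seed)
    # Keep at least one sample on each side of the split.
    n_train = min(len(data) - 1, max(1, round(train_fraction * len(data))))
    scores = []
    for _ in range(repeats):
        rng.shuffle(data)
        scores.append(evaluate(data[:n_train], data[n_train:]))
    return statistics.mean(scores), statistics.pstdev(scores)
```

Calling this once for each combination of training fraction (10% to 50%) and number of nearest neighbours yields one mean and one error bar per point on the accuracy graphs.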

We start with classifying player positions and Figure 15 shows the accuracy of

the classifier in predicting the positions of each player. Accuracy increases as the

percentage of training data and the number of nearest neighbours are increased.

The optimum accuracy achieved is 82.0%, with 50% training data and three

nearest neighbours.

Figure 15: Classification accuracy for player positions as parameters varied

We can use the same method to classify player performance; Figure 16 shows the classifier accuracy. As before, the number of nearest neighbours used and the percentage of training data are varied in order to find the optimum configuration. Again, a 50% training set produces the best results, but for classifying player performance, one nearest neighbour is optimum. Crucially, this configuration of 50% training data and one nearest neighbour has a high success rate of 85.0%.

With just 10% training data and k = 1, a similar level of accuracy is achieved:

82.7%. For the k-nearest neighbours algorithm, when k=1, the variance of

the classifier is highest. The fact that k=1 works successfully in this situation

suggests that the most nearby examples, i.e. players with the most similar

feature sets, have very similar ratings.

Figure 16: Classification accuracy for player ratings as parameters varied

In order to test the significance of the results, we run the classifier again, but rather than comparing the rating predicted by the classifier to the actual Sky rating, we reshuffle the Sky ratings and compare. This simulates random guessing. Figure 17 shows the classifier accuracy for random guessing, comparing test data with shuffled Sky ratings. Clearly, accuracy is significantly lower than for the classifier when compared to the actual Sky player rating: 66.3% versus 85.0%. With random guessing, the standard deviation is also significantly higher, rising from an average of 3.4 in the actual classification to 9.1, suggesting high sensitivity to training data.
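The shuffled-label comparison can be sketched as below. This is an illustrative baseline under stated assumptions rather than the code used in this study: the function name and the fixed seed are hypothetical, the 0.5-point tolerance is treated as inclusive, and the score is averaged over repeated shuffles.

```python
import random


def shuffled_baseline(predicted, actual, tolerance=0.5, repeats=1000, seed=0):
    """Mean tolerance-based accuracy of `predicted` against randomly shuffled `actual` ratings."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and of equal length")
    rng = random.Random(seed)
    shuffled = list(actual)
    total = 0.0
    for _ in range(repeats):
        rng.shuffle(shuffled)  # break the link between each player and their rating
        hits = sum(1 for p, a in zip(predicted, shuffled) if abs(p - a) <= tolerance)
        total += hits / len(shuffled)
    return total / repeats
```

A baseline well above zero is to be expected whenever many ratings lie within the tolerance of one another, so the comparison of interest is the gap between the real accuracy and this baseline rather than the baseline alone.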

Figure 17: Classification accuracy for randomly shuffled Sky ratings

Since cross validation is used, it is reasonable to assume that this method could

work for other leagues, provided there are enough games in a season.

Since using spherical coordinates worked successfully in clustering, we attempt to use spherical coordinates for classification instead of using scaled features. The spherical coordinates correspond to the angles between the features, as opposed to the magnitudes of the features. Figure 18 shows the classification accuracy using this method. The results are not as accurate as those achieved previously using scaled features: the maximum accuracy is 75.6% with two nearest neighbours and 40% training data. This result suggests that the magnitude of events, rather than the relative frequency of events, is a better quantity to predict a player’s performance rating.

Figure 18: Classification accuracy for player ratings as parameters varied, using

angular coordinates

4.3.1 Individual Matches

In order to test the classifier further, we examine how well it classifies the

performance of players in one particular match. This is a tough task, since the

variance is significantly higher in an individual match than over the course of a

season. Additionally, with significantly fewer events in a single match than over

the course of a season, event sets for each player are sparse and it is harder to

determine a player performance rating.

Figure 19 shows the performance of the classifier on a match between Manchester

United and Chelsea and Figure 20 shows the classifier’s performance on a match

between Stoke City and Hull City.

Figure 19: Classification accuracy for a single match: Manchester United vs

Chelsea

The classifier does not work as effectively for individual matches. Players’

performances are not consistent throughout a season and the feature set for a

player in an individual match is often rather empty, making classification difficult.

In the Manchester United vs Chelsea game, the classifier is 52.8% successful

using two nearest neighbours and a 50% training set. In the Stoke City vs Hull

City match, the classifier is 58.0% successful with three nearest neighbours and

a 50% training set.

Figure 20: Classification accuracy for a single match: Stoke City vs Hull City

Lastly we examine the effect of relaxing the margin for an accurate classification

to within 1.0 points of the Sky rating (as opposed to 0.5), for the Stoke City vs

Hull City match. Classification accuracy increases to a peak of 88.4% with three

nearest neighbours and either 30% or 50% training data.

Figure 21: Classification accuracy for a single match: Stoke City vs Hull City,

using a 1.0 point interval

The classifier works very well for a large dataset, which takes averages across a

season, but some further tuning is necessary in order for it to work accurately on

individual matches. However, since cross validation has been used, this classifier

should work on an unseen test dataset of player features, over the course of a

season, for another league or for other seasons of the EPL.

5 Discussion and Conclusions

Machine learning techniques offer an interesting and viable alternative to the

traditional points-based system of ranking football players. Over the course of a

season, k-nearest neighbours can accurately predict a player’s performance rating.

The ability to accurately predict a player’s performance rating has a variety

of useful implications. Fans and pundits can analyse matches more thoroughly

and teams can give better feedback to their players and analyse opponents with

greater accuracy.

However, when analysing a single game in isolation, the performance prediction is

less accurate, since players do not perform uniformly throughout the season and

player ratings in a single match have higher variance than over the course of a

season. This is an area in which further work is needed, since the ability to predict

performance in a single match would enable analysis of individual performances

on a match-by-match basis. This is an advantage that the traditional, points

based scoring system of ranking players has over machine learning.

There are a number of improvements that could be made and next steps that

could be taken, in order to further this analysis. Training the classifier using

another set of player ratings, as labels, would test the classifier’s accuracy further.

Since no ground truth exists for the performance rating of a player, it could be

useful to examine different player performance ratings for use in training the

classifier.

The use of a quantitative method of feature selection, such as using the Hellinger

or Kolmogorov-Smirnov distance to measure the distance between distributions,

could improve classification accuracy. A qualitative method has been used here,

since it is difficult to approximate the distributions of events, across different

positions, to a probability distribution. More accurate feature selection could, in

particular, improve the classifier accuracy when analysing individual matches,

where there is significant room for improvement.
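To illustrate the two distances mentioned above, the sketch below computes the Hellinger distance between two histograms over the same bins and the two-sample Kolmogorov-Smirnov statistic between two samples. It is not part of the analysis performed in this study; the function names are hypothetical, and how a distance would be thresholded to select a feature is left open.

```python
import math
from bisect import bisect_right


def hellinger(p, q):
    """Hellinger distance between two histograms on the same bins (0 = identical, 1 = disjoint)."""
    sp, sq = sum(p), sum(q)
    if sp <= 0 or sq <= 0 or len(p) != len(q):
        raise ValueError("histograms must share bins and have positive total mass")
    # Normalise the counts to probabilities, then compare square roots bin by bin.
    return math.sqrt(0.5 * sum((math.sqrt(a / sp) - math.sqrt(b / sq)) ** 2
                               for a, b in zip(p, q)))


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The gap between two step functions can only change at an observed value.
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)
```

For feature selection, an event whose distribution differs strongly between positions (a distance close to 1) would be a more discriminative candidate than one whose distributions nearly coincide.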

Finally, whilst results obtained from this study are significant, there is a limitation

in that only one league over the period of one season is considered. Cross

validation is used to test the applicability of the classifier to other

datasets; however, in order to fully analyse the feasibility of using machine

learning techniques more widely to classify player performance, other seasons,

certainly, and ideally other leagues, must be considered.

List of Figures

1 Histogram showing the distribution of Sky Sports player ratings
2 Histogram showing the distribution of Sky Sports user ratings
3 Distribution comparing goals scored by all midfielders and the top 10% of midfielders
4 Distribution comparing goals scored by all forwards and the top 10% of forwards
5 Distribution comparing clearances made by all defenders and the top 10% of defenders
6 Radar graph comparing the highest rated forward and midfielder to the average in each position
7 Radar graph comparing the highest rated defender to the average defender
8 Graph showing the link between average player rating and total team points in the 2013/14 EPL
9 Histogram showing the distribution of events per player
10 Distribution of clearances for players in each position
11 Distribution of saves for players in each position
12 Distribution of successful tackles for players in each position
13 Distribution of player positions within each cluster
14 Distribution of player positions within each cluster (angular coordinates)
15 Classification accuracy for player positions as parameters varied
16 Classification accuracy for player ratings as parameters varied
17 Classification accuracy for randomly shuffled Sky ratings
18 Classification accuracy for player ratings as parameters varied, using angular coordinates
19 Classification accuracy for a single match: Manchester United vs Chelsea
20 Classification accuracy for a single match: Stoke City vs Hull City
21 Classification accuracy for a single match: Stoke City vs Hull City, using a 1.0 point interval
22 Distribution of goals for players in each position
23 Distribution of shots for players in each position
24 Distribution of chances created by players in each position
25 Relationship between the magnitudes of passes and goals; colour coordinated by cluster: cluster 1: dark blue, cluster 2: light blue, cluster 3: yellow, cluster 4: red
26 Relationship between the magnitudes of saves and tackles; colour coordinated by cluster: cluster 1: dark blue, cluster 2: light blue, cluster 3: yellow, cluster 4: red
27 Relationship between angular coordinates for clearances and tackles
28 Relationship between angular coordinates for goals and shots
29 Distribution comparing passes made by all midfielders and the top 10% of midfielders
30 Distribution comparing chances created by all midfielders and the top 10% of midfielders
31 Distribution comparing chances created by all forwards and the top 10% of forwards
32 Distribution comparing shots taken by all forwards and the top 10% of forwards
33 Distribution comparing interceptions made by all defenders and the top 10% of defenders
34 Distribution comparing tackles made by all defenders and the top 10% of defenders

List of Tables

1 All events and sub-events in the Squawka dataset
2 Playing style of midfielders in clusters 1 and 2

References

[1] Premier League. (2015). Rules: Scoring. Available: http://fantasy.premierleague.com/rules/. Last accessed 24th April 2015.

[2] Squawka. (2015). Squawka: Football Stats, Live Match Data and Player Statistics. Available: http://www.squawka.com/. Last accessed 24th April 2015.

[3] Lewis, M (2004). Moneyball. New York: W. W. Norton & Company.

[4] Sky. (2015). Sky Sports Football. Available: http://www1.skysports.com/football/. Last accessed 24th April 2015.

[5] MacKay, J (2003). Information Theory, Inference, and Learning Algorithms. 4th ed. Cambridge: Cambridge University Press.

[6] Beyer, K et al. (1999). When Is “Nearest Neighbor” Meaningful? Database Theory – ICDT’99. 217-220.

Appendix A Feature Selection

Figure 22: Distribution of goals for players in each position

Figure 23: Distribution of shots for players in each position

Figure 24: Distribution of chances created by players in each position

Appendix B Relationships Between Features

Figure 25: Relationship between the magnitudes of passes and goals; colour

coordinated by cluster: cluster 1: dark blue, cluster 2: light blue, cluster 3:

yellow, cluster 4: red

Figure 26: Relationship between the magnitudes of saves and tackles; colour

coordinated by cluster: cluster 1: dark blue, cluster 2: light blue, cluster 3:

yellow, cluster 4: red

Figure 27: Relationship between angular coordinates for clearances and tackles

Figure 28: Relationship between angular coordinates for goals and shots

Appendix C Differentiating Top Performing Players

Figure 29: Distribution comparing passes made by all midfielders and the top

10% of midfielders

Figure 30: Distribution comparing chances created by all midfielders and the

top 10% of midfielders

Figure 31: Distribution comparing chances created by all forwards and the top

10% of forwards

Figure 32: Distribution comparing shots taken by all forwards and the top 10%

of forwards

Figure 33: Distribution comparing interceptions made by all defenders and the

top 10% of defenders

Figure 34: Distribution comparing tackles made by all defenders and the top

10% of defenders
