UNIVERSITY OF BRISTOL
DEPARTMENT OF ENGINEERING MATHEMATICS
SUCCESSFUL FOOTBALL PLAYERS
Liam White (Engineering Mathematics)
Project thesis submitted in support of the degree of Bachelor of Engineering
Supervisor: Dr Filippo Simini, Engineering Mathematics
April 2015
EMAT33800
SUCCESSFUL FOOTBALL PLAYERS
April 27, 2015
Project Report submitted in support of
the degree of Bachelor of Engineering
Liam White
University of Bristol
Engineering Mathematics
Abstract
This paper makes use of statistical analysis, data mining and machine
learning tools to quantitatively determine the performance of football
players in the English Premier League (EPL).
Using data from Squawka, originally sourced from football analysis company
Opta, as well as player ratings from Sky Sports, we collate the events
performed by each player and the ratings for each player over the 2013/14
EPL season. The relationship between the events performed by each player
and the rating can be examined using machine learning algorithms, which
can be used to predict a player’s performance rating, based on events
performed by that player. This approach differs from the traditional
points-based system of ranking players.
Examining the relationship between a team’s final points score in the
EPL, versus the average rating of each player in the team over the season,
highlights teams that have underperformed and overperformed relative
to the average performance of their players. We hypothesise that this
overperformance could indicate either the presence of a ‘star player’ in the
team, or more simply, tactical astuteness.
The findings could be useful in the development of quantitative methods
for accurately determining the performance of football players, which
would have wide-ranging implications.
Contents
1 Introduction
2 Data Description
3 Statistics
3.1 Differentiating Top Performers in Each Position
3.2 Best Players in Each Position
3.3 Link between Individual and Team Performance
4 Results
4.1 Feature Analysis
4.2 Unsupervised Learning
4.2.1 Clustering by Events
4.2.2 Clustering by Spherical Coordinates
4.3 Supervised Learning
4.3.1 Individual Matches
5 Discussion and Conclusions
Appendix A Feature Selection
Appendix B Relationships Between Features
Appendix C Differentiating Top Performing Players
1 Introduction
Despite the wealth of data that exists on the subject, the evaluation of a
football player’s performance remains largely qualitative. Our opinions on the
effectiveness of a player are largely derived from those of pundits and journalists,
with little weighting given to quantitative methods. The statistics that are
examined rarely extend beyond goals, assists and passes, with few analysts
seeking to rank players based on multiple characteristics to develop a quantitative
player rating.
That said, some quantitative ranking systems do exist, with ‘fantasy football’ [1]
perhaps being the most famous. However, while the game is highly popular, this
ranking system is rarely referred to outside of the context of the fantasy football
game.
Another data-driven ranking system is that of Squawka [2]. The company
specialises in football statistics and their player rating system, like fantasy
football, is a points-based system where each event in a match is assigned a
points score. However, their algorithm has far greater complexity; every event
has a different score depending on the nature of the event and other parameters
such as its outcome and its location on the pitch, which is split into thirteen
separate zones. Despite its complexity and depth of analysis, however, this is
not something that is used widely in mainstream football analysis and punditry.
The use of statistics is far more prevalent in sports in the United States, where
teams and fans alike are more interested in deeper statistical analysis. Football
also appears to lag behind the more structured, play-by-play sports, namely
American football, cricket, tennis and baseball, in its use of data; for the latter,
this was widely publicised in Michael Lewis’ critically acclaimed book Moneyball
[3], which was later adapted into a screenplay. These sports consist of discrete
events that can be analysed in isolation, whereas football, as a continuous game,
has more complex interactions, subject to more variation, which are therefore
more difficult to classify and categorise. Simply put, in football, no two sequences
of play are the same.
The aim here is to develop a method of quantifying a player’s performance
through analysis of a variety of statistics, known as ‘events’, in each game of the
2013/14 EPL season. However, this study differs from Squawka and other
football data analysis projects in that it applies machine learning
methods, as opposed to a points system, to quantify the
performance of football players.
We focus initially on feature selection, aiming to find the high-variance
features that best differentiate between players and as such provide the strongest
grounding for classifying the performance of one player versus another.
Initially, we try an unsupervised approach, k-means clustering. As no ground
truth exists for a player’s performance rating, we first attempt to identify the
positions of each player, based on a player’s events over the season, in order
to test the method. We use four clusters, one for each position (goalkeepers,
defenders, midfielders and forwards). Following this, we attempt to identify a
performance rating based on a player’s events using k-means.
We also look at a supervised approach, using the k-nearest neighbours algorithm.
This involves training a classifier with a labelled training dataset and using
this classifier to predict the label of each player in the test dataset. As with
the unsupervised approach, this can be used to identify a player’s position or
performance rating.
Analysing the performance of teams relative to that of individual players within
the team can yield some interesting results. The performance of some teams
is highly correlated to the performance of individuals, whereas other teams
significantly outperform or underperform the average rating of their individuals:
we hypothesise that outperformance could be either due to the presence of a key
player in the team, or tactical astuteness.
More accurate, data-driven classification of players’ performance would have
deep implications in football. Most obviously, players and teams can review
performance in more depth, assessing areas for improvement. The experience for
fans can also be improved, with more statistics and analysis giving rise to more
topics for discussion and better informed debate.
There are applications in analysing the style of play of both players and teams,
where successful and unsuccessful styles can be determined. There are also highly
significant implications in the transfer market. With the amount of money in the
sport continually rising and expensive transfer targets often not performing as
predicted, the opportunity to better analyse transfer targets and identify players
that fit into a certain style of play is something that all clubs would benefit from.
2 Data Description
Two datasets are analysed in this paper. The first is from Squawka [2] and
contains event data for every player in every match in the 2013/14 EPL season.
The second dataset is Sky Sports player ratings data, a control which assists in
determining key events that best quantify a player’s performance and enables
the training of a performance classifier.
The Squawka dataset is extremely rich and separates the events that occur in
each match into a variety of different categories. Table 1 shows the categories of
events and sub-events in the data. Many different combinations
of sub-events exist. For instance, a pass could be an assist and a long ball, or a
headed through ball.
Event Type      Action Type   Event Subtype
All passes      Possession    Assist, long ball, through ball, headed
Cards           -             Red, yellow
Clearances      Defence       Headed
Corners         -             Swerve inward, swerve outward, assist
Crosses         -             Assist
Fouls           -             -
Goalkeeping     Defence       Catch, clearance, save
Goal attempts   Attack        Headed, shot, blocked, goal, swerve left,
                              swerve right, off-target, saved
Headed duels    Possession    -
Interceptions   Defence       -
Tackles         Defence       -
Takeons         Attack        -
Table 1: All events and sub-events in the Squawka dataset
This dataset is organised into individual matches, within which there is a player
list and an event list, as well as the date of the match and the teams involved
(home and away). Each individual event has a type and specifies the player
involved, his team, the match and the time at which the event occurred. The
data is incredibly rich, comprising 592,734 events from 620 different
players in all 380 matches.
The Sky Sports player ratings dataset also contains data from each match in
the 2013/14 EPL season. Within each individual match, a ‘player rating’ and
‘user rating’ is given for each player. The ‘player rating’ is determined by the
Sky Sports pundits and the ‘user rating’ voted upon by fans and users of the
Sky Sports website [4].
Figure 1 shows the distribution of the average Sky Sports player rating for all
players over all games in the season. The mean Sky Sports player rating is
6.31, with a standard deviation of 0.98. Approximating to a normal distribution
(mean plus or minus two standard deviations) suggests that roughly 95% of
players have a rating between 4.35 and 8.27.
Figure 1: Histogram showing the distribution of Sky Sports player ratings
Similarly, Figure 2 shows the distribution of the average Sky Sports user rating
for all players over all games in the season. The mean Sky Sports user rating is
lower at 5.80, with a larger standard deviation of 1.25. This higher variance
is to be expected, given that the user ratings are determined by a public vote.
Again, approximating to a normal distribution suggests that 95% of players have
a rating between 3.30 and 8.30. This range is broader and is also shifted
to the left, towards the lower ratings, suggesting that fans are harsher
critics than pundits. It could also be partly due to the fact that users’
votes can be biased towards their favourite team.
Figure 2: Histogram showing the distribution of Sky Sports user ratings
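The two ranges quoted above can be reproduced from the summary statistics alone. The Python sketch below is illustrative and is not part of the original analysis: the function name `two_sigma_interval` is hypothetical, and it assumes that the quoted ranges are the mean plus or minus two standard deviations, which is consistent with the figures in the text.

```python
def two_sigma_interval(mean, std):
    """Return (low, high) as the mean minus and plus two standard deviations."""
    return (mean - 2 * std, mean + 2 * std)

# Summary statistics quoted in the text for the two Sky Sports ratings.
player_low, player_high = two_sigma_interval(6.31, 0.98)
user_low, user_high = two_sigma_interval(5.80, 1.25)

print(f"player rating: {player_low:.2f} to {player_high:.2f}")
print(f"user rating:   {user_low:.2f} to {user_high:.2f}")
```

The user rating interval is wider because of the larger standard deviation, and its lower bound sits further to the left because of the lower mean.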
3 Statistics
3.1 Differentiating Top Performers in Each Position
In order to effectively classify the performance of players, it is useful to examine
the events that differentiate the top players from the rest. Here, we compare the
top 10% of players in each position to the rest of the players in the same position.
The rankings of players, for this purpose, are determined by the Sky Sports
player ratings and averaged over all games in the season, for each player. We
examine the distributions of individual events for all players in each position,
compared to the top 10% of players in each position.
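The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the code used in the study: the player records are invented, and the helper `top_fraction` is a hypothetical name.

```python
from statistics import mean

def top_fraction(players, fraction=0.10):
    """Return the highest rated `fraction` of players (at least one player)."""
    ranked = sorted(players, key=lambda p: p["rating"], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    return ranked[:n_top]

# Hypothetical midfielders: average season rating and goals scored.
midfielders = [
    {"name": "M01", "rating": 7.4, "goals": 9},
    {"name": "M02", "rating": 6.9, "goals": 4},
    {"name": "M03", "rating": 6.8, "goals": 3},
    {"name": "M04", "rating": 6.6, "goals": 2},
    {"name": "M05", "rating": 6.5, "goals": 2},
    {"name": "M06", "rating": 6.3, "goals": 1},
    {"name": "M07", "rating": 6.2, "goals": 1},
    {"name": "M08", "rating": 6.1, "goals": 0},
    {"name": "M09", "rating": 6.0, "goals": 1},
    {"name": "M10", "rating": 5.8, "goals": 0},
]

top = top_fraction(midfielders)
mean_goals_all = mean(p["goals"] for p in midfielders)
mean_goals_top = mean(p["goals"] for p in top)

print(f"mean goals, all midfielders: {mean_goals_all:.2f}")
print(f"mean goals, top 10%:         {mean_goals_top:.2f}")
```

The same pattern applies to any other event, such as clearances for defenders, by changing the key that is averaged.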
Figure 3 compares goals scored by all midfielders, with goals scored by the top
10% of midfielders. While there is not a perfect correlation, there is certainly a
relationship between goals scored and performance rating for midfielders. For
all midfielders, the mean goals scored over the season is 2.24; for the top 10%,
it is 5.81. In the top 10%, there are also only two players who have scored fewer
than three goals, with 50% of players scoring six goals or more.
Figure 3: Distribution comparing goals scored by all midfielders and the top
10% of midfielders
Figure 4 shows goals scored by all forwards versus goals scored by the top 10%
of forwards. There is a clear link between goals scored and performance rating
for forwards. The mean number of goals scored, over the course of the season,
for all forwards, is 5.53. No forward in the top 10% scored fewer than nine goals,
more than one and a half times the mean for all forwards.
Figure 4: Distribution comparing goals scored by all forwards and the top 10%
of forwards
Figure 5 compares clearances made by all defenders with the top 10% of defenders.
The link between clearances and performance rating for defenders is less obvious
than the previous relationships between goals scored and performance rating.
While the mean number of clearances is higher for the top 10% of defenders, 162 versus 113
for all defenders, 50% of top performing defenders made fewer than 120 clearances.
Figure 5: Distribution comparing clearances made by all defenders and the top
10% of defenders
The relationships between two further features and performance rating, for
defenders, midfielders and forwards are represented in Appendix C. This analysis
does not extend to goalkeepers, since there are fewer relevant features for
analysing the performance of goalkeepers.
3.2 Best Players in Each Position
Figure 6 shows a radar graph, comparing the top midfielder and forward to the
average in each position respectively. The top midfielder last season, again by
Sky Sports player ratings, was David Silva. The top forward was Luis Suarez.
The radar graph gives an indication of the areas in which these top players excel
compared to the average in each position. The features are normalised between
zero and one, enabling the radar graph to give an indication of the relative
magnitudes of features. In midfield, the key areas in which top player David
Silva excels are passes and chances created, where he significantly exceeds the
average, followed by goals and shots. In the forward position, Suarez scored the
most goals and took the most shots and created significantly more chances than
the average forward.
Figure 6: Radar graph comparing the highest rated forward and midfielder to
the average in each position
Figure 7 shows the radar graph comparing last season’s top defender, Luke Shaw,
with the average defender. Shaw made the most tackles of all defenders last
season and made significantly more passes than the average defender. Besides
these key areas, he slightly exceeded the average defender in chances created,
clearances and interceptions.
Figure 7: Radar graph comparing the highest rated defender to the average
defender
3.3 Link between Individual and Team Performance
Figure 8 shows the relationship between Sky Sports player rating, averaged
over each player in every team for all matches in the season, and final team
points at the end of the season. This can be seen as a proxy for whether a team
underperformed or outperformed over the course of the season. The trend line
on Figure 8 shows the expected final team points, based on average Sky Sports
player rating. A team below the trend line is deemed to have underperformed
relative to their average player performance, whereas a team above the trend
line is deemed to have scored more points than would be expected, given the
team’s average Sky Sports player rating.
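As an illustration of the trend-line comparison, the sketch below fits an ordinary least-squares line to team points against average rating and reports each team's residual. It is a sketch under stated assumptions: the team names and values are invented, the fitting method used for Figure 8 is not stated in the text, and the vertical residual is used here as a simple stand-in for the distance from the trend line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical teams: (average player rating, final points).
teams = {
    "Team A": (6.0, 40),
    "Team B": (6.2, 50),
    "Team C": (6.4, 72),
    "Team D": (6.6, 70),
}

ratings = [r for r, _ in teams.values()]
points = [p for _, p in teams.values()]
slope, intercept = fit_line(ratings, points)

# Positive residual: more points than the trend line predicts (outperformance).
residuals = {
    name: pts - (slope * rating + intercept)
    for name, (rating, pts) in teams.items()
}
best = max(residuals, key=residuals.get)
print(f"largest outperformance: {best} ({residuals[best]:+.1f} points)")
```

Teams with a negative residual sit below the line and would be read as underperforming relative to their average player rating.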
Figure 8: Graph showing the link between average player rating and total team
points in the 2013/14 EPL
We hypothesise that two possible explanations may exist for a team significantly
outperforming relative to average player rating. The first explanation is the
presence of a ‘star player’ who significantly outperforms all other players in the
team. The second, quite simply, is tactical astuteness, enabling a team to perform
beyond the sum of its parts.
The three teams with the largest Euclidean distance above the trend line are
Tottenham, Manchester United and Liverpool. Looking at the standard deviation
of Sky Sports player ratings within each team gives an indication as to the spread
of player performance within the team; the presence of a key player could increase
standard deviation. The mean of the standard deviations across all teams is
0.97. The standard deviation for Tottenham is also 0.97; however, the figures for
Manchester United and Liverpool are 1.03 and 1.07 respectively, considerably
exceeding the mean. Ranking teams by standard deviation puts Liverpool 3rd,
Manchester United 5th and Tottenham 8th out of the twenty teams in the league.
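The ranking by within-team spread can be sketched as follows. The ratings are invented for illustration, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the population or sample form was used.

```python
from statistics import pstdev

# Hypothetical average ratings of the players in three teams.
ratings_by_team = {
    "Team A": [7.6, 7.2, 6.5, 6.1, 5.8],
    "Team B": [6.6, 6.5, 6.4, 6.3, 6.2],
    "Team C": [7.0, 7.0, 6.6, 6.0, 5.9],
}

# Spread of player ratings within each team (population form, an assumption).
spread = {team: pstdev(ratings) for team, ratings in ratings_by_team.items()}

# Rank teams from the widest to the narrowest spread.
ranking = sorted(spread, key=spread.get, reverse=True)
for position, team in enumerate(ranking, start=1):
    print(f"{position}. {team}: standard deviation {spread[team]:.2f}")
```

A team whose best player is far ahead of the rest, such as the hypothetical Team A, tends to rank near the top of this list.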
We take a closer look at the distribution of average ratings for the teams in
question. For Liverpool, Luis Suarez has an average player rating of 7.60, which
significantly exceeds that of second ranked player Gerrard, whose player rating
is 7.21. This difference of 0.39 rating points is the largest difference between any
two consecutively ranked players at the club.
Moving on to Manchester United, Wayne Rooney is the highest ranked player,
with an average player rating of 7.31, which significantly exceeds that of second
ranked player Adnan Januzaj, who has an average player rating of 6.96. This
difference of 0.35 points is the second largest difference between any two
consecutively ranked players at the club.
Finally, we look at Tottenham. The joint highest ranked players are Christian Eriksen
and Emmanuel Adebayor, with ratings of 7.0. These ratings are significantly
higher than that of third ranked player Paulinho, who averages 6.61. This
difference of 0.39 points is the largest difference in rating between any two
consecutively ranked players at the club.
These results, for the three clubs with the highest outperformance relative to Sky
Sports average player ratings, provide strong evidence to suggest that significant
outperformance is explained by the presence of a ‘star player’ in the team,
or in the case of Tottenham, two ‘star players’ in the team, who significantly
outperform their team mates.
4 Results
4.1 Feature Analysis
With a huge variety of events present in the data, as shown previously in Table
1, some initial intuition is necessary to broadly decide which events to examine.
Then, from this initial subset of events, redundant features can be eliminated and
the best features, which give the most information, retained. Feature analysis
involves choosing the most appropriate subset of features in order to build a
model to make predictions; in this case, on a player’s performance.
Whilst the aim is to classify performance, there exists no ground truth player
performance rating. Therefore, it is necessary to test methods on another
characteristic, for which a ground truth does exist, in order to compare the
classifier’s prediction to known values and determine the effectiveness of a method.
For this purpose, we use a player’s position.
Before selecting features, however, it is necessary to pre-process the data. Figure
9 shows the distribution of events per player, for all players. Clearly, there are a
large number of players that have performed relatively few events. These are
the players that appear largely as substitutes, rarely starting games. There is
not enough data to classify these players and it is necessary to remove them, as
their presence could increase inaccuracies in the classifier.
Figure 9: Histogram showing the distribution of events per player
On average, 27 players play in each match. This corresponds to 22
starters (11 per team) and five substitutes; each team uses on average 2.5 out
of a maximum of three substitutes. The bin highlighted in red in Figure 9 is
the tail that has been removed, and this corresponds roughly to 5/27 of the data:
the substitutes. Before eliminating the substitutes, the mean number of events
per player is 1958; after elimination, the mean number is 2404, illustrating the
manner in which substitutes skew the data.
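A minimal sketch of this pre-processing step is given below. The event counts and the cut-off `MIN_EVENTS` are hypothetical; the text removes the lowest bin of the histogram in Figure 9 but does not state the exact threshold.

```python
from statistics import mean

def drop_low_activity(event_counts, min_events):
    """Keep only players with at least `min_events` events over the season."""
    return {player: n for player, n in event_counts.items() if n >= min_events}

# Hypothetical season event counts per player.
event_counts = {
    "starter_1": 2600,
    "starter_2": 2300,
    "starter_3": 2100,
    "substitute_1": 150,
    "substitute_2": 90,
}

MIN_EVENTS = 500  # hypothetical cut-off, not the value used in the study

kept = drop_low_activity(event_counts, MIN_EVENTS)
mean_before = mean(event_counts.values())
mean_after = mean(kept.values())
print(f"mean events per player: {mean_before:.0f} before, {mean_after:.0f} after")
```

As in the text, removing the low-activity tail raises the mean number of events per player.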
We choose a broad group of features, before using feature selection to find an
appropriate subset. This group of features consists of goals, shots, chances
created, successful passes, successful tackles, clearances, interceptions and saves.
In order to determine the features which best characterise a player’s position,
we look at each event individually and take a histogram of the frequency of
occurrence of each event for the players in each position (goalkeepers, defenders,
midfielders and forwards), separately.
Joining the peaks of each bin to create a line graph and overlaying the plots for
each position gives a visual representation of how each feature varies between
players in each position. Better features have less overlap between positions and
can distinguish between groups of players effectively.
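The per-position histograms behind these plots can be sketched as below. The clearance counts and bin edges are invented for illustration, and `histogram` is a hypothetical helper; using the same bin edges for every position is what makes the resulting lines directly comparable when overlaid.

```python
def histogram(values, bin_edges):
    """Count values per bin; bins are [low, high), with the last bin closed."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            upper_ok = v <= bin_edges[i + 1] if is_last else v < bin_edges[i + 1]
            if bin_edges[i] <= v and upper_ok:
                counts[i] += 1
                break
    return counts

# Hypothetical season clearance totals, grouped by position.
clearances = {
    "goalkeepers": [10, 25, 15],
    "defenders": [120, 160, 95, 180],
    "midfielders": [30, 60, 45, 80],
    "forwards": [5, 12, 20, 8],
}

bin_edges = [0, 50, 100, 150, 200]  # shared by all positions
per_position = {pos: histogram(v, bin_edges) for pos, v in clearances.items()}

for position, counts in per_position.items():
    print(f"{position:12s} {counts}")
```

In this toy data the defenders occupy the upper bins almost alone, which is the kind of separation that marks a useful feature.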
Since the distributions of events in each position vary considerably, it is difficult to
approximate to a particular probability distribution in order to use a quantitative
rule, such as a distance threshold between distributions, for feature selection.
For this reason, combined with the fact that the events distributions can easily
be analysed visually, we use qualitative feature analysis.
Figure 10 shows the distribution of clearances for players in each position and
this is an example of a good feature. The lines for each position, representing the
peaks of each bin in the histogram, are distinct with no overlap, indicating that
the feature differentiates between players effectively. Similarly, saves, shown in
Figure 11, is an essential feature, since it distinguishes goalkeepers from all other positions.
Figure 10: Distribution of clearances for players in each position
Figure 11: Distribution of saves for players in each position
Figure 12, the distribution of successful tackles, is an example of a feature which
gives less information. There are overlaps between all outfield players and as
such, the feature is less likely to assist in differentiating between players in
different positions.
Figure 12: Distribution of successful tackles for players in each position
The graphs for some of the remaining events are shown in Appendix A. Based
on these, we decide that the best features are passes, goals, saves, clearances,
chances created and interceptions. Here, we look at the results of the clustering
using the selected features.
4.2 Unsupervised Learning
4.2.1 Clustering by Events
Initially, we look at using an unsupervised learning approach to clustering. This
involves using the features to find clusters of similar players in the data, without
using any training labels.
It is necessary to normalise all the features, as the magnitudes vary significantly
for each feature. For example, passes has the largest magnitude by some distance
for most players, ranging from 64 to 1870. Goals, however, varies from 0 to 31.
Rescaling ensures that an approximately equal weight is placed on each feature.
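One common way to perform this rescaling is min-max normalisation, which maps each feature onto the interval from zero to one. The sketch below assumes that form; the text does not state which normalisation was applied before clustering, and the middle values in each list are invented, with only the end points taken from the ranges quoted above.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# End points are the ranges quoted in the text; middle values are hypothetical.
passes = [64, 967, 1870]
goals = [0, 10, 31]

scaled_passes = min_max_scale(passes)
scaled_goals = min_max_scale(goals)
print(scaled_passes)
print(scaled_goals)
```

After rescaling, a difference of one unit means the full observed range for both passes and goals, so neither feature dominates the distance calculation.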
We use k-means for the unsupervised cluster analysis. K-means aims to split
n data points into k clusters, by minimising the within-cluster sum of squared
distances between each data point and the mean of the cluster. The k-means are
initialised, often to random values, and the algorithm then proceeds iteratively
in two steps. In the first step, each data point is assigned to the cluster with
the nearest mean, which minimises the within-cluster sum of squares; in the second
step, the means are updated according to the new cluster membership. When the
cluster assignment ceases to change, the algorithm terminates, having reached a
locally optimal configuration [5].
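The two-step iteration described above can be written out directly. The following is a minimal pure-Python sketch, not the implementation used in the study: the function names, the random initialisation from the data points, the fixed seed and the toy two-feature data are all assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` into `k` groups; returns (assignment, means)."""
    rng = random.Random(seed)
    # Initialise the means to k distinct data points chosen at random.
    means = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign every point to the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, means[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing: a local optimum
        assignment = new_assignment
        # Step 2: move each mean to the centroid of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, means

# Hypothetical players described by two normalised features.
points = [
    (0.05, 0.90), (0.10, 0.80), (0.00, 0.95),
    (0.85, 0.10), (0.90, 0.20), (1.00, 0.05),
]
assignment, means = k_means(points, k=2)
print(assignment)
```

Because the result depends on the initial means, it is common to run the algorithm several times with different seeds and keep the solution with the smallest within-cluster sum of squares.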
K-means takes the number of clusters as a parameter and here we use four
clusters, since there are four possible positions. Comparing the membership of
the k-means predicted clusters with the known positions shows that players are
not clustered distinctly by position; however, analysing the membership of each
cluster yields some interesting results.
Figure 13 shows the distribution of player positions in each cluster (1,2,3,4).
Cluster 3 is made up entirely of goalkeepers, as would be expected, since one of
the features is saves and this is zero for any non-goalkeepers. This cluster has
23 members. Cluster 4 contains a heterogeneous mixture of all positions and
has a membership of 199 players. However, clusters 1 and 2 are the most interesting,
since cluster 1 contains predominantly forwards and midfielders and cluster 2
contains predominantly defenders and midfielders. Cluster 1 has a membership
of 93 players and cluster 2 has a membership of 111.
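The composition of each cluster can be obtained by cross-tabulating the cluster labels against the known positions. The sketch below uses invented labels and positions, and `cluster_composition` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def cluster_composition(cluster_labels, positions):
    """Count how many players of each position fall in each cluster."""
    table = defaultdict(Counter)
    for cluster, position in zip(cluster_labels, positions):
        table[cluster][position] += 1
    return dict(table)

# Hypothetical cluster labels from k-means and the known player positions.
cluster_labels = [3, 3, 1, 1, 1, 2, 2, 2, 4, 4]
positions = [
    "goalkeeper", "goalkeeper",
    "forward", "forward", "midfielder",
    "defender", "defender", "midfielder",
    "defender", "forward",
]

composition = cluster_composition(cluster_labels, positions)
for cluster in sorted(composition):
    print(cluster, dict(composition[cluster]))
```

Each row of the printed table corresponds to one panel of the kind shown in Figure 13.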
Figure 13: Distribution of player positions within each cluster: (a) Cluster 1, (b) Cluster 2, (c) Cluster 3, (d) Cluster 4
Whereas the midfield used to be a fairly distinct position, with most
teams playing a 4-4-2 formation, this is no longer the case in the modern game.
Now, midfielders are largely classified as defensive midfielders who sit in front
of the defence, or attacking midfielders who interact closely with the strikers,
with many teams adopting a 4-2-3-1 formation, with two defensive and three
attacking midfielders. In fact, some analysts group ‘traditional strikers’ and
attacking midfielders together as forwards, indicating the increasing fluidity in
the midfielder’s role.
To illustrate this, Table 2 splits the midfielders in clusters 1 and 2 into defensive
and attacking midfielders. This is determined by assessing the formations and
starting positions from Sky Sports [4] for each player in every team over the
course of the season. A player who starts the majority of matches in a holding
midfield role is classified as a defensive midfielder, whereas an attacking midfielder
is one who starts the majority of matches in an attacking midfield role.
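The majority rule above can be sketched in a few lines of Python. This is only an illustration: the role labels "holding" and "attacking", the helper name `classify_midfielder` and the example data are assumptions, not the Sky Sports records used in the study, and the text does not say how a player with no majority role is treated.

```python
from collections import Counter

def classify_midfielder(starting_roles):
    """Label a midfielder by the role started in more than half of the matches.

    starting_roles: one string per match started, e.g. "holding" or
    "attacking" (hypothetical labels chosen for this sketch).
    """
    counts = Counter(starting_roles)
    n = len(starting_roles)
    if counts["holding"] * 2 > n:
        return "Defensive"
    if counts["attacking"] * 2 > n:
        return "Attacking"
    # Assumption: no majority role means no label; the text does not cover this.
    return "Undetermined"

# Made-up season of starts for one player.
example_roles = ["holding"] * 20 + ["attacking"] * 8
print(classify_midfielder(example_roles))
```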
As expected, cluster 2 contains 20 defensive midfielders, with just 3 attacking
midfielders. Cluster 1 contains 17 attacking midfielders and 6 defensive
midfielders. This confirms the dynamic nature of the role of the midfielder in the
modern game.
| Cluster 2 midfielder (sample) | Emphasis | Cluster 1 midfielder (sample) | Emphasis |
|---|---|---|---|
| Arteta | Defensive | Barkley | Attacking |
| Bacuna | Attacking | Bolasie | Attacking |
| Barry | Defensive | Willian | Attacking |
| Britton | Attacking | Brunt | Attacking |
| Carrick | Defensive | Chadli | Attacking |
| Delph | Defensive | Cleverley | Defensive |
| Elmohamady | Attacking | Cork | Defensive |
| Gerrard | Defensive | Dyer | Attacking |
| Henderson | Defensive | Howson | Attacking |
| Huddlestone | Defensive | Kagawa | Attacking |
| Jedinak | Defensive | Kim Bo-Kyung | Attacking |
| Leiva | Defensive | Lallana | Attacking |
| Fernandinho | Defensive | Meyler | Attacking |
| McCarthy | Defensive | James Morrison | Attacking |
| Medel | Defensive | N’Zonzi | Defensive |
| Mulumbu | Defensive | Nolan | Attacking |
| Noble | Attacking | Osman | Attacking |
| Parker | Defensive | Ramires | Defensive |
| Richardson | Defensive | Silva | Attacking |
| Schneiderlin | Defensive | Townsend | Attacking |
| Sidwell | Defensive | Whittingham | Attacking |
| Tiote | Defensive | Wilshere | Defensive |
| Whelan | Defensive | Yacob | Defensive |

Table 2: Playing style of midfielders in clusters 1 and 2
4.2.2 Clustering by Spherical Coordinates
In addition to this method of clustering by using a set of features for each player,
another option is to cluster according to spherical coordinates: the ‘angles’
between events. This corresponds to a change of coordinates in the feature space,
from Cartesian to spherical coordinates, so that each player is characterised by
the angles between the player’s feature vector and the axes. Appendix B shows
some of the relationships between different features (absolute and angular).
Converting from the Cartesian system involves calculating the radius, $r$, which
is the Euclidean distance from the origin to each point (one for each player) in
the feature space, as shown in (1), where $x_1, \dots, x_n$ are the Cartesian
coordinates of each feature. The angular coordinates, $\phi$, are defined in
(2)...(5), where for $n$ Cartesian coordinates there exist $n-1$ angles $\phi$.

$$r = \sqrt{x_n^2 + x_{n-1}^2 + \dots + x_2^2 + x_1^2} \tag{1}$$

$$\phi_1 = \arccos \frac{x_1}{\sqrt{x_n^2 + x_{n-1}^2 + \dots + x_1^2}} \tag{2}$$

$$\phi_2 = \arccos \frac{x_2}{\sqrt{x_n^2 + x_{n-1}^2 + \dots + x_2^2}} \tag{3}$$

$$\vdots$$

$$\phi_{n-2} = \operatorname{arccot} \frac{x_{n-2}}{\sqrt{x_n^2 + x_{n-1}^2}} \tag{4}$$

$$\phi_{n-1} = 2 \operatorname{arccot} \frac{x_{n-1} + \sqrt{x_n^2 + x_{n-1}^2}}{x_n} \tag{5}$$
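The conversion in (1)...(5) can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `to_spherical` is hypothetical, the arccos form of (2) and (3) is used for every angle except the last (it is equivalent to the arccot form in (4) when the denominator is non-zero), and the last angle is computed with `atan2`, which agrees with (5) for non-negative inputs such as event counts.

```python
import math

def to_spherical(x):
    """Convert Cartesian coordinates x = [x1, ..., xn] (n >= 2) to (r, angles)."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two coordinates")
    # tail[k] = sqrt(x[k]^2 + ... + x[n-1]^2), using 0-based indices.
    tail = [0.0] * n
    acc = 0.0
    for k in range(n - 1, -1, -1):
        acc += x[k] * x[k]
        tail[k] = math.sqrt(acc)
    r = tail[0]
    angles = []
    for k in range(n - 2):
        if tail[k] == 0.0:
            angles.append(0.0)  # assumed convention when the angle is undefined
        else:
            ratio = max(-1.0, min(1.0, x[k] / tail[k]))  # guard against rounding
            angles.append(math.acos(ratio))
    # Last angle: half-angle identity turns the 2*arccot expression into atan2.
    angles.append(math.atan2(x[n - 1], x[n - 2]))
    return r, angles

# Made-up three-feature player vector.
r, angles = to_spherical([3.0, 4.0, 12.0])
print(r, angles)
```

A vector with n features yields n - 1 angles, so a player's radius can be dropped and only the angles passed on to the clustering step.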
Running k-means on the spherical coordinates derived from all the selected
features, for each player, produces similar clustering to the previous method
(using the feature set), but the results are more pronounced, as shown in Fig.
14.
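For illustration, k-means on angular coordinates can be sketched with a plain implementation of Lloyd's algorithm. The text does not state which k-means implementation or initialisation was used, so the function `kmeans`, the seed and the toy angle vectors below are all assumptions.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumed initialisation: k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels

# Made-up angle vectors (radians) for six players in two loose groups.
toy_angles = [[0.20, 1.30], [0.25, 1.35], [0.18, 1.28],
              [1.40, 0.30], [1.45, 0.25], [1.38, 0.33]]
centroids, labels = kmeans(toy_angles, k=2)
print(labels)
```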
As before, there is a cluster that contains solely goalkeepers (cluster 3), a
cluster that contains predominantly defenders and midfielders (cluster 4) and a
cluster that contains predominantly midfielders and forwards (cluster 1).
However, with this method the clusters are purer: there is only 1 defender in
cluster 1, compared to 75 forwards and 64 midfielders, and just 2 forwards in
cluster 4, compared to 44 defenders and 108 midfielders.
This result suggests that the relative frequency of events (as measured by
the spherical angles), rather than their absolute number, is a better quantity
for predicting a player’s position.
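The composition of the two clusters can be checked directly from the counts quoted above. The sketch assumes that the three quoted counts account for every member of each cluster, which the text does not state explicitly; `position_shares` is a hypothetical helper name.

```python
def position_shares(counts):
    """Return each position's share of the cluster membership."""
    total = sum(counts.values())
    return {position: n / total for position, n in counts.items()}

# Counts quoted in the text for the angular-coordinate clustering.
cluster_1 = {"defender": 1, "forward": 75, "midfielder": 64}
cluster_4 = {"forward": 2, "defender": 44, "midfielder": 108}

for name, counts in (("cluster 1", cluster_1), ("cluster 4", cluster_4)):
    shares = position_shares(counts)
    print(name, {pos: round(share, 3) for pos, share in shares.items()})
```

Under that assumption, defenders make up 1 of 140 players in cluster 1 and forwards 2 of 154 players in cluster 4.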
Whilst the results are largely positive, the clusters are not pure enough to
suggest that this method would be accurate for rating players. As such, we use
a supervised learning algorithm to classify ratings.
Figure 14: Distribution of player positions within each cluster (angular coordinates). Panels: (a) Cluster 1, (b) Cluster 2, (c) Cluster 3, (d) Cluster 4.
4.3 Supervised Learning
Supervised learning involves separating labelled data into a training and test
set, using the training set to train a classifier, which can be used to predict a
label for the test set. In this case, we again look at classifying the positions of
players before classifying performance.
Here, we use the k-nearest neighbours algorithm for supervised learning. In
k-nearest neighbours classification, the distance between each test data point
and every training data point is calculated, and the k training points closest
to the test point are selected. The class label of the test data point is
determined by a majority vote among these k nearest neighbours [6].
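A minimal sketch of this voting procedure is given below. It assumes Euclidean distance and an arbitrary tie-break (the class encountered first among the neighbours), neither of which is specified in the text; the function name `knn_predict` and the two-feature toy data are illustrative only.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict a label for `query` by majority vote among its k nearest neighbours."""
    # Rank training points by Euclidean distance to the query point.
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # Assumed tie-break: the tied class seen first among the neighbours wins.
    return votes.most_common(1)[0][0]

# Made-up scaled feature vectors (e.g. shots, clearances) with position labels.
toy_X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
toy_y = ["Forward", "Forward", "Defender", "Defender", "Defender"]

print(knn_predict(toy_X, toy_y, [0.85, 0.15], k=3))
```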
The number of nearest neighbours is required as a parameter, and we also vary
the size of the training set in order to find the optimum, which best classifies
the test set. In the first case, for classifying player position, the training data
consists of a scaled feature vector for each player, as well as the player position
as a label. Then, once the classifier is trained, given the feature set for the test
data, the classifier can predict the player’s position.
This works similarly for classifying performance, except rather than giving the
player’s position as a training label, the average Sky Sports player rating for
that player, over the course of the EPL 2013/14 season is given as a training
label. For the test set, the classifier predicts the player’s rating based on the
feature set given.
By using supervised learning, we can measure the performance of the classifier,
since we can compare the predicted output against the known labels of the test
set. In the case of player position, we examine whether the position predicted
by the classifier is the same as the actual position. For performance, if the
predicted rating is within 0.5 points (ratings given out of 10) of the Sky Sports
player rating, this is considered a success; anything outside this range is
considered a fail.
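The success criterion for ratings can be written as a short function. The sketch treats a difference of exactly the tolerance as a success, which is an assumption since the text only says "within 0.5 points"; the function name `rating_accuracy` and the ratings below are made up.

```python
def rating_accuracy(predicted, actual, tolerance=0.5):
    """Fraction of predictions within `tolerance` rating points of the actual rating."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(actual)

# Made-up predicted and actual ratings (out of 10) for four players.
predicted = [6.4, 7.1, 5.9, 6.8]
actual = [6.6, 6.2, 6.0, 7.5]

print(rating_accuracy(predicted, actual))                 # 0.5-point margin
print(rating_accuracy(predicted, actual, tolerance=1.0))  # relaxed margin
```

Keeping the tolerance as a parameter makes it straightforward to compare a strict and a relaxed margin on the same predictions.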
In order to truly test the accuracy of the classifier and to give an idea of whether
it will work on an independent data set, we use cross validation. Here, we
use a loop to shuffle the data and then split the data into training and test
datasets. This is performed 1000 times in order to create different training and
test datasets. The model also varies the number of nearest neighbours from 2 to
4 for positions and 1 to 3 for ratings and the size of the training set from 10%
to 50%, in order to optimise the classifier. The results of each permutation are
averaged and displayed graphically.
Error bars are displayed on each of the classification accuracy graphs. Each
time the data set is shuffled, the accuracy is stored; the mean is shown on the
accuracy graphs and the error bar shows one standard deviation each side of the
mean. A large standard deviation suggests high sensitivity to the training data.
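The shuffle-and-split loop and the error-bar statistics can be sketched as below. This is an illustration under several assumptions: the helper names, the one-dimensional toy data and the 1-nearest-neighbour scorer are invented for the example, and the population standard deviation is used because the text does not say which estimator was chosen.

```python
import random
import statistics

def repeated_split_accuracy(X, y, fit_and_score, train_fraction,
                            repeats=1000, seed=0):
    """Shuffle, split and score `repeats` times; return (mean, std) of the scores."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    n_train = max(1, int(round(train_fraction * len(X))))
    scores = []
    for _ in range(repeats):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        scores.append(fit_and_score([X[i] for i in train], [y[i] for i in train],
                                    [X[i] for i in test], [y[i] for i in test]))
    return statistics.mean(scores), statistics.pstdev(scores)

def one_nn_score(train_X, train_y, test_X, test_y):
    """Fraction of test points whose nearest training point shares their label."""
    hits = 0
    for x, label in zip(test_X, test_y):
        nearest = min(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
        hits += train_y[nearest] == label
    return hits / len(test_y)

# Made-up one-feature data set with two classes.
toy_X = [0.1 * i for i in range(20)]
toy_y = ["low" if i < 10 else "high" for i in range(20)]

for fraction in (0.1, 0.2, 0.3, 0.4, 0.5):
    mean_acc, std_acc = repeated_split_accuracy(toy_X, toy_y, one_nn_score,
                                                fraction, repeats=50)
    print(f"{fraction:.0%} training data: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Passing the scorer in as a function keeps the loop independent of the classifier, so the number of neighbours can be varied in the same way as the training fraction.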
We start with classifying player positions and Figure 15 shows the accuracy of
the classifier in predicting the positions of each player. Accuracy increases as the
percentage of training data and the number of nearest neighbours are increased.
The optimum accuracy achieved is 82.0%, with 50% training data and three
nearest neighbours.
Figure 15: Classification accuracy for player positions as parameters varied
We can use the same method to classify player performance; Figure 16 shows the
classifier accuracy. As before, the number of nearest neighbours used and the
percentage of training data are varied in order to find the optimum configuration.
Again, a 50% training set produces the best results, but for classifying player
performance, one nearest neighbour is optimum. Crucially, this configuration
of 50% training data and one nearest neighbour has a high success rate of 85.0%.
With just 10% training data and k = 1, a similar level of accuracy is achieved:
82.7%. For the k-nearest neighbours algorithm, when k = 1, the variance of
the classifier is highest. The fact that k = 1 works successfully in this situation
suggests that the most nearby examples, i.e. players with the most similar
feature sets, have very similar ratings.
Figure 16: Classification accuracy for player ratings as parameters varied
In order to test the significance of the results, we run the classifier again, but
rather than comparing the classifier’s predicted rating to the actual Sky rating, we
reshuffle the Sky ratings and compare. This simulates random guessing. Figure
17 shows the classifier accuracy for random guessing, comparing test data with
shuffled Sky ratings. Clearly, accuracy is significantly lower than for the classifier
when compared to the actual Sky player rating: 66.3% versus 85.0%. With
random guessing, the standard deviation is also significantly higher, rising from
an average of 3.4 in the actual classification to 9.1, suggesting high sensitivity to
training data.
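The shuffled-rating baseline can be sketched as follows: the predictions are left untouched and scored against a randomly permuted copy of the ratings. The function name `shuffled_baseline_accuracy`, the number of repeats, the seed and the example ratings are assumptions made for this illustration.

```python
import random

def shuffled_baseline_accuracy(predicted, actual, tolerance=0.5,
                               repeats=1000, seed=0):
    """Mean accuracy when predictions are scored against shuffled ratings."""
    rng = random.Random(seed)
    shuffled = list(actual)
    scores = []
    for _ in range(repeats):
        rng.shuffle(shuffled)
        hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, shuffled))
        scores.append(hits / len(shuffled))
    return sum(scores) / len(scores)

# Made-up ratings; predictions are perfect here so the true accuracy is 1.0.
actual = [5.0, 6.0, 7.0, 8.0, 9.0]
predicted = list(actual)

print(shuffled_baseline_accuracy(predicted, actual))
```

When ratings are concentrated in a narrow band, a shuffled rating often falls within the tolerance by chance, which is one plausible reason for a baseline well above zero.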
Figure 17: Classification accuracy for randomly shuffled Sky ratings
Since cross validation is used, it is reasonable to assume that this method could
work for other leagues, provided there are enough games in a season.
Since using spherical coordinates worked successfully in clustering, we
attempt to use spherical coordinates for classification instead of using scaled
features. The spherical coordinates correspond to the angles between the features,
as opposed to the magnitudes of the features. Figure 18 shows the classification
accuracy using this method. The results are not as accurate as those achieved
previously using scaled features: the maximum accuracy is 75.6% with two nearest
neighbours and 40% training data. This result suggests that the magnitude of
events, rather than the relative frequency between events, is a better quantity
for predicting a player’s performance rating.
Figure 18: Classification accuracy for player ratings as parameters varied, using
angular coordinates
4.3.1 Individual Matches
In order to test the classifier further, we examine its effectiveness in classifying
the performance of players in one particular match. This is a tough task, since the
variance is significantly higher in an individual match than over the course of a
season. Additionally, with significantly fewer events in a single match than over
the course of a season, event sets for each player are sparse and it is harder to
determine a player performance rating.
Figure 19 shows the performance of the classifier on a match between Manchester
United and Chelsea and Figure 20 shows the classifier’s performance on a match
between Stoke City and Hull City.
Figure 19: Classification accuracy for a single match: Manchester United vs
Chelsea
The classifier does not work as effectively for individual matches. Players’
performances are not consistent throughout a season and the feature set for a
player in an individual match is often sparse, making classification difficult.
In the Manchester United vs Chelsea game, the classifier is 52.8% successful
using two nearest neighbours and a 50% training set. In the Stoke City vs Hull
City match, the classifier is 58.0% successful with three nearest neighbours and
a 50% training set.
Figure 20: Classification accuracy for a single match: Stoke City vs Hull City
Lastly, we examine the effect of relaxing the margin for an accurate classification
to within 1.0 points of the Sky rating (as opposed to 0.5), for the Stoke City vs
Hull City match. Classification accuracy increases to a peak of 88.4% with three
nearest neighbours and either 30% or 50% training data.
Figure 21: Classification accuracy for a single match: Stoke City vs Hull City,
using a 1.0 point interval
The classifier works very well for a large dataset, which takes averages across a
season, but some further tuning is necessary in order for it to work accurately on
individual matches. However, since cross validation has been used, this classifier
should work on an unseen test dataset of player features, over the course of a
season, for another league or for other seasons of the EPL.
5 Discussion and Conclusions
Machine learning techniques offer an interesting and viable alternative to the
traditional points based system of ranking football players. Over the course of a
season, k-nearest neighbours can accurately predict a player’s performance rating.
The ability to accurately predict a player’s performance rating has a variety
of useful implications. Fans and pundits can analyse matches more thoroughly
and teams can give better feedback to their players and analyse opponents with
greater accuracy.
However, when analysing a single game in isolation, the performance prediction is
less accurate, since players do not perform uniformly throughout the season and
player ratings in a single match have higher variance than over the course of a
season. This is an area in which further work is needed, since the ability to predict
performance in a single match would enable analysis of individual performances
on a match-by-match basis. This is an advantage that the traditional, points
based scoring system of ranking players has over machine learning.
There are a number of improvements that could be made and next steps that
could be taken, in order to further this analysis. Training the classifier using
another set of player ratings, as labels, would test the classifier’s accuracy further.
Since no ground truth exists for the performance rating of a player, it could be
useful to examine different player performance ratings for use in training the
classifier.
The use of a quantitative method of feature selection, such as using the Hellinger
or Kolmogorov-Smirnov distance to measure the distance between distributions,
could improve classification accuracy. A qualitative method has been used here,
since it is difficult to approximate the distributions of events, across different
positions, to a probability distribution. More accurate feature selection could, in
particular, improve the classifier accuracy when analysing individual matches,
where there is significant room for improvement.
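As an indication of how such a quantitative comparison might look, the sketch below computes the Hellinger distance between two normalised histograms and the two-sample Kolmogorov-Smirnov statistic between two samples. It was not part of this study; the function names and the goal counts are made up for illustration.

```python
import math
from bisect import bisect_right

def normalise(counts):
    """Turn histogram counts into a discrete probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same bins."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q)) / 2.0)

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Made-up season goal totals for a handful of forwards and defenders.
forward_goals = [4, 7, 9, 12, 15, 21]
defender_goals = [0, 0, 1, 1, 2, 3]

# Histogram both samples over the same bins: 0-4, 5-9, 10-14 and 15+ goals.
def histogram(goals):
    return [sum(1 for g in goals if lo <= g <= hi)
            for lo, hi in ((0, 4), (5, 9), (10, 14), (15, 10 ** 6))]

print(hellinger(normalise(histogram(forward_goals)),
                normalise(histogram(defender_goals))))
print(ks_statistic(forward_goals, defender_goals))
```

Both measures lie between 0 and 1, so a feature whose distributions differ strongly between positions would score close to 1 and could be ranked ahead of one that scores close to 0.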
Finally, whilst results obtained from this study are significant, there is a limitation
in that only one league over the period of one season is considered. Cross
validation is used to test whether the classifier can be applied to other
datasets; however, in order to fully analyse the feasibility of using machine
learning techniques more widely to classify player performance, other seasons,
certainly, and ideally other leagues, must be considered.
List of Figures
1 Histogram showing the distribution of Sky Sports player ratings
2 Histogram showing the distribution of Sky Sports user ratings
3 Distribution comparing goals scored by all midfielders and the top 10% of midfielders
4 Distribution comparing goals scored by all forwards and the top 10% of forwards
5 Distribution comparing clearances made by all defenders and the top 10% of defenders
6 Radar graph comparing the highest rated forward and midfielder to the average in each position
7 Radar graph comparing the highest rated defender to the average defender
8 Graph showing the link between average player rating and total team points in the 2013/14 EPL
9 Histogram showing the distribution of events per player
10 Distribution of clearances for players in each position
11 Distribution of saves for players in each position
12 Distribution of successful tackles for players in each position
13 Distribution of player positions within each cluster
14 Distribution of player positions within each cluster (angular coordinates)
15 Classification accuracy for player positions as parameters varied
16 Classification accuracy for player ratings as parameters varied
17 Classification accuracy for randomly shuffled Sky ratings
18 Classification accuracy for player ratings as parameters varied, using angular coordinates
19 Classification accuracy for a single match: Manchester United vs Chelsea
20 Classification accuracy for a single match: Stoke City vs Hull City
21 Classification accuracy for a single match: Stoke City vs Hull City, using a 1.0 point interval
22 Distribution of goals for players in each position
23 Distribution of shots for players in each position
24 Distribution of chances created by players in each position
25 Relationship between the magnitudes of passes and goals; colour coordinated by cluster: cluster 1: dark blue, cluster 2: light blue, cluster 3: yellow, cluster 4: red
26 Relationship between the magnitudes of saves and tackles; colour coordinated by cluster: cluster 1: dark blue, cluster 2: light blue, cluster 3: yellow, cluster 4: red
27 Relationship between angular coordinates for clearances and tackles
28 Relationship between angular coordinates for goals and shots
29 Distribution comparing passes made by all midfielders and the top 10% of midfielders
30 Distribution comparing chances created by all midfielders and the top 10% of midfielders
31 Distribution comparing chances created by all forwards and the top 10% of forwards
32 Distribution comparing shots taken by all forwards and the top 10% of forwards
33 Distribution comparing interceptions made by all defenders and the top 10% of defenders
34 Distribution comparing tackles made by all defenders and the top 10% of defenders
List of Tables
1 All events and sub-events in the Squawka dataset
2 Playing style of midfielders in clusters 1 and 2
References
[1] Premier League. (2015). Rules: Scoring. Available: http://fantasy.premierleague.com/rules/. Last accessed 24th April 2015.
[2] Squawka. (2015). Squawka: Football Stats, Live Match Data and Player Statistics. Available: http://www.squawka.com/. Last accessed 24th April 2015.
[3] Lewis, M (2004). Moneyball. New York: W. W. Norton & Company.
[4] Sky. (2015). Sky Sports Football. Available: http://www1.skysports.com/football/. Last accessed 24th April 2015.
[5] MacKay, J (2003). Information Theory, Inference, and Learning Algorithms. 4th ed. Cambridge: Cambridge University Press.
[6] Beyer, K et al. (1999). When Is “Nearest Neighbor” Meaningful? Database Theory – ICDT’99. 217-220.
Appendix A Feature Selection
Figure 22: Distribution of goals for players in each position
Figure 23: Distribution of shots for players in each position
Figure 24: Distribution of chances created by players in each position
Appendix B Relationships Between Features
Figure 25: Relationship between the magnitudes of passes and goals; colour
coordinated by cluster: cluster 1: dark blue, cluster 2: light blue, cluster 3:
yellow, cluster 4: red
Figure 26: Relationship between the magnitudes of saves and tackles; colour
coordinated by cluster: cluster 1: dark blue, cluster 2: light blue, cluster 3:
yellow, cluster 4: red
Figure 27: Relationship between angular coordinates for clearances and tackles
Figure 28: Relationship between angular coordinates for goals and shots
Appendix C Differentiating Top Performing Players
Figure 29: Distribution comparing passes made by all midfielders and the top
10% of midfielders
Figure 30: Distribution comparing chances created by all midfielders and the
top 10% of midfielders
Figure 31: Distribution comparing chances created by all forwards and the top
10% of forwards
Figure 32: Distribution comparing shots taken by all forwards and the top 10%
of forwards
Figure 33: Distribution comparing interceptions made by all defenders and the
top 10% of defenders
Figure 34: Distribution comparing tackles made by all defenders and the top
10% of defenders