Post on 18-Jan-2017
MBA 6637 Business Analytics Term Project, Deliverable 4
April 21, 2016
Madeline Russell, AJ DeCato
Kati Taryole, Alicia Connolly
Table of Contents
Business Understanding
Data Understanding & Preparation
Modeling
Evaluation
Deployment and Lessons Learned
Appendix I – Variable Definitions
Appendix II – JMP Outputs
Appendix III – Model Deployment Analysis
Business Understanding
Our project involves the Ultimate Fighting Championship (UFC), approached from the perspective of
a gambling service company. The goal was to obtain relevant fighter statistics and determine
which characteristics would help a gambling service company set the winning odds. The data
mining/predictive analytics problem is determining which characteristics of a fighter make them
more likely to win.
Sports gambling is on the rise, and from a sportsbook's business perspective the trend is growing
quickly. An analytical understanding of the odds involved will make or break the bank. According
to the American Gaming Association, sports betting has grown by at least 14.6% per year for the
last five years¹. In 2015, almost $4.5 billion was legally wagered in the Nevada sports books
alone. To put the economic reach and impact of sports betting in perspective, the illegal betting
market for the 2016 Super Bowl alone was speculated to be as large as $4.1 billion².
Figure 1
1 americangaming.org/newsroom/press-releasess/daily-fantasy-sports-grows-nevada-sports-books-thrive
2 americangaming.org/newsroom/press-releasess/nevada-record-shows-sport-betting-americas-new-national-pastime
There are several sports betting options available, but one market leader has a clear
understanding of how the odds work: Bovada.lv³. Similar sports books for which an understanding
of sports analytics is critical include BetOnline.ag⁴ and TopBet.eu⁵. Below is an example from
Bovada.lv of one MMA/UFC fight. Each fighter has either over, under, or even odds to win. The
more negative the odds, the more heavily that fighter is favored to win. This means that, in the
view of the analysts at Bovada, Anthony Pettis has the better odds to win. Furthermore, the -175
betting line on Pettis means that a person who bets $100 would win only $57.14, or
$100 / (175/100). The same $100 bet on the opponent, Edson Barboza, at +145 would win $145, or
$100 * (145/100).
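The payout arithmetic above can be sketched in a few lines of Python. This is only an illustration of the standard American-odds convention used in the example; the function name `profit_on_win` is our own and does not come from any sportsbook.

```python
def profit_on_win(american_odds: int, stake: float) -> float:
    """Profit on a winning bet (stake not included) at American odds."""
    if american_odds < 0:
        # Favorite: the bettor risks |odds| to win 100.
        return stake * 100 / abs(american_odds)
    # Underdog: the bettor risks 100 to win the quoted odds.
    return stake * american_odds / 100


pettis = profit_on_win(-175, 100)   # $100 / (175/100)
barboza = profit_on_win(145, 100)   # $100 * (145/100)
print(round(pettis, 2), round(barboza, 2))
```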
Figure 2
In the sports booking world, it is vital to understand sport statistics, especially those of UFC
fighters, in order to set the odds appropriately. If the payout odds are set incorrectly, or the
wrong fighter is projected to win, a company stands to lose a significant amount of money on
winning payouts. Even a small misreading of the fighter data could potentially put a booking
company out of business.
3 Business Source: sports.bovada.lv/ufc-mma
4 Business Source: betonline.ag/sportsbook
5 Business Source: topbet.eu/sportsbook/mma
Data Understanding & Preparation
UFC statistics were pulled from the FightMetric LLC website⁶ using the web crawler Import-IO.
The data covers fights from November 1993 to March 2016. As originally pulled, it listed
individual fighter statistics for each fight during that period and included fighter, number of
successful strikes, number of successful takedowns, number of successful submissions, number of
successful passes, fighter's height, fighter's weight, fighter's reach, fighter's stance, the
round in which the fight was won, the time within the round at which the fight was won, and the
outcome of the fight (win, lose, or draw). With this data the original target variable was wins.
However, each fight appeared twice, once for each of its two fighters, so the data was
reformulated into totals of strikes, takedowns, submissions, and passes for each fighter. This
data set contained 2,049 individual fighters. Appendix I lists and defines all variables pulled,
changed, and used throughout the project.
There were a number of issues with the data set. First, the round and time of the fight were
similar variables, and the way they were pulled made them unusable, so they were taken out of
the data set. There were also a number of missing values for each fighter. Research was done
to fill in as many missing values as possible. Reach was one of the most frequently missing
values for individual fighters, even after a search was done to find this information. According
to the University of Texas Medical Branch⁷, an adult's reach is approximately 5 cm or 2 inches
longer than their height, so an imputed reach was calculated with this formula for every fighter
whose reach was missing. For many fighters the numbers of strikes, takedowns, passes, and
submissions were missing as well. The UFC is a fairly new sport and the data collected has
changed over time, which explains why so much information is missing. Since many missing values
were entered as zeros, fighters whose data was not collected were taken out. This removed
roughly 200 fighters, or approximately 10% of the data set.
6 http://fightmetric.com/statistics/events/completed
7 http://www.utmb.edu/pedi_ed/CORE/Endocrine/page_09.htm
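The two cleaning rules described above can be sketched as follows. This is a minimal illustration, not the team's actual procedure: the field names and the sample records are hypothetical, and the report does not state the exact test used to decide that a fighter's data "was not collected", so the all-zero check below is an assumption.

```python
from typing import Optional

REACH_OFFSET_IN = 2.0  # reach is taken as about height + 2 inches when missing


def impute_reach(height_in: float, reach_in: Optional[float]) -> float:
    """Return the recorded reach, or height + 2 inches when reach is missing."""
    return reach_in if reach_in is not None else height_in + REACH_OFFSET_IN


def has_recorded_stats(fighter: dict) -> bool:
    """Assumption: a fighter whose four counts are all zero was never tracked."""
    keys = ("strikes", "takedowns", "submissions", "passes")
    return any(fighter[k] > 0 for k in keys)


# Hypothetical records, for illustration only.
fighters = [
    {"name": "A", "height": 70.0, "reach": None,
     "strikes": 120, "takedowns": 4, "submissions": 1, "passes": 3},
    {"name": "B", "height": 72.0, "reach": 75.0,
     "strikes": 0, "takedowns": 0, "submissions": 0, "passes": 0},
]

kept = [
    dict(f, imputed_reach=impute_reach(f["height"], f["reach"]))
    for f in fighters
    if has_recorded_stats(f)
]
print([(f["name"], f["imputed_reach"]) for f in kept])
```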
After the data cleaning was done, it was determined that proportion of wins would be a more
appropriate target variable given how the data was collected. With proportion of wins as the
target variable, it made sense to change strikes, takedowns, passes, and submissions from an
overall count to an average per fight. These variables were right-skewed due to the amount of
missing data, so the log of each value plus one was taken (shown in JMP output #1). See
Appendix I for specifics on variables, including transformations.
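As a small illustration of the "log plus one" transformation applied to a per-fight average, the sketch below uses the natural log, which matches the "Ln" labels in the regression equations. The function name and the sample numbers are hypothetical.

```python
import math


def log_per_fight(total: int, fights: int) -> float:
    """ln(1 + total / fights): the log-plus-one of a per-fight average."""
    if fights <= 0:
        raise ValueError("fights must be positive")
    return math.log1p(total / fights)


# Hypothetical fighter: 45 successful strikes over 10 fights.
ln_prop_strike = log_per_fight(45, 10)
# A fighter with no takedowns maps to exactly 0, which is why the +1 is needed.
ln_prop_td = log_per_fight(0, 10)
print(round(ln_prop_strike, 3), ln_prop_td)
```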
Modeling
After creating our data set, we started by running descriptive statistics to understand our
data better (see Appendix II for all relevant JMP outputs). Most of the descriptive statistics
(shown in JMP output #1) showed a heavy right skew with multiple outliers. To get a more
normalized data set we began by log transforming several variables (shown in JMP output #2).
This reduced the skewness somewhat, and we started to see relationships in the data.
To check for multicollinearity (how strongly the variables are related to each other), we
examined the scatterplot matrix (shown in JMP output #3). Upon evaluation, we had to remove one
variable, proportion of passes, because its correlation coefficient was over 0.8. This left us
with six variables to use for the rest of our modeling. Since our target variable, proportion of
wins, is continuous, we used a regression tree (shown in JMP output #4). Our regression tree
(without TVT) yielded an R-squared of 0.167 and leveled off after 12 splits. The R-squared tells
us that about 16.7% of the variation in the proportion of wins is explained by the model given the
included variables. In addition, the RMSE for our model was 0.1535, which is the typical size of
the model's prediction error. The regression tree showed that the log of proportion of takedowns
and the log of proportion of strikes influenced our target variable the most, at 35.6% and
27% respectively, followed by weight at 23.1% and imputed reach at 14.3%.
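The correlation screen described above was done by reading a JMP scatterplot matrix. A rough equivalent, sketched in plain Python under our own assumptions, is to compute pairwise Pearson correlations and flag any pair above the 0.8 cutoff. The column names and the five sample values per column are made up for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def highly_correlated(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical values, for illustration only.
columns = {
    "ln_prop_td": [0.1, 0.4, 0.7, 1.0, 1.3],
    "ln_prop_pass": [0.2, 0.5, 0.6, 1.1, 1.2],
    "weight": [155, 205, 170, 135, 185],
}
flagged = highly_correlated(columns)
print(flagged)
```

In this toy data only the takedown/pass pair is flagged, mirroring the decision to drop proportion of passes.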
In addition to the regression tree, we ran a multiple regression model (without TVT) (shown in
JMP output #5) to estimate the effect of each variable. From the multiple regression we obtained
the following formula, which allowed us to interpret the coefficients:
Prop Wins = 0.47 - 0.001(Weight) + 0.004(Imp. Reach) - 0.008(Stance(Other & Orthodox-Southpaw))
+ 0.026(Ln Prop Strike) + 0.085(Ln Prop Td) - 0.069(Ln Prop Sub)
After running the multiple regression, we saved our residuals so we could determine whether they
were normally distributed. First, we reran our descriptive statistics with the residuals included
(shown in JMP output #6). Next, we used the graph builder to see whether there was a relationship
between the residual proportion of wins and the predicted proportion of wins (shown in JMP output
#7). From this, we were able to determine that our residuals looked normal. In addition, we
built another scatterplot to see whether the variable "imputed reach" was evenly distributed
throughout our data (shown in JMP output #8). The graph shows a fairly even distribution.
The last few models we ran used the training, validation, and test sets (TVT) (shown in JMP
output #9). In the TVT regression tree, to avoid overfitting the data we were only able to
split 4 times. We ended up with an R-squared of 8.5% for training, 9.2% for validation, and 13.8%
for the test data set. Within the regression tree we were also able to determine that, much like
our first tree, the log of proportion of takedowns, weight, and the log of proportion of strikes
influenced our target variable the most, at 51.56%, 33.15%, and 15.28% respectively.
Using the TVT data set we were also able to run a multiple regression model (shown in JMP
output #10). The R-squared for this model was 11.18% (training), 9.48% (validation), and
3.4% (test), and it produced a new equation:
Prop Wins = 0.475 - 0.001(Weight) + 0.004(Imp. Reach) - 0.017(Stance[Orthodox]) +
0.011(Stance[Other]) + 0.021(Ln Prop Strike) + 0.067(Ln Prop Td) + 0.018(Ln Prop Sub)
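To show how the TVT equation turns a fighter's statistics into a predicted proportion of wins, here is a minimal sketch using the rounded coefficients printed above. Two points are assumptions on our part: the equation lists no Southpaw term, so Southpaw is treated as contributing 0 (JMP's own coding of nominal variables may differ), and the sample fighter is invented.

```python
import math

# Rounded coefficients copied from the TVT multiple regression equation.
COEF = {
    "intercept": 0.475,
    "weight": -0.001,
    "imp_reach": 0.004,
    "ln_prop_strike": 0.021,
    "ln_prop_td": 0.067,
    "ln_prop_sub": 0.018,
}
# Assumption: Southpaw has no listed term, so it is treated as 0 here.
STANCE = {"Orthodox": -0.017, "Other": 0.011, "Southpaw": 0.0}


def predict_prop_wins(weight, imp_reach, stance, strikes_pf, td_pf, sub_pf):
    """Predicted proportion of wins from per-fight averages (log-plus-one applied here)."""
    return (
        COEF["intercept"]
        + COEF["weight"] * weight
        + COEF["imp_reach"] * imp_reach
        + STANCE[stance]
        + COEF["ln_prop_strike"] * math.log1p(strikes_pf)
        + COEF["ln_prop_td"] * math.log1p(td_pf)
        + COEF["ln_prop_sub"] * math.log1p(sub_pf)
    )


# Hypothetical 170 lb orthodox fighter with a 72 in reach.
example = predict_prop_wins(170, 72, "Orthodox", 4.5, 0.14, 0.06)
print(round(example, 3))
```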
Evaluation
We were able to use TVT (Train-Validate-Test) when creating the models for our data because
we had a large enough data set. We chose to split our data with 60% in training, 20% in
validation, and 20% in test. The regression tree had a total of four splits with a minimum split
size of ten. According to the regression tree, the most informative attribute in predicting
proportion of wins is the log of proportion of takedowns, followed by weight and the log of
proportion of strikes. Imputed reach, stance, and the log of proportion of submissions were not
informative attributes in predicting proportion of wins.
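The 60/20/20 split was produced in JMP. As a rough sketch of the same idea in Python, the function below shuffles the rows with a fixed seed and slices them into three parts. The seed, the function name, and the use of 1,847 rows (the fighter count given in Appendix I) are illustrative assumptions, not a reproduction of JMP's assignment.

```python
import random


def tvt_split(rows, train=0.6, validate=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validate = int(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validate],
        shuffled[n_train + n_validate:],  # the remainder becomes the test set
    )


train_set, validate_set, test_set = tvt_split(range(1847))
print(len(train_set), len(validate_set), len(test_set))
```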
Fighters with a log of proportion of takedowns greater than or equal to 0.0588 and a log of
proportion of strikes greater than or equal to 1.634 had the highest average proportion of wins
at 0.720. Additionally, fighters with a log of proportion of takedowns greater than or equal to
0.0588, a log of proportion of strikes less than 1.634, and a weight less than 194 pounds had an
average proportion of wins of 0.6955. Fighters with a log of proportion of takedowns less than
0.0588 and a weight less than 178 pounds had an average proportion of wins of 0.6605. Fighters
with a log of proportion of takedowns less than 0.0588 and a weight greater than or equal to
178 pounds had an average proportion of wins of 0.5998. Lastly, fighters with a log of
proportion of takedowns greater than or equal to 0.0588, a log of proportion of strikes less than
1.634, and a weight greater than or equal to 194 pounds had an average proportion of wins of
0.5862.
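The five leaves described above can be written out as nested rules, which may make the tree easier to follow. The thresholds and leaf averages are the ones reported in the paragraph; the function name is our own.

```python
def tree_predict(ln_prop_td: float, ln_prop_strike: float, weight: float) -> float:
    """Average proportion of wins for the leaf a fighter falls into."""
    if ln_prop_td >= 0.0588:
        if ln_prop_strike >= 1.634:
            return 0.720
        # Lower strike rate: the tree splits on weight at 194 pounds.
        return 0.6955 if weight < 194 else 0.5862
    # Lower takedown rate: the tree splits on weight at 178 pounds.
    return 0.6605 if weight < 178 else 0.5998


print(tree_predict(0.13, 1.70, 170))
```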
According to the multiple regression model, weight is the most informative attribute in predicting
proportion of wins, followed by imputed reach, log of proportion of strikes, log of proportion of
takedowns, stance, and log of proportion of submissions. For every 10 pound increase in weight,
proportion of wins is expected to decrease by 0.01. For every 1 inch increase in reach, proportion
of wins is expected to increase by 0.005. If a fighter's stance is Orthodox, proportion of wins is
expected to decrease by 0.017. If a fighter's stance is Other, proportion of wins is expected to
increase by 0.012. Because strikes, takedowns, and submissions enter the model as logs, their
coefficients apply to the logged values: a one-unit increase in the log of proportion of strikes
is expected to increase proportion of wins by 0.021, a one-unit increase in the log of proportion
of takedowns by 0.067, and a one-unit increase in the log of proportion of submissions by 0.018.
Additionally, the VIFs (Variance Inflation Factors) for all variables in the multiple regression
model were under 5, which indicates that the model did not contain redundant information.
Overall, the multiple regression model had an R-squared of 0.034 for the test set and the
regression tree had an R-squared of 0.138 for the test set. Based on these results, the regression
tree is the better model, though both models had a very low R-squared. The two models also
differed in their top three significant variables for predicting proportion of wins, as shown in
Table 1.
Variable                 Regression Tree         Multiple Regression
R-squared (Test)         0.138                   0.034
Significant Variable 1   Proportion Takedowns    Weight
Significant Variable 2   Weight                  Reach
Significant Variable 3   Proportion Strikes      Proportion Strikes
Table 1
Deployment and Lessons Learned
After the data was cleaned, the models were built, and the necessary testing was finished, it was
time to deploy the overall results against real market data to compare the outcomes and test the
model's veracity. This process would enable us to attempt to fulfill the business proposal we set
out to achieve. As described in our model breakdown, we have an equation from the multiple
regression, which consists of an intercept and the variables that affect the target. Deploying
this model is relatively simple: each fighter's statistics are plugged into the formula to
generate an estimate, and the estimates for the two fighters are compared to predict the winner.
There are ethical questions, important risks, and other issues to be aware of when using this
particular model. Sports gambling is illegal in the majority of the United States. All operations
would need to be housed in a state where this type of service is legal, such as Nevada, or outside
the country. All sportsbooks used in the comparison are international firms and, as such, put
their members at risk, since many states and banks still consider online gambling illegal if the
resident's state outlaws the practice. Sports gambling is also continually questioned on ethical
grounds, since many lives are ruined when people cannot play responsibly.
Regarding the risks, a model is only as good as its data. When running any type of prediction,
you can only see through the window of the variables you chose to build it on. This could
limit the length of time that a particular set of data is useful or valid, as new fighter
statistics enter the pool almost every other week. Furthermore, it could be very risky to
set odds for and against fighters if the data is skewed or misunderstood. At this point in
developing the odds and picking winners, there is not enough data to set odds with large, risky
payouts. However, at a model average of 62.6% accuracy, setting odds in our favor, with estimated
winners at -250 and estimated long shots at +150, leaves us profitable. When fighters are
estimated to be close, the odds would be changed to -200 and +115 to mitigate risk while keeping
betting revenue as high as possible.
Revenue Risk Model
Created Odds   Outside Bet   Favorite Payout   Longshot Payout
-250           $1,000        $400              $ -
150            $1,000        $ -               $500
115            $1,000        $ -               $150
-200           $1,000        $500              $ -
Table 2
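One common way to check that a pair of lines is set in the house's favor is to convert each line to an implied probability and confirm that the two add up to more than 100%. The sketch below applies that check to the line pairs mentioned above. It is not the calculation behind the Revenue Risk Model table, and the function names are our own.

```python
def implied_probability(american_odds: int) -> float:
    """Break-even win probability implied by an American odds line."""
    if american_odds < 0:
        return abs(american_odds) / (abs(american_odds) + 100)
    return 100 / (american_odds + 100)


def overround(favorite_line: int, longshot_line: int) -> float:
    """Amount by which the two implied probabilities exceed 1 (the house margin)."""
    return implied_probability(favorite_line) + implied_probability(longshot_line) - 1


wide = overround(-250, 150)    # lines used when one fighter is a clear favorite
close = overround(-200, 115)   # lines used when the fighters are estimated to be close
print(round(wide, 4), round(close, 4))
```

A positive value means the combined implied probabilities exceed 100%, which is the margin built into the lines.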
With these risks in mind, a few implications can be gathered. First, it is important to keep in
mind that this deployment model was created to set odds that earn revenue, not to make bets
against other books' odds. Table 3 shows an example of our model versus the Vegas betting lines
for UFC Fight Night on March 19th, 2016.
Model Odds        Vegas Odds
8/12   66.7%      6/12   54.5%
Table 3
Our model produces an overall average winning percentage of 62.6% in the last six title fights,
while the Vegas odds produce an average of 60.9%. As shown in Appendix III, applying this average
to our internal line odds produces betting profits averaging 28.6%. Based on the evaluation of the
model, we believe that this data and its corresponding model outputs can help solve our business
problem by making it easier for sportsbook analysts to decide which fighter to choose. Table 4 is
an example of the deployment of the model. Note that both our model and the Vegas lines were
incorrect for fight two.
Fighter                      Estimate   Result   Vegas Odds
Fight 1: Leslie Smith        3.80       W        -130
         Rin Nakai           1.18                110
Fight 2: Viscardi Andrade    3.24       W        102
         Richard Walsh       6.14                -122
Table 4 (Green = accurate model pick; Yellow = accurate Vegas line pick)
Although the model does beat the average Vegas odds, there are a few improvements that could
be made to the data and its collection. First, the fighter data included only successful strikes
rather than full statistics on total attempted strikes. That data, with its associated
averages and proportions, could be significant to this model. Furthermore, a fighter's background,
such as style of fighting, total career length, average fight time per match, and a count of
types of win (knockout, decision, submission), could prove to be very significant. In the future,
multiple sources could be used not only to verify and collect data but also to add other
possibly significant variables, making the model more effective.
Appendix I – Variable Definitions
Variables Description
Fighters
Fighter is the person on whom we have collected UFC fight statistics. This is a nominal (qualitative) variable and there are a total of 1847 fighters in our data set. Due to the nature of this variable we did not run descriptive statistics.
Strikes (Str)
Strikes is the total number of punches a fighter has landed in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that strikes comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is 66 with an IQR of 148. There aren't any missing values for this variable. This variable does not appear to pose any significant problems. This variable was used to determine proportion of strikes out of total fights.
Takedowns (Td)
Takedowns is the total number of times a fighter has taken an opponent down in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that takedowns comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is 2 with an IQR of 7. There aren't any missing values for this variable. This variable was used to determine proportion of takedowns out of total fights.
Submission (Sub)
Submissions is the total number of times a fighter has caused an opponent to tap using some type of submission technique in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that submissions comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is 1 with an IQR of 3. There aren’t any missing values for this variable. This variable was used to determine proportion of submissions out of total fights.
Pass (Pass)
Pass is the total number of times a fighter has used a grappling technique that signifies an advance or change to a dominant position within a fight in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that pass comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is 2 with an IQR of 7. There aren’t any missing values for this variable. This variable was used to determine proportion of passes out of total fights.
Height (Ht)
Height is how tall the fighter is in inches. This is a continuous (quantitative) variable. The descriptive statistics show that height comes from a normal distribution. There are a few outliers but the data does not appear to be skewed. The mean is 70.5 with a standard deviation of 3.34. There are 6 values missing for this variable. This variable was highly correlated with imputed reach, thus it was removed from the model.
Weight (Wt)
Weight is how heavy the fighter is in pounds. This is a continuous (quantitative) variable. The descriptive statistics show that weight comes from a normal distribution, though there are some outliers and it appears right-skewed. The median is 170 with an IQR of 31. There are 7 values missing for this variable. This variable did not change much with a log transformation, thus it was kept in the model without a transformation.
Reach
Reach is the distance in inches from fingertip to fingertip when a fighter spreads their arms out. This is a continuous (quantitative) variable. The descriptive statistics show that reach comes from a normal distribution. There aren't any outliers and the data does not appear to be skewed. The mean is 71.81 with a standard deviation of 3.98. There are 774 values missing for this variable. This variable does pose an issue since a significant number of values are missing. Therefore an imputed reach was calculated to correct for the missing values.
Imputed Reach
Imputed reach is a fighter's height plus two inches. This is a continuous (quantitative) variable. The descriptive statistics show that imputed reach comes from a normal distribution. There aren't any outliers and the data does not appear to be skewed. The mean is 71.61 with a standard deviation of 5.55. There aren't any values missing for this variable. This variable does not appear to pose any significant problems.
Stance
Stance is the position in which a fighter keeps their hands when they fight. Orthodox is fighting with the right hand back and southpaw is fighting with the left hand back. Switch means the fighter does not consistently fight orthodox or southpaw but rather switches between the two. Sideways and open stance are other variations of stance but they are not commonly used. Stances were combined into three categories: orthodox, southpaw, and other. This is a nominal (qualitative) variable. The descriptive statistics show that 71.74% of the fighters fight orthodox, 16.13% fight southpaw, and 12.13% are other or unknown. This variable has 3 missing values. This variable does not appear to pose any significant problems.
Wins
Wins is the total number of times a fighter has won in their career with the UFC. This is a continuous (quantitative) variable and it is our target variable. The descriptive statistics show that wins comes from a normal distribution, though there are some outliers and it appears right-skewed. The median is 13.57 with an IQR of 10. There is 1 value missing for this variable. This variable does not appear to pose any significant problems.
Losses
Losses is the total number of times a fighter has lost in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that losses comes from a normal distribution, though there are some outliers and it appears right-skewed. The median is 5.83 with an IQR of 5. There is 1 value missing for this variable. This variable does not appear to pose any significant problems.
Draws
Draws is the total number of times a fighter has been involved in a fight that did not have a winner in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that draws comes from a normal distribution, though there are some outliers and it appears right-skewed. The median is 0.38 with an IQR of 0. There are no values missing for this variable. This variable does not appear to pose any significant problems.
Proportion of Wins (Prop Win)
Proportion of wins is the percentage of wins out of overall fights (wins, losses, and draws) in a fighter's career with the UFC. This is a continuous (quantitative) variable and it is our target variable. The descriptive statistics show that proportion of wins comes from a normal distribution, though there are some outliers and it appears left-skewed. The median is .7 with an IQR of .17. There are 2 values missing for this variable. This variable does not appear to pose any significant problems. This variable was run with residuals, changing the median to .02 and the IQR to .16.
Proportion of Strikes (Prop Str)
Proportion of strikes is the average number of successful strikes that a fighter has per fight. This is a continuous (quantitative) variable. The descriptive statistics show that proportion of strikes comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is 4.467 with an IQR of 8.27. There are 2 missing values for this variable. This variable does not appear to pose any significant problems. The log plus 1 of this variable was used in the analysis. With the log transformation the median is .7 with an IQR of .7, the distribution is normal, and there are very few outliers. The log transformation of this variable was run with residuals; the median is 1.7 with an IQR of 1.46.
Proportion of Takedowns (Prop Td)
Proportion of takedowns is the average number of successful takedowns that a fighter has per fight. This is a continuous (quantitative) variable. The descriptive statistics show that proportion of takedowns comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is .14 with an IQR of .41. There are 2 missing values for this variable. This variable does not appear to pose any significant problems. The log plus 1 of this variable was used in the analysis. With the log transformation the median is 1.7 with an IQR of 1.46. The distribution is more normal with the log transformation, but there are still some outliers and it is still right-skewed. The log transformation of this variable was run with residuals; the median is .13 with an IQR of .35.
Proportion of Submission (Prop Sub)
Proportion of submission is the average number of successful submissions that a fighter has per fight. This is a continuous (quantitative) variable. The descriptive statistics show that proportion of submissions comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is .06 with an IQR of .19. There are 2 missing values for this variable. This variable does not appear to pose any significant problems. The log plus 1 of this variable was used in the analysis. With the log transformation the median is .06 with an IQR of .17. The distribution is more normal with the log transformation, but there are still outliers and it is still right-skewed. The log transformation of this variable was run with residuals; the median is .05 with an IQR of .17.
Proportion of Pass (Prop Pass)
Proportion of pass is the average number of successful passes that a fighter has per fight. This is a continuous (quantitative) variable. The descriptive statistics show that proportion of passes comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is .13 with an IQR of .43. There are 2 missing values for this variable. This variable does not appear to pose any significant problems. The log plus 1 of this variable was used in the analysis. With the log transformation the median is .12 and the IQR is .36. The distribution is more normal with the log transformation, but there are still outliers and it is a bit right-skewed. This variable was not used since it was highly correlated with proportion of takedowns.
Appendix II – JMP Outputs
JMP Output #1
JMP Output #2
JMP Output #3
JMP Output #4
JMP Output #5
JMP Output #6
JMP Output #7
JMP Output #8
JMP Output #9
JMP Output #10
Appendix III – Model Deployment Analysis
(See Excel Attachment)