Project Paper - Final Deliverable

MBA 6637 Business Analytics Term Project, Deliverable 4 April 21, 2016 Madeline Russell AJ DeCato Kati Taryole Alicia Connolly


MBA 6637 Business Analytics Term Project, Deliverable 4

April 21, 2016

Madeline Russell, AJ DeCato

Kati Taryole, Alicia Connolly


Table of Contents

Business Understanding

Data Understanding & Preparation

Modeling

Evaluation

Deployment and Lessons Learned

Appendix I – Variable Definitions

Appendix II – JMP Outputs

Appendix III – Model Deployment Analysis


Business Understanding

Our project involves the Ultimate Fighting Championship (UFC). The business application is framed from the perspective of a gambling service company. The goal was to obtain relevant statistical information and determine which characteristics would allow such a company to set winning odds. The data mining/predictive analytics problem is determining which characteristics of a fighter make them more likely to win.

Sports gambling is on the rise, and from a sportsbook's business perspective the market is growing rapidly. An analytical understanding of the odds involved can make or break the bank. According to the American Gaming Association, sports betting has grown by at least 14.6% per year for the last five years¹. In 2015, almost $4.5 billion was legally wagered, specifically in Nevada sports books. To put the economic reach and impact of sports betting in perspective, the illegal betting market for the 2016 Super Bowl alone was estimated to be as large as $4.1 billion².

Figure 1

1 americangaming.org/newsroom/press-releasess/daily-fantasy-sports-grows-nevada-sports-books-thrive

2 americangaming.org/newsroom/press-releasess/nevada-record-shows-sport-betting-americas-new-national-pastime


There are several sports betting options, but one market leader has a clear understanding of how the odds work: Bovada.lv³. Similar sports books for which an understanding of sports analytics is critical include BetOnline.ag⁴ and TopBet.eu⁵. Figure 2 shows an example from Bovada.lv of one MMA/UFC fight. Each fighter has either over (positive), under (negative), or even odds to win. The more negative the odds, the more likely the fighter is to win. In this example, the analysts at Bovada believe Anthony Pettis has the better chance of winning. The -175 betting line on Pettis means that a person who bets $100 would win only $57.14, or $100 / (175/100). The same bet on the opponent, Edson Barboza, would win $145, or $100 * (145/100).
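The payout arithmetic above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the implied-probability helper is a standard conversion added here for context rather than something taken from the paper.

```python
def profit_on_win(american_odds: int, stake: float) -> float:
    """Profit (excluding the returned stake) on a winning bet at American odds."""
    if american_odds < 0:
        # Negative line (favorite): risk |odds| to win 100.
        return stake * 100 / abs(american_odds)
    # Positive line (underdog): a 100 stake wins `odds`.
    return stake * american_odds / 100


def implied_probability(american_odds: int) -> float:
    """Win probability implied by the line, before removing the book's margin."""
    if american_odds < 0:
        return abs(american_odds) / (abs(american_odds) + 100)
    return 100 / (american_odds + 100)


print(round(profit_on_win(-175, 100), 2))  # 57.14 (Pettis)
print(round(profit_on_win(145, 100), 2))   # 145.0 (Barboza)
```

The two implied probabilities for a fight normally sum to more than 1; the excess is the sportsbook's built-in margin.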

Figure 2

In the sports booking world, it is vital to understand sport statistics, especially those of UFC fighters, in order to set the odds appropriately. If the payout odds are set incorrectly, or the wrong fighter is projected to win, a company stands to lose a significant amount of money on winning payouts. Even a small misinterpretation of fighter data could potentially put a booking company out of business.

3 Business Source: sports.bovada.lv/ufc-mma

4 Business Source: betonline.ag/sportsbook

5 Business Source: topbet.eu/sportsbook/mma


Data Understanding & Preparation

UFC statistics were pulled from the FightMetric LLC website⁶ using the web crawler Import-IO. The data covers fights from November 1993 to March 2016. As originally pulled, it listed individual fighter statistics for each fight during that period and included the fighter, number of successful strikes, number of successful takedowns, number of successful submissions, number of successful passes, the fighter's height, weight, reach, and stance, the round in which the fight was won, the time within the round at which the fight was won, and the outcome of the fight (win, loss, or draw). With this data the original target variable was wins; however, every fight appeared twice because there were two fighters per fight. The data was therefore reformulated into totals of strikes, takedowns, submissions, and passes for each fighter. The resulting data set contained 2,049 individual fighters. Appendix I lists and defines all variables pulled, changed, and used throughout the project.

There were a number of issues with the data set. First, the round and time of the fight were similar variables, and the way they were pulled made them unusable, so they were taken out of the data set. A number of values were also missing for individual fighters. Research was done to fill in as many missing values as possible. Reach was one of the most frequently missing values, even after a search was done to find this information. According to the University of Texas Medical Branch⁷, an adult's reach is approximately 5 cm or 2 inches longer than their height; therefore an imputed reach was calculated with this formula for every fighter whose reach was missing. For many fighters the number of strikes, takedowns, passes, and submissions was missing as well. UFC is a fairly new sport and the data collected has changed over time, which explains why so much information is missing. Since many missing values were entered as zeros, fighters whose data was not collected were taken out. This removed roughly 200 fighters, or approximately 10% of the data set.

6 http://fightmetric.com/statistics/events/completed

7 http://www.utmb.edu/pedi_ed/CORE/Endocrine/page_09.htm
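The two cleaning rules described in this section can be sketched as follows. This is a minimal Python illustration, not the procedure actually carried out in JMP: the function names and dictionary keys are hypothetical, and treating a fighter as "not collected" when all four counts are zero is an assumption about how those rows were identified.

```python
from typing import Optional

STAT_KEYS = ("strikes", "takedowns", "submissions", "passes")  # hypothetical field names


def impute_reach(height_in: Optional[float], reach_in: Optional[float]) -> Optional[float]:
    """Keep a recorded reach; otherwise estimate it as height + 2 inches."""
    if reach_in is not None:
        return reach_in
    if height_in is None:
        return None  # nothing to impute from
    return height_in + 2.0


def has_recorded_stats(fighter: dict) -> bool:
    """Assumed rule: a fighter with all four counts at zero was never tracked."""
    return any(fighter.get(key, 0) > 0 for key in STAT_KEYS)


fighters = [
    {"name": "A", "height": 70.0, "reach": None, "strikes": 66, "takedowns": 2},
    {"name": "B", "height": 72.0, "reach": 74.0, "strikes": 0, "takedowns": 0},
]
kept = [f for f in fighters if has_recorded_stats(f)]
for f in kept:
    f["imputed_reach"] = impute_reach(f["height"], f["reach"])
print(kept)
```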

After the data cleaning was done, it was determined that proportion of wins would be a more appropriate target variable given how the data was collected. With proportion of wins as the target variable, it made sense to change strikes, takedowns, passes, and submissions to an average per fight instead of an overall count. These variables were right-skewed due to the amount of missing data, so the log of each value plus one was taken (shown in JMP output #1). See Appendix I for specifics on the variables, including transformations.
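The derived variables described above can be sketched as follows. This is a hedged Python illustration rather than the JMP column formulas actually used; the function names are hypothetical, and the natural log is assumed because the regression terms are labeled "Ln".

```python
import math


def proportion_of_wins(wins: int, losses: int, draws: int) -> float:
    """Wins divided by all fights (wins, losses, and draws)."""
    fights = wins + losses + draws
    if fights <= 0:
        raise ValueError("fighter has no recorded fights")
    return wins / fights


def log_average_per_fight(career_total: float, fights: int) -> float:
    """ln(average per fight + 1); the +1 keeps zero averages defined."""
    if fights <= 0:
        raise ValueError("fighter has no recorded fights")
    return math.log1p(career_total / fights)


print(proportion_of_wins(7, 3, 0))      # 0.7
print(log_average_per_fight(0, 5))      # 0.0
```

Adding one before taking the log matters here because many fighters average zero takedowns or submissions per fight, and the log of zero is undefined.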

Modeling

After creating our data set, we started by running descriptive statistics to better understand our data (see Appendix II for all relevant JMP outputs). Most of the descriptive statistics (shown in JMP output #1) showed a heavy right skew with multiple outliers. To get a more normally distributed data set, we log transformed several variables (shown in JMP output #2). This reduced the skewness somewhat, and we started to see relationships in the data.

To check for multicollinearity (how strongly the variables are related to each other), we examined the scatterplot matrix (shown in JMP output #3). Upon evaluation, we had to remove one variable because its correlation coefficient was over 0.8; the removed variable was proportion of passes. This left us with six variables to use for the rest of our modeling. Since our target variable, proportion of wins, is continuous, we used a regression tree (shown in JMP output #4). Our regression tree (without TVT) yielded an R-squared of 0.167 and leveled off after 12 splits. The R-squared tells us that about 16.7% of the variation in the proportion of wins is explained by the model given the included variables. In addition, the RMSE for our model was 0.1535, which is the typical deviation of the observed values from the model's predictions. The regression tree showed that the log of proportion of takedowns and the log of proportion of strikes influenced our target variable the most, at 35.6% and 27% respectively, followed by weight at 23.1% and imputed reach at 14.3%.
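For readers less familiar with the two fit statistics quoted above, the sketch below shows how they are commonly computed. It is an illustration in Python on made-up numbers, not the JMP output; JMP may divide by degrees of freedom rather than the raw count when reporting RMSE, so small differences should be expected.

```python
import math
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Share of the variation in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error: typical size of a prediction miss."""
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(ss_res / len(actual))


# Made-up proportions of wins, for illustration only.
observed = [0.5, 0.7, 0.9]
fitted = [0.6, 0.7, 0.8]
print(r_squared(observed, fitted), rmse(observed, fitted))
```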

In addition to the regression tree, we ran a multiple regression model (without TVT) (shown in JMP output #5) to estimate the effect of each variable. From the multiple regression we obtained the following formula, which allowed us to interpret the coefficients:

Prop Wins = 0.47 - 0.001(Weight) + 0.004(Imp. Reach) - 0.008(Stance(Other & Orthodox-Southpaw)) + 0.026(Ln Prop Strike) + 0.085(Ln Prop Td) - 0.069(Ln Prop Sub)

After running the multiple regression, we saved the residuals so we could determine whether they were normally distributed. First, we reran our descriptive statistics with the residuals included (shown in JMP output #6). Next, we used the graph builder to plot the residual proportion of wins against the predicted proportion of wins (shown in JMP output #7). From this, we were able to determine that our residuals looked normal. In addition, we built another scatterplot to see whether the variable “imputed reach” was evenly distributed throughout our data (shown in JMP output #8). The graph shows a fairly even distribution.

The last few models we ran used the training, validation, and test sets (TVT) (shown in JMP output #9). In the TVT regression tree, to avoid over-fitting the data we were only able to split 4 times. We ended up with an R-squared of 8.5% for training, 9.2% for validation, and 13.8% for the test data set. Within the regression tree we were also able to determine that, much like in our first tree, the log of proportion of takedowns, weight, and the log of proportion of strikes influenced our target variable the most, with 51.56%, 33.15%, and 15.28% respectively.


Using the TVT data set we were also able to run a multiple regression model (shown in JMP output #10). The R-squared for this model was 11.18% (training), 9.48% (validation), and 3.4% (test), and the model yielded a new equation:

Prop Wins = 0.475 - 0.001(Weight) + 0.004(Imp. Reach) - 0.017(Stance[Orthodox]) + 0.011(Stance[Other]) + 0.021(Ln Prop Strike) + 0.067(Ln Prop Td) + 0.018(Ln Prop Sub)
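The TVT equation can be turned into a small scoring function, sketched below in Python. The coefficients are the rounded values printed in the equation. The Southpaw adjustment is an assumption: it presumes JMP's usual effect coding for nominal variables, under which the omitted level's effect is the negative sum of the listed ones, here -(-0.017 + 0.011) = 0.006. The function name and the example inputs are hypothetical.

```python
# Stance adjustments. "Southpaw" is NOT in the printed equation; it is derived
# under the assumption of sum-to-zero (effect) coding.
STANCE_EFFECT = {"Orthodox": -0.017, "Other": 0.011, "Southpaw": 0.006}


def predicted_prop_wins(
    weight_lb: float,
    imputed_reach_in: float,
    stance: str,
    ln_prop_strike: float,
    ln_prop_td: float,
    ln_prop_sub: float,
) -> float:
    """Evaluate the TVT multiple regression equation for one fighter."""
    return (
        0.475
        - 0.001 * weight_lb
        + 0.004 * imputed_reach_in
        + STANCE_EFFECT[stance]
        + 0.021 * ln_prop_strike
        + 0.067 * ln_prop_td
        + 0.018 * ln_prop_sub
    )


# Illustrative inputs only (not a real fighter).
print(round(predicted_prop_wins(170, 72, "Orthodox", 1.7, 0.13, 0.06), 3))  # 0.621
```

Because the printed coefficients are rounded, results from this sketch will differ slightly from predictions saved directly in JMP.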

Evaluation

We were able to use TVT (Train-Validate-Test) when creating the models for our data because we had a large enough data set. We chose to split our data with 60% in training, 20% in validation, and 20% in test. The regression tree had a total of four splits with a minimum split size of ten. According to the regression tree, the most informative attribute in predicting proportion of wins is the log of proportion of takedowns, followed by weight and the log of proportion of strikes. Imputed reach, stance, and the log of proportion of submissions were not informative attributes in predicting proportion of wins.
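A 60/20/20 split like the one described above can be sketched in plain Python. This is only an illustration: the actual assignment was made inside JMP, and the function name, the fixed seed, and the use of a simple random shuffle are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def tvt_split(
    rows: Sequence, train: float = 0.6, validate: float = 0.2, seed: int = 0
) -> Tuple[List, List, List]:
    """Shuffle rows reproducibly and cut them into train/validate/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validate = int(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validate],
        shuffled[n_train + n_validate:],  # whatever remains is the test set
    )


# 1,847 is the number of fighters listed in Appendix I.
train_rows, validate_rows, test_rows = tvt_split(range(1847))
print(len(train_rows), len(validate_rows), len(test_rows))
```

The test rows are set aside and only scored once at the end, which is why the test R-squared is the figure used to compare the two models.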

Fighters with a log of proportion of takedowns greater than or equal to 0.0588 and a log of proportion of strikes greater than or equal to 1.634 had the highest average proportion of wins at 0.720. Fighters with a log of proportion of takedowns greater than or equal to 0.0588, a log of proportion of strikes less than 1.634, and a weight less than 194 pounds had an average proportion of wins of 0.6955. Fighters with a log of proportion of takedowns less than 0.0588 and a weight less than 178 pounds had an average proportion of wins of 0.6605. Fighters with a log of proportion of takedowns less than 0.0588 and a weight greater than or equal to 178 pounds had an average proportion of wins of 0.5998. Lastly, fighters with a log of proportion of takedowns greater than or equal to 0.0588, a log of proportion of strikes less than 1.634, and a weight greater than or equal to 194 pounds had an average proportion of wins of 0.5862.
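The five leaves described above amount to a short set of nested rules. The sketch below writes them out in Python using the thresholds and leaf averages quoted in the text; the function name is hypothetical, and the rules are transcribed from the prose rather than exported from JMP.

```python
def tree_mean_prop_wins(ln_prop_td: float, ln_prop_strike: float, weight_lb: float) -> float:
    """Return the average proportion of wins for the leaf a fighter falls into."""
    if ln_prop_td >= 0.0588:
        if ln_prop_strike >= 1.634:
            return 0.720           # active in both takedowns and strikes
        if weight_lb < 194:
            return 0.6955
        return 0.5862              # heavier fighters with fewer strikes
    if weight_lb < 178:
        return 0.6605
    return 0.5998


print(tree_mean_prop_wins(0.13, 1.70, 170))  # 0.72
print(tree_mean_prop_wins(0.00, 1.70, 205))  # 0.5998
```

The four `if` conditions correspond to the four splits of the TVT tree, which is why imputed reach, stance, and submissions never appear.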

According to the multiple regression model, weight is the most informative attribute in predicting proportion of wins, followed by imputed reach, log of proportion of strikes, log of proportion of takedowns, stance, and log of proportion of submissions. For every 10 pound increase in weight, proportion of wins is expected to decrease by 0.01. For every 1 inch increase in reach, proportion of wins is expected to increase by 0.005. If a fighter's stance is Orthodox, proportion of wins is expected to decrease by 0.017. If a fighter's stance is Other, proportion of wins is expected to increase by 0.012. Because strikes, takedowns, and submissions enter the model as natural logs (of the per-fight average plus one), a 1% increase in the underlying value changes the expected proportion of wins by roughly the coefficient divided by 100. Using the coefficients in the TVT equation, that is an increase of about 0.0002 for strikes (coefficient 0.021), about 0.0007 for takedowns (coefficient 0.067), and about 0.0002 for submissions (coefficient 0.018).

Additionally, the VIFs (Variance Inflation Factors) for all variables in the multiple regression model were under 5, which indicates that the model did not contain redundant (highly collinear) information.

Overall, the multiple regression model had an R-squared of 0.034 for the test set and the regression tree had an R-squared of 0.138 for the test set. Based on these results, the regression tree is the better model, though both models had a very low R-squared. Additionally, the models had some similarities and differences in the top 3 significant variables in predicting proportion of wins, as shown in Table 1.


Variable                 Regression Tree        Multiple Regression
R-squared (Test)         0.138                  0.034
Significant Variable 1   Proportion Takedowns   Weight
Significant Variable 2   Weight                 Reach
Significant Variable 3   Proportion Strikes     Proportion Strikes

Table 1

Deployment and Lessons Learned

After the data was cleaned, the models were built, and the necessary testing was finished, it was time to deploy the results against real market data to compare outcomes and test the model's veracity. This process would enable us to attempt to fulfill the business proposal we set out to achieve. As described in our model breakdown, we have an equation from the multiple regression, which consists of an intercept and the variables that affect the target. Deploying this model is relatively simple: each fighter's statistics are plugged into the formula to generate an estimate, and the estimates are used to predict the winning outcome, one fighter versus another.

There are ethical questions and important risks and issues to be aware of when using this particular model. Sports gambling is illegal in the majority of the United States. All operations would need to be housed in a state where this type of service is legal, such as Nevada, or outside the country. All sportsbooks used in the comparison are international firms and, as such, put their members at risk, since many states and banks still consider online gambling illegal if the resident's state outlaws the practice. Sports gambling is also continually questioned on ethical grounds, as many lives are ruined when people cannot play responsibly.


Regarding the risks, a model is only as good as its data. When running any type of prediction, you can only see through the window of the variables you chose to build it on. This could limit the length of time that a particular set of data is useful or valid, as new fighter statistics enter the pool almost every other week. Furthermore, it could be very risky to set odds for and against fighters if the data is skewed or misunderstood. At this point in developing the odds breakdown and picking a winner, there is not enough data to set odds with large, risky payouts. However, at a model average of 62.6% accuracy, setting odds in our favor, with estimated winners at -250 (under) and estimated long shots at 150, we end up profitable. When fighters are estimated to be close, the odds would be changed to -200 and 115 to mitigate risk while keeping betting revenue as high as possible.

Revenue Risk Model

Created Odds   Outside Bet   Favorite Payout   Longshot Payout
-250           $1,000        $400              $-
150            $1,000        $-                $500
115            $1,000        $-                $150
-200           $1,000        $500              $-

Table 2

With these risks in mind, a few implications can be gathered. First, it is important to keep in mind that this deployment model is created to set odds to earn revenue, not to make bets against other books' odds. Table 3 shows an example of our model versus the Vegas betting lines for UFC Fight Night on March 19th, 2016.


Model Odds Vegas Odds

8/12 66.7% 6/12 54.5%

Table 3

Our model produces an overall winning percentage average of 62.6% in the last six title fights, while Vegas odds produce an average of 60.9%. As shown in Appendix III, applying this average to our internal line odds produces betting profits at an average of 28.6%. Through the evaluation of the model, we believe that this data and its corresponding model outputs can help solve our business problem by making it easier for sportsbook analysts to decide which fighter to favor. Table 4 is an example of the deployment of the model. Note that both our model and the Vegas lines were incorrect for fight two.

Fighter                      Estimate   Result   Vegas Odds
Fight 1: Leslie Smith        3.80       W        -130
         Rin Nakai           1.18                110
Fight 2: Viscardi Andrade    3.24       W        102
         Richard Walsh       6.14                -122

Table 4 (Green = accurate model pick; Yellow = accurate Vegas line pick)
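The comparison in Table 4 can be scored with a few lines of Python, sketched below using the values from the table. Two points are assumptions rather than statements from the paper: that the model's pick is the fighter with the higher estimate, and that the Vegas pick is the fighter with the lower (more negative) line. The data layout is hypothetical.

```python
# Each fight: two (name, model estimate, Vegas line) tuples plus the actual winner.
fights = [
    {"a": ("Leslie Smith", 3.80, -130), "b": ("Rin Nakai", 1.18, 110),
     "winner": "Leslie Smith"},
    {"a": ("Viscardi Andrade", 3.24, 102), "b": ("Richard Walsh", 6.14, -122),
     "winner": "Viscardi Andrade"},
]


def model_pick(fight: dict) -> str:
    """Assumed rule: the higher model estimate is the predicted winner."""
    return max(fight["a"], fight["b"], key=lambda f: f[1])[0]


def vegas_pick(fight: dict) -> str:
    """The favorite carries the lower (more negative) American line."""
    return min(fight["a"], fight["b"], key=lambda f: f[2])[0]


model_hits = sum(model_pick(f) == f["winner"] for f in fights)
vegas_hits = sum(vegas_pick(f) == f["winner"] for f in fights)
print(f"model {model_hits}/{len(fights)}, Vegas {vegas_hits}/{len(fights)}")
```

On these two fights both approaches pick Smith correctly and Walsh incorrectly, which matches the note that both were wrong for fight two.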

Although the model does beat the average Vegas odds, there are a few improvements that could be made to the data and its collection. First, the data collected includes only successful strikes, not total attempted strikes. Attempted strikes, with their associated averages and proportions, could be significant to this model. Furthermore, a fighter's background, such as style of fighting, total career length, average fight time per match, and a count of win types (knockout, decision, submission), could prove very significant. In the future, multiple sources could be used not only to verify and collect data, but specifically to add other possibly significant variables, making the model more effective.


Appendix I – Variable Definitions

Variables Description

Fighters Fighter is the person who we have collected data on involving their fight statistics in the UFC. This is a nominal (qualitative) variable and there are a total of 1847 fighters in our data set. Due to the nature of this variable we did not run descriptive statistics.

Strikes (Str) Strikes is the total number of punches a fighter has landed in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that strikes comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is 66 with an IQR of 148. There aren’t any missing values for this variable. This variable does not appear to pose any significant problems. This variable was used to determine proportion of strikes out of total fights.

Takedowns (Td) Takedowns is the total number of times a fighter has taken an opponent down in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that takedowns comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is 2 with an IQR of 7. There aren’t any missing values for this variable. This variable was used to determine proportion of takedowns out of total fights.

Submission (Sub)

Submissions is the total number of times a fighter has caused an opponent to tap using some type of submission technique in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that submissions comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is 1 with an IQR of 3. There aren’t any missing values for this variable. This variable was used to determine proportion of submissions out of total fights.

Pass (Pass)

Pass is the total number of times a fighter has used a grappling technique that signifies an advance or change to a dominant position within a fight in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that pass comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is 2 with an IQR of 7. There aren’t any missing values for this variable. This variable was used to determine proportion of passes out of total fights.

Height (Ht) Height is how tall the fighter is in inches. This is a continuous (quantitative) variable. The descriptive statistics show that height comes from a normal distribution. There are a few outliers but the data does not appear to be skewed. The mean is 70.5 with a standard deviation of 3.34. There are 6 values missing for this variable. This variable was highly correlated with imputed reach, thus it was removed from the model.

Weight (Wt) Weight is how heavy the fighter is in pounds. This is a continuous (quantitative) variable. The descriptive statistics show that weight comes from a normal distribution, though there are some outliers and it appears right-skewed. The median is 170 with an IQR of 31. There are 7 values missing for this variable. This variable did not change much with a log transformation, thus it was kept in the model without a transformation.

Reach Reach is the distance in inches from fingertip to fingertip when a fighter spreads their arms out. This is a continuous (quantitative) variable. The descriptive statistics show that reach comes from a normal distribution. There aren’t any outliers and the data does not appear to be skewed. The mean is 71.81 with a standard deviation of 3.98. There are 774 values missing for this variable. This variable does pose an issue since there are a significant number of values missing. Therefore an imputed reach was calculated to correct for the missing values.

Imputed Reach Imputed reach is a fighter’s height plus two inches. This is a continuous (quantitative) variable. The descriptive statistics show that reach comes from a normal distribution. There aren’t any outliers and the data does not appear to be skewed. The mean is 71.61 with a standard deviation of 5.55. There aren’t any values missing for this variable. This variable does not appear to pose any significant problems.

Stance Stance is the position that a fighter keeps their hands when they fight. Orthodox is fighting with the right hand back and southpaw is fighting with the left hand back. Switch means the fighter does not consistently fight orthodox or southpaw but rather switches between the two. Sideways and open stance are other variations of stance but they are not commonly used. Stances were combined into three categories: orthodox, southpaw, and other. This is a nominal (qualitative) variable. The descriptive statistics show that 71.74% of the fighters fight orthodox, 16.13% fight southpaw, and 12.13% are other or unknown. This variable has 3 missing. This variable does not appear to pose any significant problems.

Wins Wins is the total number of times a fighter has won in their career with the UFC. This is a continuous (quantitative) variable and it is our target variable. The descriptive statistics show that wins comes from a normal distribution, though there are some outliers and it appears right skewed. The median is 13.57 with an IQR of 10. There is 1 value missing for this variable. This variable does not appear to pose any significant problems.

Losses Losses is the total number of times a fighter has lost in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that losses comes from a normal distribution, though there are some outliers and it appears right-skewed. The median is 5.83 with an IQR of 5. There is 1 value missing for this variable. This variable does not appear to pose any significant problems.

Draws Draws is the total number of times a fighter has been involved in a fight that did not have a winner in their career with the UFC. This is a continuous (quantitative) variable. The descriptive statistics show that draws comes from a normal distribution, though there are some outliers and it appears right-skewed. The median is 0.38 with an IQR of 0. There are no values missing for this variable. This variable does not appear to pose any significant problems.

Proportion of Wins (Prop Win)

Proportion of wins is the percentage of wins out of overall fights (wins, losses, and draws) in a fighter's career with the UFC. This is a continuous (quantitative) variable and it is our target variable. The descriptive statistics show that proportion of wins comes from a normal distribution, though there are some outliers and it appears left skewed. The median is .7 with an IQR of .17. There are 2 values missing for this variable. This variable does not appear to pose any significant problems. This variable was run with residuals, changing the median to .02 and the IQR to .16.

Proportion of Strikes (Prop Str)

Proportion of strikes is the average number of successful strikes that a fighter has per fight. This is a continuous (quantitative) variable. The descriptive statistics show that proportion of strikes comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is 4.467 with an IQR of 8.27. There are 2 missing values for this variable. This variable does not appear to pose any significant problems. The log plus 1 of this variable was used in the analysis. With the log transformation the median is .7 with an IQR of .7, the distribution is normal, and there are very few outliers. The log transformation of this variable was run with residuals; the median is 1.7 with an IQR of 1.46.

Proportion of Takedowns (Prop Td)

Proportion of takedowns is the average number of successful takedowns that a fighter has per fight. This is a continuous (quantitative) variable. The descriptive statistics show that proportion of takedowns comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is .14 with an IQR of .41. There are 2 missing values for this variable. This variable does not appear to pose any significant problems. The log plus 1 of this variable was used in the analysis. With the log transformation the median is 1.7 with an IQR of 1.46. The distribution is more normal with the log transformation, though there are still some outliers and it is still right skewed. The log transformation of this variable was run with residuals; the median is .13 with an IQR of .35.

Proportion of Submission (Prop Sub)

Proportion of submission is the average number of successful submissions that a fighter has per fight. This is a continuous (quantitative) variable. The descriptive statistics show that proportion of submissions comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is .06 with an IQR of .19. There are 2 missing values for this variable. This variable does not appear to pose any significant problems. The log plus 1 of this variable was used in the analysis. With the log transformation the median is .06 with an IQR of .17. The distribution is more normal with the log transformation, though there are still outliers and it is still right skewed. The log transformation of this variable was run with residuals; the median is .05 with an IQR of .17.

Proportion of Pass (Prop Pass)

Proportion of pass is the average number of successful passes that a fighter has per fight. This is a continuous (quantitative) variable. The descriptive statistics show that proportion of passes comes from a normal distribution, though there are many outliers and it appears right-skewed. The median is .13 with an IQR of .43. There are 2 missing values for this variable. This variable does not appear to pose any significant problems. The log plus 1 of this variable was used in the analysis. With the log transformation the median is .12 and the IQR is .36. The distribution is more normal with the log transformation, though there are still outliers and it is a bit right skewed. This variable was not used since it was highly correlated with proportion of takedowns.


Appendix II – JMP Outputs

JMP Output #1


JMP Output #2


JMP Output #3


JMP Output #4


JMP Output #5


JMP Output #6


JMP Output #7


JMP Output #8


JMP Output #9


JMP Output #10


Appendix III – Model Deployment Analysis

(See Excel Attachment)