March Data Crunch Madness

Yixi Zhang • Jing Fan • Hailing Li • Yiyi Chen

Team Jordan

Introduction

Predicting March Madness is a national obsession in the US. We, on Team Jordan, analyzed the provided data using SPSS, Tableau, and Excel, and put our own spin on predicting the outcomes based on our findings. We pre-processed the data, built several models, and chose the one with the highest accuracy. We then analyzed the results and created charts to represent them meaningfully.

NCAA

Founded: 1939
No. of teams: 68
Regions: Four Regions
Tournament statistics: Low seeded teams, Highly seeded teams

Analytics

● What are the correlations between the independent variables?

● Does the distance between a team’s home campus and the game location correlate positively or negatively with the game result?

● Which factors have the greatest impact on the result of a game?

● What is each team’s probability of winning or losing?
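The correlation questions above can be explored with a Pearson correlation coefficient. The sketch below is only an illustration: the `pearson` helper and the sample values are assumptions made up for this example, not the team's Tableau workflow or the tournament data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, NOT taken from the tournament data.
distance_km = [100.0, 400.0, 900.0, 1500.0, 2200.0]
result = [1, 1, 0, 1, 0]  # 1 = win, 0 = lose

r = pearson(distance_km, result)
print(f"correlation between distance and result: {r:.3f}")
```

A value near +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship. With a win/lose target coded as 1/0, this is the point-biserial form of the same coefficient.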

Descriptive Analysis

Tool: Tableau

Possible Correlation between Location and Ranking of Team Tempo

The location of the game has an impact on team tempo. This chart shows the ranking of each team at the location where the game was played.

Descriptive Analysis Cont.

Tool: Tableau

Possible correlation between Seed Difference and Team Adj Tempo

The Seed Difference has a huge impact on team performance. (Blue line)

The Team Adj Tempo varies between teams. (Orange line)

Descriptive Analysis Cont.

Tool: Tableau

Possible correlation between the dependent variable Result (Lose & Win) and the independent variables: Seed Difference(S-R), Distance, Team_adj_tempo, Team_off_eff

Distance has a strong influence on both Win and Lose. Team Off Eff has some influence, while Seed Diff(S-R) has only a slight impact.

Data Pre-processing

● Split each game record into two records, one for each team that participated in the game

● Calculate the seed difference (rival team seed - team seed)

● Calculate the distance between the game location and the team location using the longitude and latitude of both places

● Define the dependent variable “result” as win/lose, indicating whether the team won or lost the game

● Review each independent variable’s relationship with the target variable and remove irrelevant variables

● Review the numeric attributes and make sure they are not correlated with one another
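The first four pre-processing steps can be sketched as follows. This is a minimal illustration under stated assumptions: the field names (`team_a`, `seed_a`, `lat_a`, `game_lat`, `winner`, and so on) and the sample game are hypothetical, and the haversine great-circle formula is one common way to get a distance from latitude and longitude; the slides do not say which formula the team used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def split_game(game):
    """Turn one game record into two records, one from each team's perspective."""
    rows = []
    for team, rival in (("a", "b"), ("b", "a")):
        rows.append({
            "team": game[f"team_{team}"],
            "rival": game[f"team_{rival}"],
            # Seed difference as defined on the slide: rival seed - team seed.
            "seed_diff": game[f"seed_{rival}"] - game[f"seed_{team}"],
            "distance_km": haversine_km(
                game[f"lat_{team}"], game[f"lon_{team}"],
                game["game_lat"], game["game_lon"],
            ),
            "result": "win" if game["winner"] == game[f"team_{team}"] else "lose",
        })
    return rows

# Hypothetical game record with made-up teams and coordinates.
game = {
    "team_a": "Team A", "seed_a": 1, "lat_a": 40.0, "lon_a": -75.0,
    "team_b": "Team B", "seed_b": 16, "lat_b": 35.0, "lon_b": -90.0,
    "game_lat": 40.0, "game_lon": -75.0,
    "winner": "Team A",
}

for row in split_game(game):
    print(row)
```

In this example Team A plays at its home coordinates, so its distance is 0 km and its seed difference is 15, while Team B gets the mirrored record with a seed difference of -15.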

Variable Selection

Dependent Variable: Result (win/lose)

Independent Variables

Seed_Diff(R-S) Team_adj_off_eff Team_ftpct Team_oppblockpct Team_stlrate

Distance(Km) Team_def_eff Team_blockpct Team_f3grate Team_oppstlrate

Team_tempo Team_adj_def_eff Team_oppfg2pct Team_oppf3grate

Team_adj_tempo Team_fg2pct Team_oppfg3pct Team_arate

Team_off_eff Team_fg3pct Team_oppftpct Team_opparate

Model

Three Testing Models

● Tool: IBM SPSS Advanced Modeler

● Models: Neural Network, Decision Tree, Bayes Network

● Training partition : testing partition = 7:3

When building the Neural Network model, confidence was chosen based on the probability of the predicted value. Bagging (a feature in SPSS) was also chosen to enhance the model’s stability.
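The 7:3 partition and the idea behind bagging can be sketched in plain Python. This is not the SPSS implementation: the base learner here is a deliberately simple threshold rule on seed difference (an assumption chosen to keep the example short, standing in for the neural network), the data is synthetic, and the vote share used as "confidence" is only one possible way to derive it.

```python
import random
from collections import Counter

def split_70_30(rows, seed=0):
    """Shuffle the rows and split them 7:3 into training and testing partitions."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]

def fit_stump(rows):
    """Toy base learner: find the seed_diff threshold that classifies the most rows correctly."""
    best_threshold, best_correct = 0, -1
    for threshold in sorted({r["seed_diff"] for r in rows}):
        correct = sum((r["seed_diff"] >= threshold) == (r["result"] == "win") for r in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def fit_bagged(rows, n_models=25, seed=0):
    """Bagging: fit one base learner per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_models)]

def predict_bagged(thresholds, seed_diff):
    """Majority vote across the ensemble; the vote share serves as a rough confidence."""
    votes = Counter("win" if seed_diff >= t else "lose" for t in thresholds)
    label, count = votes.most_common(1)[0]
    return label, count / len(thresholds)

def accuracy(thresholds, rows):
    """Percentage of rows whose predicted label matches the recorded result."""
    hits = sum(predict_bagged(thresholds, r["seed_diff"])[0] == r["result"] for r in rows)
    return 100.0 * hits / len(rows)

# Synthetic, perfectly separable data for illustration only.
rows = [{"seed_diff": d, "result": "win" if d > 0 else "lose"}
        for d in range(-15, 16) if d != 0]

train, test = split_70_30(rows)
thresholds = fit_bagged(train)
label, confidence = predict_bagged(thresholds, seed_diff=15)
print(label, confidence, accuracy(thresholds, test))
```

Because each base learner sees a different bootstrap sample, the combined vote is less sensitive to any single sample, which is the stability benefit the slide refers to.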

Accuracy (%)     Neural Network   Decision Tree   Bayes Network
Training Data    87.48            73.16           71.53
Testing Data     68.97            68.01           72.22

● We adopted the model built by Neural Network
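As a small worked example, the gap between training and testing accuracy can be read off the table above. The numbers are copied from the slide; the dictionary layout and the `gaps` name are just illustrative choices.

```python
# Accuracy (%) figures copied from the table on this slide.
accuracy = {
    "Neural Network": {"train": 87.48, "test": 68.97},
    "Decision Tree": {"train": 73.16, "test": 68.01},
    "Bayes Network": {"train": 71.53, "test": 72.22},
}

# Training accuracy minus testing accuracy, in percentage points.
gaps = {name: round(acc["train"] - acc["test"], 2) for name, acc in accuracy.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} points")
```

A larger positive gap means the model fits the training partition noticeably better than the held-out partition.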

Model Cont.

Make predictions

Final Output

The percentage indicates the likelihood the first team in the key will win

Top 5 important predictors:

● Seed_Diff(R-S)

● Team_adj_off_eff

● Team_off_eff

● Team_adj_def

● Team_adj_def_eff

Predictor importance from Neural Network

Results & Conclusion

“MVP Variables” - top five most important independent variables

● Seed Difference: the difference in seed number between a team and its rival

● Team’s adjusted offensive efficiency: points scored per 100 possessions a team would have against the average D-I defense

● Team’s offensive efficiency: points scored per 100 offensive possessions

● Team’s adjusted defensive efficiency: points allowed per 100 possessions a team would have against the average D-I offense

● Team’s defensive efficiency: points allowed per 100 defensive possessions
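The raw (unadjusted) efficiency definitions above reduce to a simple per-100-possessions rate. The sketch below uses a hypothetical helper name and made-up game totals; the adjusted versions are not reproduced because the slides do not specify how the adjustment against the average D-I opponent is computed.

```python
def efficiency_per_100(points, possessions):
    """Points per 100 possessions (offensive if points scored, defensive if points allowed)."""
    if possessions <= 0:
        raise ValueError("possessions must be positive")
    return 100.0 * points / possessions

# Made-up totals for a single game, for illustration only.
off_eff = efficiency_per_100(points=72, possessions=66)  # points scored
def_eff = efficiency_per_100(points=65, possessions=66)  # points allowed

print(f"offensive efficiency: {off_eff:.1f}")
print(f"defensive efficiency: {def_eff:.1f}")
```

Higher is better for offensive efficiency, and lower is better for defensive efficiency.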

Results & Conclusion Cont.

Suggestions

The addition of the following data would have been helpful:

● Individual player data and injury data

● Individual regular season game data

● Coach data to expand the data set

● More previous seasons’ data to expand the data set

● Other temporal factors that may influence the game, such as:

  o Rule changes

  o NBA Draft (thirty teams from the NBA can draft players who are eligible and wish to join the league)

  o Upset testing (a victory by an underdog team)

http://en.wikipedia.org/wiki/Draft_(sports)

http://en.wikipedia.org/wiki/Eligibility_for_the_NBA_draft
