March Data Crunch Madness
Yixi Zhang • Jing Fan • Hailing Li • Yiyi Chen
Team Jordan
Introduction
Predicting March Madness is a national obsession in the US. We on Team Jordan analyzed the provided data with SPSS, Tableau, and Excel and put our own spin on predicting the outcomes. We pre-processed the data, built several models, and chose the one with the highest accuracy. We then analyzed the results and created charts that represent them meaningfully.
NCAA
Founded: 1939
No. of teams: 68
Regions: Four
Tournament statistics: low-seeded teams, high-seeded teams
Analytics
● What are the correlations between the independent variables?
● Does the distance between a team’s home campus and the game location correlate positively or negatively with the game result?
● Which factors have the greatest impact on the result of a game?
● What is each team’s probability of winning or losing?
Descriptive Analysis
Tool: Tableau
Possible Correlation between Location and Ranking of Team Tempo
The location of the game has an impact on team tempo. This chart shows the ranking of each team at the location where its game was played.
Descriptive Analysis Cont.
Tool: Tableau
Possible Correlation between Seed Difference and Team Adj Tempo
Seed Difference has a large impact on team performance (blue line).
Team Adj Tempo varies between teams (orange line).
Descriptive Analysis Cont.
Tool: Tableau
Possible correlation between the dependent variable Result (Win/Lose) and the independent variables Seed Difference (S-R), Distance, Team_adj_tempo, and Team_off_eff
Distance has a strong influence on both Win and Lose. Team Off Eff has some influence, while Seed Diff (S-R) has only a slight impact.
Data Pre-processing
● Split each game record into two records, one for each team that participated in the game
● Calculate the seed difference (rival team seed - team seed)
● Calculate the distance between the game location and each team’s location using the longitude and latitude of both places
● Define the dependent variable “result” as win/lose, indicating whether the team won or lost that game
● Review each independent variable’s relationship with the target variable and remove irrelevant variables
● Review the numeric attributes to make sure they are not correlated with one another
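The first four pre-processing steps can be sketched in a few lines of Python. This is only an illustration: the slides do not say which distance formula was used, so the haversine great-circle formula is an assumption, and the field names (`team_a`, `seed_a`, `lat_a`, `game_lat`, and so on) and the example game are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def split_game(game):
    """Turn one game record into two team-level records (hypothetical fields)."""
    rows = []
    for team, rival in (("a", "b"), ("b", "a")):
        rows.append({
            "team": game[f"team_{team}"],
            # Seed difference as defined on the slide: rival seed - team seed
            "seed_diff": game[f"seed_{rival}"] - game[f"seed_{team}"],
            "distance_km": haversine_km(game["game_lat"], game["game_lon"],
                                        game[f"lat_{team}"], game[f"lon_{team}"]),
            "result": "win" if game["winner"] == game[f"team_{team}"] else "lose",
        })
    return rows


# Made-up game record, for illustration only
example_game = {
    "team_a": "Team A", "seed_a": 1, "lat_a": 36.00, "lon_a": -78.94,
    "team_b": "Team B", "seed_b": 8, "lat_b": 43.07, "lon_b": -89.40,
    "game_lat": 39.77, "game_lon": -86.16, "winner": "Team A",
}
rows = split_game(example_game)
```

Each game yields two rows with opposite seed differences and opposite results, which is why the training set ends up balanced between wins and losses.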
Variable Selection
Dependent Variable: Result (win/lose)
Independent Variables
Seed_Diff(R-S) Team_adj_off_eff Team_ftpct Team_oppblockpct Team_stlrate
Distance(Km) Team_def_eff Team_blockpct Team_f3grate Team_oppstlrate
Team_tempo Team_adj_def_eff Team_oppfg2pct Team_oppf3grate
Team_adj_tempo Team_fg2pct Team_oppfg3pct Team_arate
Team_off_eff Team_fg3pct Team_oppftpct Team_opparate
Model
Three Testing Models
● Tool: IBM SPSS Advanced Modeler
● Models: Neural Network, Decision Tree, Bayes Network
● Training partition : testing partition = 7:3
When building the Neural Network model, we set the confidence to be based on the probability of the predicted value. We also selected bagging (a feature in SPSS) to enhance the model’s stability.
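The slides do not show how SPSS implements bagging internally, so the following is only a generic sketch of bootstrap aggregation: resample the training set with replacement, fit one base model per resample, and combine the models by majority vote. The base learner here is a hypothetical one-variable threshold rule on seed difference, standing in for the neural network, and the toy data is made up.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Toy base learner: choose the seed_diff threshold with the best accuracy.

    samples is a list of (seed_diff, result) pairs, result being "win" or "lose".
    The fitted model predicts "win" when seed_diff >= threshold.
    """
    best_threshold, best_hits = 0, -1
    for threshold in sorted({x for x, _ in samples}):
        hits = sum(("win" if x >= threshold else "lose") == y for x, y in samples)
        if hits > best_hits:
            best_threshold, best_hits = threshold, hits
    return best_threshold


def fit_bagged(samples, n_models=25, seed=0):
    """Fit n_models stumps, each on a bootstrap resample drawn with replacement."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(samples, k=len(samples)))
            for _ in range(n_models)]


def predict_bagged(thresholds, seed_diff):
    """Majority vote over the ensemble, plus the share of votes for the winner."""
    votes = Counter("win" if seed_diff >= t else "lose" for t in thresholds)
    label, count = votes.most_common(1)[0]
    return label, count / len(thresholds)


# Made-up training pairs: (rival seed - team seed, result)
toy = [(-15, "lose"), (-8, "lose"), (-3, "lose"), (-1, "win"),
       (1, "lose"), (3, "win"), (8, "win"), (15, "win")]
models = fit_bagged(toy)
label, vote_share = predict_bagged(models, seed_diff=15)
```

Because each model sees a slightly different resample, individual quirks tend to average out in the vote, which is the stability benefit bagging is meant to provide.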
Accuracy (%)     Neural Network   Decision Tree   Bayes Network
Training Data    87.48            73.16           71.53
Testing Data     68.97            68.01           72.22
● We adopted the Neural Network model
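The 7:3 partition and the accuracy figures in the table can be reproduced in spirit with a short Python sketch. It assumes records shaped like the pre-processed data (a `seed_diff` field and a `result` field); the `better_seed_wins` rule is a hypothetical stand-in for the SPSS models, and the records are made up.

```python
import random


def partition(records, train_share=0.7, seed=42):
    """Shuffle a copy of records and split it into training and testing lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def accuracy_pct(predict, records):
    """Percentage of records whose predicted result matches the actual result."""
    hits = sum(predict(r) == r["result"] for r in records)
    return 100.0 * hits / len(records)


def better_seed_wins(record):
    """Hypothetical stand-in model: predict a win for the better-seeded team."""
    return "win" if record["seed_diff"] > 0 else "lose"


# Made-up records, for illustration only
records = [{"seed_diff": d, "result": "win" if d > -2 else "lose"}
           for d in range(-10, 10)]
train, test = partition(records)
train_acc = accuracy_pct(better_seed_wins, train)
test_acc = accuracy_pct(better_seed_wins, test)
```

Reporting both numbers matters because a training accuracy well above the testing accuracy is a common sign of overfitting.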
Model Cont.
Make predictions
Final Output
The percentage indicates the likelihood that the first team in the key will win.
Top 5 important predictors:
● Seed_Diff(R-S)
● Team_adj_off_eff
● Team_off_eff
● Team_adj_def_eff
● Team_def_eff
Predictor importance from the Neural Network
Results & Conclusion
“MVP Variables”: the top five most important independent variables
● Seed difference: the difference in seed number between a team and its rival
● Team’s adjusted offensive efficiency: points scored per 100 possessions a team would have against the average D-I defense
● Team’s offensive efficiency: points scored per 100 offensive possessions
● Team’s adjusted defensive efficiency: points allowed per 100 possessions a team would have against the average D-I offense
● Team’s defensive efficiency: points allowed per 100 defensive possessions
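The raw (unadjusted) efficiency definitions above reduce to one formula: points divided by possessions, scaled to 100 possessions. A minimal sketch with made-up numbers follows; the adjusted variants depend on an opponent-strength adjustment that the slides do not define, so they are not computed here.

```python
def efficiency(points, possessions):
    """Points per 100 possessions.

    Pass points scored for offensive efficiency,
    or points allowed for defensive efficiency.
    """
    if possessions <= 0:
        raise ValueError("possessions must be positive")
    return 100.0 * points / possessions


# Made-up box-score numbers, for illustration only
off_eff = efficiency(points=72, possessions=66)  # points scored
def_eff = efficiency(points=65, possessions=66)  # points allowed
```

A team is outscoring its opponent per possession whenever its offensive efficiency exceeds its defensive efficiency.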
Results & Conclusion Cont.
Suggestions
The addition of the following data would have been helpful:
● Individual player data and injury data
● Individual regular season game data
● Coach data to expand the data set
● Data from more previous seasons to expand the data set
● Other temporal factors that may influence the game, such as:
o Rule changes
o NBA Draft (the thirty NBA teams can draft players who are eligible and wish to join the league)
o Upset testing (an upset is a victory by an underdog team)