1
MARCH DATA CRUNCH MADNESS
The Shooting Stars: Nan (Miya) Wang, John De Martino, Pritha Sinha, Armi Thassim
2
INTRODUCTION
Background: Every spring in the US, the National Collegiate Athletic Association (NCAA) holds a single-elimination tournament in which 68 college basketball teams compete.
Objective: Build an optimized model to predict the 2016 NCAA Finals, based on historical regular-season data from 2002 to 2015, by applying various machine learning techniques.
Results:
http://shootingstarsnyc.azurewebsites.net/
The link above points to our machine learning web API, which can help you make your own 2016 NCAA predictions!
3
ANALYSIS KPI
Model Performance Evaluation Metrics
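Find a set of predictions that minimizes log loss:
$$\text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
n: actual number of games played in the tournament
\hat{y}_i: predicted probability that team A beats team B in game i
y_i: actual binary outcome of each game (1 if team A wins, 0 otherwise)
Log loss heavily penalizes predictions that are both confident and wrong.
The goal is a balance between being too conservative and too confident.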
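The log loss metric described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the team's evaluation code: the function name `log_loss`, the clipping constant `eps`, and the toy outcomes are assumptions made for the example.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss over n games (lower is better).

    y_true: actual binary outcomes (1 if team A won, else 0).
    y_pred: predicted probabilities that team A beats team B.
    eps is an assumed clipping constant that avoids log(0).
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Toy outcomes (made up for illustration): team A won games 1, 3 and 4.
outcomes = [1, 0, 1, 1]
cautious = log_loss(outcomes, [0.5, 0.5, 0.5, 0.5])
confident_wrong = log_loss(outcomes, [0.99, 0.99, 0.99, 0.99])
```

In this toy case the always-0.99 predictor is right three times out of four, yet its single confident miss pushes its log loss above that of the predictor that always says 0.5.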
4
ANALYSIS PROCESS
Model Evaluation
5
DATA PREPARATION
Feature Transformation and Normalization
Rank to Score: Team 1 Adjusted Seed = 0.5 + 0.03 * (Team 2 Seed - Team 1 Seed)
Normalization: MinMax Scaler
Derive differences: Team 1 score of an attribute - Team 2 score of an attribute
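The three transformations on this slide can be sketched as small Python helpers. The function names, the attribute names, and the sample values are hypothetical. The min-max scaling here is a plain reimplementation of the idea for illustration, not necessarily the scaler the team used.

```python
def adjusted_seed(team1_seed, team2_seed):
    """Rank to score: 0.5 + 0.03 * (Team 2 Seed - Team 1 Seed)."""
    return 0.5 + 0.03 * (team2_seed - team1_seed)


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def attribute_differences(team1, team2):
    """Team 1 score of an attribute minus Team 2 score of the same attribute."""
    return {name: team1[name] - team2[name] for name in team1}


# Hypothetical matchup: a 1 seed against a 16 seed.
seed_score = adjusted_seed(1, 16)

# Hypothetical raw values of one attribute across three teams.
scaled = min_max_scale([10, 20, 30])

# Hypothetical, already-scaled attribute values for two teams.
diffs = attribute_differences(
    {"off_eff": 0.75, "def_eff": 0.5},
    {"off_eff": 0.25, "def_eff": 0.5},
)
```

With seeds ranging from 1 to 16, the adjusted seed stays between 0.05 and 0.95, and equal seeds map to exactly 0.5, so the value can be read as a rough prior win probability for Team 1.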
6
FEATURE SELECTION
Feature Correlation Heatmap
Feature Distribution Histogram
Correlation and Distribution
A few features are linearly correlated
Most features are approximately normally distributed
7
FEATURE SELECTION
Importance Plotting and Recursive Elimination
Log Loss for Different Feature Numbers
Feature Importance
Optimal Number of Features: 9
● Reduced from 97 features to 9 features
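One way to picture recursive elimination is the loop below: score the current feature set, drop the least important feature, and repeat, then keep the subset with the lowest loss. This is only a sketch of the general idea. The feature names, `TOY_IMPORTANCE`, `toy_importance`, and `toy_loss` are made-up stand-ins for a real model's feature importances and a cross-validated log loss, and they are not the team's data or results.

```python
def recursive_elimination(features, importance_fn, loss_fn):
    """Drop the weakest feature one at a time; return the best (subset, loss)."""
    remaining = list(features)
    history = []  # one (feature subset, loss) entry per subset size
    while remaining:
        history.append((list(remaining), loss_fn(remaining)))
        importances = importance_fn(remaining)
        weakest = min(remaining, key=lambda f: importances[f])
        remaining.remove(weakest)
    return min(history, key=lambda item: item[1])


# Hypothetical importances: three informative features and two noise features.
TOY_IMPORTANCE = {
    "off_eff": 0.40,
    "def_eff": 0.30,
    "blocks": 0.20,
    "noise_a": 0.06,
    "noise_b": 0.04,
}


def toy_importance(features):
    """Stand-in for refitting a model and reading its feature importances."""
    return {f: TOY_IMPORTANCE[f] for f in features}


def toy_loss(features):
    """Stand-in for cross-validated log loss: noise features hurt, useful ones help."""
    useful = sum(TOY_IMPORTANCE[f] for f in features if not f.startswith("noise"))
    noise_count = sum(1 for f in features if f.startswith("noise"))
    return 1.0 - useful + 0.05 * noise_count


best_features, best_loss = recursive_elimination(
    list(TOY_IMPORTANCE), toy_importance, toy_loss
)
```

In this toy setup the loss falls while noise features are removed and rises once informative ones start to go, which mirrors reading the optimal feature count off a "log loss versus number of features" plot.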
8
PERFORMANCE VALIDATION
Cross Validation and Different Training Size
Grid Searching/Parameter Tuning
Acceptable Model Performance Variation
Learning Curve
Overfitting when training size is under 45%
Partition Size: 50% - 50%
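Assuming the 50% - 50% partition refers to a train/test split of the historical games, a shuffled split could look like the sketch below. The function name `split_rows`, the fixed seed, and the integer rows standing in for games are all assumptions for illustration.

```python
import random


def split_rows(rows, train_fraction=0.5, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 100 placeholder game ids standing in for historical matchups.
train_rows, test_rows = split_rows(range(100), train_fraction=0.5)
```

Repeating the split with different `train_fraction` values and comparing training and validation loss is the usual way to build the kind of learning curve this slide mentions.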
9
PERFORMANCE VALIDATION
Model Fusion
RF, GBT and Logistic Regression are the top 3 models
Majority Voting
Leverage the information gleaned from different methods.
Minimize the flaws in each model.
Increase stability and accuracy.
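Majority voting across the three models can be sketched as follows. The 0.5 decision threshold and the sample probabilities are assumptions for the example. The slides do not say how each model's probability is turned into a vote.

```python
def majority_vote(model_probs, threshold=0.5):
    """Return 1 if most models predict a team A win, else 0.

    model_probs: one predicted win probability per model
    (for example RF, GBT and logistic regression).
    """
    votes = [1 if p >= threshold else 0 for p in model_probs]
    return 1 if sum(votes) * 2 > len(votes) else 0


# Hypothetical probabilities from three models for two matchups.
game_one = majority_vote([0.62, 0.48, 0.71])  # two of three models favor team A
game_two = majority_vote([0.30, 0.55, 0.40])  # two of three models favor team B
```

With an odd number of models there are no ties, which is one practical reason to fuse exactly three.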
10
PREDICTION REVIEW
Predicted Prob Distribution for 2016 NCAA
Our model is more decisive about “Gonna Win” teams while remaining ambiguous about “Gonna Lose” teams.
11
PREDICTION REVIEW
Predicted Round of 32 for 2016 NCAA
Our model correctly predicted 25 out of 32.
Accuracy: 78%
12
PREDICTION REVIEW
Predicted Sweet 16 for 2016 NCAA
Our model correctly predicted 12 out of 16.
Accuracy: 75%
13
PREDICTION REVIEW
Predicted Elite Eight for 2016 NCAA
Our model correctly predicted 6 out of 8.
Accuracy: 75%
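The per-round accuracies on these review slides follow directly from the reported counts, as this short check shows. The dictionary name and layout are illustrative. The counts are the ones stated on the slides.

```python
# (correct predictions, teams in the round) as reported on the review slides.
round_results = {
    "Round of 32": (25, 32),
    "Sweet 16": (12, 16),
    "Elite Eight": (6, 8),
}

# Accuracy per round as a fraction between 0 and 1.
round_accuracy = {
    name: correct / total for name, (correct, total) in round_results.items()
}

for name, acc in round_accuracy.items():
    print(f"{name}: {acc:.0%}")
```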
14
INTERESTING ANALYSIS
Top Teams and Cinderella Teams
Top Eight Teams from 2002 to 2015
Detailed performance of the eight top teams in each season?
15
INTERESTING ANALYSIS
Eight Top Teams
UNC, Michigan St., Connecticut, Kansas, Kentucky, Duke, Louisville, Florida
Championship Count:
1. Connecticut (3 times)
2. Duke; UNC; Florida (twice)
3. Kansas; Kentucky; Louisville (once)
Years Count:
1. Kansas (12 years)
2. Duke; UNC; Kentucky (11 years)
3. Florida; Michigan St. (10 years)
No Championship: Michigan St.
16
INTERESTING ANALYSIS
Top Teams and Cinderella Teams
Most Frequent “Cinderella” from 2002 to 2015
We define a Cinderella as follows: in each game, a winning team with a higher seed and a lower RPI than its opponent.
Top Teams being Cinderella: Michigan St., Connecticut, Kentucky
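The Cinderella definition above can be written as a one-line predicate. The sketch assumes that "higher seed" means a larger seed number (a nominally weaker team) and that "lower RPI" means a smaller RPI rating value than the opponent's. The slides do not spell out either convention, and the function name and sample values are hypothetical.

```python
def is_cinderella(winner_seed, loser_seed, winner_rpi, loser_rpi):
    """True if the winner had the higher seed number and the lower RPI value.

    Assumption: seeds are numeric (1 is the strongest), and RPI is a rating
    where a smaller value indicates a weaker regular-season profile.
    """
    return winner_seed > loser_seed and winner_rpi < loser_rpi


# Hypothetical game: a 12 seed (RPI 0.56) beats a 5 seed (RPI 0.61).
upset = is_cinderella(winner_seed=12, loser_seed=5, winner_rpi=0.56, loser_rpi=0.61)

# Hypothetical game: a 2 seed (RPI 0.65) beats a 7 seed (RPI 0.58).
expected_win = is_cinderella(winner_seed=2, loser_seed=7, winner_rpi=0.65, loser_rpi=0.58)
```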
17
INTERESTING ANALYSIS
Cinderella Teams
We define a Cinderella as follows: in each game, a winning team with a higher seed and a lower RPI than its opponent.
Model Prediction for Cinderella
Our model correctly identified all Cinderella teams.
Mean Score: 80%
18
CONCLUSION
Model Accuracy
On Training Dataset: Log loss: 0.46; Accuracy: 81%
On 2016 Testing Dataset: Accuracy: 75%-78%
Primary Factors for Win-Lose:
Self Attributes (in descending importance): offensive efficiency, defensive efficiency, blocked shots
Opponent Attributes: 2-point field goal shooting, 3-point field goal shooting
Outer Factor: distance
Useful Indicators: RPI, seed