
Eun Seuk Choi (Eric)

Statistical Methods & Data Analytics

Final Project

Professor Alan Huebner

December 10, 2015

<Analysis on NBA Real Plus-Minus for 2014-2015 Regular Seasons>

Table of Contents


1. Introduction

a. Describe data

b. About variables

c. Purpose of analysis

2. Data

a. More details about data

b. The source of data

3. Regression Analysis

a. Exploratory data analysis

i. Scatterplots of each X variable vs. the Y variable

ii. Most highly correlated X variables

b. Linear Regression Analysis

i. Fit a full model and report the R^2

ii. Conduct one F-test to test for the removal of a subset of variables

iii. Use stepAIC()

iv. Find outliers

v. Choose the “final” model

vi. Perform model diagnostics (a residuals vs. fitted plot and a normal plot)

vii. Validate the model by cross validation or bootstrapping

4. Results

a. Three inferences about the final model and importance of each inference

i. A confidence interval for a fitted value

ii. A prediction interval for a fitted value

iii. A confidence interval for one or more slope parameters

5. Conclusion

a. How well the model describes the Y variable

b. Factors that can improve the predictive power of the model

1. Introduction


a. Describe data

The file “NBA real plus-minus for 2014-2015 regular seasons” contains data extracted from ESPN.com about an individual NBA player’s influence on his team’s wins, based on the number of games he played during the season, his minutes per game, his on-court impact on team offensive performance, and his on-court impact on team defensive performance. The data set consists of 474 NBA players who played in at least one game during the 2014-2015 regular seasons.

b. About variables

There are 5 variables in total: GP, M, ORPM, DRPM, and WINS. WINS is the response variable, and the other four are predictor variables. GP is the number of games played during the 2014-2015 regular seasons, out of 82. M is minutes per game for each player. ORPM is the player’s estimated on-court impact on his team’s offensive performance, measured in points scored per 100 offensive possessions, while DRPM is the player’s estimated on-court impact on his team’s defensive performance. WINS provides an estimate of the number of wins each player contributed to his team’s win total on the season; it is based on the player’s Real Plus-Minus and his number of possessions played.
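A quick way to see all five variables at once is to inspect the data frame after loading it (a minimal sketch, assuming NBARPM.txt, the file loaded in the analysis below, is in the working directory):

RPM1 <- read.table("NBARPM.txt", header=T)
str(RPM1)     # variable types for GP, M, ORPM, DRPM, WINS
summary(RPM1) # ranges; e.g., GP should lie between 1 and 82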

c. Purpose of analysis

By interpreting the results of a linear regression on these 5 variables (WINS as the response variable and the other four as predictors), I want to identify the primary factors that positively affect WINS. I will find the optimal model for predicting WINS by conducting an F-test for the removal of a subset of variables from the model, checking for outliers in the data, performing model diagnostics on my final model, and validating it with cross validation. Based on these results, I will make inferences pertinent to my topic using a confidence interval for a fitted value, a prediction interval for a fitted value, and a confidence interval for slope parameters. After evaluating my final model, I will finish this project by identifying ways to improve its predictive power.

2. Data

a. More details about data

The data were extracted from the ESPN.com website. The original data include 6 variables: the 5 variables mentioned above plus RPM. I excluded RPM because it is simply ORPM + DRPM; since RPM has a perfect correlation with that sum, there is no need to include it in my model.
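This kind of perfect collinearity is easy to confirm; a hypothetical check on a toy data frame (my working file no longer contains the RPM column):

toy <- data.frame(ORPM=c(1.2,-0.5,3.1), DRPM=c(0.4,2.0,-1.1))
toy$RPM <- toy$ORPM + toy$DRPM    # RPM is defined as the sum
cor(toy$RPM, toy$ORPM + toy$DRPM) # returns 1: RPM adds no information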

b. The source of data

The underlying source of the data is Basketball-Reference.com, which provided play-by-play data to ESPN; data analysts at ESPN then assembled the play-by-play data to construct the ORPM and DRPM figures with their own methods for the 2014-2015 regular seasons. According to ESPN, the ORPM and DRPM model sifts through more than 230,000 possessions each NBA season to tease apart the real plus-minus effects attributable to each player.

3. Regression Analysis

a. Exploratory data analysis

i. Scatterplots of each X variable vs. the Y variable

RPM1<-read.table("NBARPM.txt",header=T)

attach(RPM1)

[Figure: scatterplots of WINS against GP, M, and ORPM]

1) plot(GP,WINS)
2) plot(M,WINS)
3) plot(ORPM,WINS)

ii. Most highly correlated X variables

cor(cbind(GP,M,ORPM,DRPM))

According to the correlation matrix, GP and M are the most highly correlated pair of X variables (cor = 0.66).

b. Linear Regression Analysis

i. Fit a full model and report the R^2

mod.RPM<-lm(WINS~GP+M+ORPM+DRPM)


summary(mod.RPM)

R^2 = 0.8575, Adjusted R^2 = 0.8563

ii. Conduct one F-test to test for the removal of a subset of variables

Given that mod.RPM is the full model, I want to find out whether the set of three variables M, ORPM, and DRPM can be removed from my model by conducting an F-test for comparing nested models.

mod.reduced<-lm(WINS~GP)

summary(mod.reduced)

SSE.r<-sum(mod.reduced$residuals^2)

SSE.c<-sum(mod.RPM$residuals^2)

Fstat<-((SSE.r-SSE.c)/(4-1))/(SSE.c/(474-(4+1))) #3 dropped predictors; 474-5=469 error df

#Fstat=763.4384 (named Fstat rather than F, since F would mask R's shortcut for FALSE)

pf(Fstat,3,469,lower.tail=FALSE) # very low p-value

Given the very low p-value, we reject the null hypothesis: the group of 3 predictors cannot be removed from the model.
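The same nested-model comparison can also be run with R's built-in anova() function, which should reproduce the F statistic and p-value computed by hand above (a sketch):

anova(mod.reduced, mod.RPM) # nested-model F-test: reduced vs. full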

iii. Use stepAIC()

library(MASS)

optimal.bp <- stepAIC(mod.RPM)

optimal.bp$anova


Initial Model : WINS~GP+M+ORPM+DRPM

Final Model : WINS~GP+M+ORPM+DRPM

iv. Find outliers

rstandard(mod.RPM)

I found that two players, Draymond Green (observation 121) and Stephen Curry (observation 421), have standardized residuals (z-scores) greater than 3. They are outliers.
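Rather than scanning all 474 standardized residuals by eye, the flagged observations can be listed directly (a short sketch; for these data it should return rows 121 and 421):

z <- rstandard(mod.RPM) # standardized residuals
which(abs(z) > 3)       # observations with |z| > 3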

v. Choose the “final” model

I chose the initial model (WINS~GP+M+ORPM+DRPM) as the final model, since it has the highest adjusted R^2 among the candidate combinations of variables:

Adjusted R^2 for WINS~M+GP+ORPM+DRPM = 0.8563

Adjusted R^2 for WINS~M+GP+ORPM = 0.6335

Adjusted R^2 for WINS~M+GP+DRPM = 0.5572

Adjusted R^2 for WINS~GP+ORPM+DRPM = 0.8502, and so on.

The initial model has the highest adjusted R^2. In addition, according to the stepAIC() function, the initial model is the optimal model for these data.
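A compact way to reproduce this comparison (a sketch; the values in the comments are those reported above):

summary(lm(WINS~GP+M+ORPM+DRPM))$adj.r.squared # 0.8563
summary(lm(WINS~M+GP+ORPM))$adj.r.squared      # 0.6335
summary(lm(WINS~M+GP+DRPM))$adj.r.squared      # 0.5572
summary(lm(WINS~GP+ORPM+DRPM))$adj.r.squared   # 0.8502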

vi. Perform model diagnostics (a residuals vs. fitted plot and a normal plot)

plot(mod.RPM$fitted.values,mod.RPM$residuals)

[Figure: residuals vs. fitted values for mod.RPM; a second panel shows the same plot for mod.RPM3, produced by the code below]

Since the plot does not show a random pattern, I changed the model to reflect what I saw in plot(GP,WINS) and plot(M,WINS). Since those two plots show a quadratic pattern, I tried GP^2 and M^2 in a new model.

GP1<-GP^2

M1<-M^2

mod.RPM3<-lm(WINS~(GP1)+(M1)+ORPM+DRPM)

summary(mod.RPM3)

plot(mod.RPM3$fitted.values,mod.RPM3$residuals)

However, I got a graph similar to the one above, meaning that the linearity and constant-variance assumptions might not hold for my model. Additionally, I tried to obtain a better plot by applying quadratic, log, and exponential transformations to my parameters, but I could not


find a better one than the original model. Therefore, I decided to stick with my original

model.

qqnorm(mod.RPM$residuals)

[Figure: Normal Q-Q plot of mod.RPM$residuals]

On the other hand, the normal Q-Q plot of mod.RPM$residuals is approximately a straight increasing line, which indicates that the residuals might be normal.

hist(mod.RPM$residuals)

[Figure: histogram of mod.RPM$residuals]

In addition, the histogram of the residuals is approximately bell-shaped, supporting the claim that the residuals are normal.
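As a numeric complement to these visual checks (not part of the original analysis, just a sketch), a Shapiro-Wilk test could be run on the residuals; a large p-value would be consistent with normality:

shapiro.test(mod.RPM$residuals) # formal normality test on the residuals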


vii. Validate the model by cross validation

Using cross validation (code attached in the Appendix), rsquared.Group2 = 0.851 and rsquared.Group1 = 0.845. Since the mean of those two values, 0.848, is close to the R^2 = 0.8575 of the final model, I concluded that the model is valid.
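The comparison behind this conclusion is simply the average of the two held-out R^2 values against the full-model R^2:

mean(c(0.851, 0.845)) # 0.848, versus R^2 = 0.8575 on the full data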

4. Results

a. Three inferences about the final model and importance of each inference

i. A confidence interval for a fitted value

I chose to compute a 95% confidence interval for the mean WINS of all players who have ORPM=0, DRPM=0, M=20.43, and GP=54.29.

predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="confidence",level=0.95)

The result demonstrates that, with 95% confidence, the mean WINS for all players with ORPM=0, DRPM=0, M=20.43, and GP=54.29 falls within [2.705, 2.968]. I chose ORPM=0 and DRPM=0 because these values are greater than mean(ORPM)=-0.646 and mean(DRPM)=-0.278, and because ORPM=DRPM=0 is where a player breaks even in his offensive and defensive contributions to the team. Mean(M)=20.43 and mean(GP)=54.29 were chosen as the fitted values for M and GP so that I could better compare the WINS value against the ORPM and DRPM values. I can conclude that the mean WINS for all players with ORPM=0, DRPM=0, and average GP and M values, who perform better than average on both ends of the floor, falls within [2.705, 2.968].

ii. A prediction interval for a fitted value

I chose to compute a 95% prediction interval for the WINS of a “new” player who has ORPM=0, DRPM=0, M=20.43, and GP=54.29. (A prediction interval covers an individual player’s WINS rather than the mean, so it is wider than the confidence interval above.)

predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="prediction",level=0.95)


The result indicates that, with 95% confidence, the WINS of a “new” player with ORPM=0, DRPM=0, M=20.43, and GP=54.29 falls within [0.183, 5.490]. I chose ORPM=0, DRPM=0, M=20.43, and GP=54.29 for the same reasons as for the confidence interval above. I can conclude that the WINS of a new player with ORPM=0, DRPM=0, and average GP and M values, who performs better than average on both ends of the floor, falls within [0.183, 5.490].

iii. A confidence interval for one or more slope parameters

I chose to compute a 95% confidence interval for the slope of the ORPM variable.

Lower = 1.142995 - 1.96*0.03653 = 1.071396

Upper = 1.142995 + 1.96*0.03653 = 1.214594

Therefore, I am 95% confident that the slope of ORPM falls within [1.071396, 1.214594]. Since this interval does not contain 0, I can conclude that ORPM is a significant predictor in this model. This can also be verified with the low p-value for the ORPM variable.
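R can also compute this interval directly with confint(), which uses the exact t quantile rather than the 1.96 normal approximation above, so its endpoints may differ slightly (a sketch):

confint(mod.RPM, "ORPM", level=0.95) # exact t-based 95% CI for the ORPM slope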

5. Conclusion

a. How well the model describes the Y variable

In general, I found that my model describes my response variable (WINS) satisfactorily, as the model has values of about 0.85 for both R^2 and adjusted R^2. In particular, the results are consistent with my intuition that WINS increases as GP, M, ORPM, and DRPM increase, and that the increase in WINS is driven most strongly by ORPM and DRPM, since they have bigger slopes than GP and M.
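The slope comparison behind this claim can be read off the coefficient table (a sketch; the ORPM slope of about 1.143 is the one used in Section 4):

coef(summary(mod.RPM)) # estimates, SEs, t values, and p-values for each slope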

b. Factors that can improve the predictive power of the model

It would be better if I could find a model with a random pattern in its residuals vs. fitted values plot. I transformed some of my predictor variables in search of a better fit, but I could not find a better model than my final one.


Additionally, if I had PER (Player Efficiency Rating) as one of my predictor variables, the predictive power of the model might have increased, since PER is also positively correlated with WINS. If PER turned out not to be significantly correlated with my original predictor variables, adding it would let me better interpret how each player’s performance affects WINS.

<Appendix>

#attach data

RPM1<-read.table("NBARPM.txt",header=T)


attach(RPM1)

#scatterplots of each of predictor variables

plot(GP,WINS)

plot(M,WINS)

plot(ORPM,WINS)

#correlation matrix among predictor variables

cor(cbind(GP,M,ORPM,DRPM))

#Fit full model using all X’s and report R^2

mod.RPM<-lm(WINS~GP+M+ORPM+DRPM)

summary(mod.RPM)

#Use a reduced model to conduct F-test

mod.reduced<-lm(WINS~GP)

summary(mod.reduced)

SSE.r<-sum(mod.reduced$residuals^2)

SSE.c<-sum(mod.RPM$residuals^2)

Fstat<-((SSE.r-SSE.c)/(4-1))/(SSE.c/(474-(4+1)))

Fstat

pf(Fstat,3,469,lower.tail=FALSE)


#Note: in the original script the statistic was stored in a variable named F, which masks R's shortcut F for FALSE; lower.tail=F then silently passed the F statistic (treated as TRUE), so pf() returned 1 until R was restarted. Renaming the variable to Fstat and writing lower.tail=FALSE gives the true p-value, about 3.28e-180.

#Use StepAIC to find final (optimal) model

library(MASS)

optimal.bp <- stepAIC(mod.RPM)

optimal.bp$anova # display results

#Find outliers

rstandard(mod.RPM)

#model diagnostics (fitted values vs residuals and normal plot)

plot(mod.RPM$fitted.values,mod.RPM$residuals)

qqnorm(mod.RPM$residuals)

hist(mod.RPM$residuals)

#model diagnostics with changed model (fitted values vs residuals and normal plot)

GP1<-GP^2

M1<-M^2

mod.RPM3<-lm(WINS~(GP1)+(M1)+ORPM+DRPM)

summary(mod.RPM3)

plot(mod.RPM3$fitted.values,mod.RPM3$residuals)


qqnorm(mod.RPM3$residuals)

#validate the final model by using cross validation

set.seed(5)

#obtain total sample size
n<-dim(RPM1)[1]
Group1.index<-sample(1:n,round(n/2),replace=F)
Group2.index<-setdiff(1:n,Group1.index)
Group1<-RPM1[Group1.index,]
Group2<-RPM1[Group2.index,]

#Fit a linear model on Group1 and a separate one on Group2
mod.Group1<-lm(WINS~GP+M+ORPM+DRPM,data=Group1)
mod.Group2<-lm(WINS~GP+M+ORPM+DRPM,data=Group2)

###Compute fitted values on Group2 using the model fit on Group1
fitted.Group2<-NULL
for (i in 1:dim(Group2)[1]){
  fitted.Group2<-c(fitted.Group2,
    (mod.Group1$coef[1]
     +mod.Group1$coef[2]*Group2$GP[i]
     +mod.Group1$coef[3]*Group2$M[i]
     +mod.Group1$coef[4]*Group2$ORPM[i]
     +mod.Group1$coef[5]*Group2$DRPM[i]))
}

##Now compute R^2 comparing these Group2 fitted values to the Group2 y's, using 1 - (SSE/SSTo)
rsquared.Group2 <- 1 - sum((Group2$WINS-fitted.Group2)^2)/sum((Group2$WINS-mean(Group2$WINS))^2)
rsquared.Group2

###Compute fitted values on Group1 using the model fit on Group2
fitted.Group1<-NULL
for (i in 1:dim(Group1)[1]){
  fitted.Group1<-c(fitted.Group1,
    (mod.Group2$coef[1]
     +mod.Group2$coef[2]*Group1$GP[i]
     +mod.Group2$coef[3]*Group1$M[i]
     +mod.Group2$coef[4]*Group1$ORPM[i]
     +mod.Group2$coef[5]*Group1$DRPM[i]))
}

##Now compute R^2 comparing these Group1 fitted values to the Group1 y's, using 1 - (SSE/SSTo)
rsquared.Group1 <- 1 - sum((Group1$WINS-fitted.Group1)^2)/sum((Group1$WINS-mean(Group1$WINS))^2)
rsquared.Group1

###Compute the mean of both R^2 values
mean(c(rsquared.Group2,rsquared.Group1))

#A confidence interval for a fitted value

predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="confidence",level=0.95)

#A prediction interval for a fitted value

predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="prediction",level=0.95)

# A confidence interval for one or more slope parameters is calculated manually from summary(mod.RPM)