Get more out of IBM SPSS Statistics 11.11.14: Predicting with logistic regression...

IBM SPSS presentation, Amsterdam, 11th November 2014. Drs. Ing. J.A.C.M. Smit (Jan), Director of STATSCONsult, based in Drunen NL. 11/11/2014 STATSCONsult, Logistic Regression, IBM SPSS presentation


Page 1: Get more out of IBM SPSS Statistics 11.11.14: Predicting with logistic regression (STATSCONsult)

IBM SPSS presentation

Amsterdam, 11th November 2014

Drs. Ing. J.A.C.M. Smit (Jan)

Director of STATSCONsult, based in Drunen NL

Page 2:

STATSCONsult

Support, marketing, and sales of software products for statistical analysis

Courses in statistics

Consultancy in data analysis

Jan Smit worked for SPSS from 1984 until 1989.

Page 3:

STATSCONsult Consultancy

SPSS Intro courses

SPSS assistance in data analyses

SPSS advanced courses

SPSS Risk Analyses (including Weight of Evidence)

Page 4:

Examples of Logistic Regression

We wish to model the likelihood of an event, which depends on a number of factors (predictors):

◦ To predict whether a patient has (or will have) a given disease

◦ To predict a customer's propensity to purchase an appliance (a TV)

◦ To predict passing an exam

◦ To predict paying back a loan in full

◦ Risk analysis is done with logistic regression

Page 5:

What are the assumptions of using Logistic Regression?

The predictors are not highly correlated with each other (no severe multicollinearity)

A continuous predictor should have a monotonically decreasing (or increasing) relationship with the probability of the dependent variable in the data

We obtain a (model + error); the residuals (= error) should not dominate

The model should be interpretable, easy to use, and useful for forecasting
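The multicollinearity assumption can be checked before fitting. A minimal sketch in Python (not part of the original slides; the predictors and the 0,9 threshold are made up for illustration), where one hypothetical predictor nearly duplicates another:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors: x3 is almost a copy of x1 (severe multicollinearity).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)  # pairwise correlation matrix

# Flag predictor pairs whose absolute correlation exceeds a chosen threshold.
threshold = 0.9
flagged = [(i, j) for i in range(3) for j in range(i + 1, 3)
           if abs(corr[i, j]) > threshold]
print(flagged)  # only the (x1, x3) pair should be flagged
```

In practice one would also inspect variance inflation factors, since multicollinearity can involve more than two predictors at once.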

Page 6:

Logistic regression, application

We analyse the effect of a number of independent predictors (x1, x2, .., xn) on a dependent variable Y, where Y ∈ {0, 1}

Covariates are predictors for which we wish to correct (such as age)

Predictors can be continuous, nominal, or ordinal

◦ Independent variables can be continuous (e.g. Age)

◦ or ordinal/nominal (e.g. Level of Education)

Page 7:

Data

1500 observations

We wish to have a model of Previous Default (Y=1)

From now on we call Previous Default (the risk of not paying off the bank loan) "Risk"

We interpret the model and use it for prediction.

The model is based on the predictors (Age, .., Household Income)

548 observations have Risk (Y=1) in our data set

Here 90% of the observations are used for the model; the remaining observations are used for prediction.

Page 8:

What are my odds? We cannot use ordinary linear regression, though we use much of the theory of linear regression.

In logistic regression our model is:

◦ log(P(y=1)/P(y=0)) = a + b1*x1 + b2*x2 + .. + bn*xn

◦ Linear regression: Y = a + b1*x1 + b2*x2 + .. + bn*xn (nearly the same)

◦ Odds: P(y=1), P(y=0) and P(y=1)/P(y=0); "my odds are 2 to 1" means P(y=1)/P(y=0) = 2

Log(odds) makes the statistics possible:

◦ P = 2/3: odds = 2; log(2) = 0,69

◦ P = 1/2: odds = 1; log(1) = 0

A coefficient is the change in the log odds when the other factors are held fixed.

Sometimes I have the odds against, or odds on, or odds even.
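The odds arithmetic on this slide can be verified directly; a small sketch (illustration only, not from the slides):

```python
import math

def odds(p):
    """Convert a probability P(y=1) into odds P(y=1)/P(y=0)."""
    return p / (1 - p)

# P = 2/3: odds = 2, log(2) = 0,69
print(odds(2 / 3), round(math.log(2), 2))

# P = 1/2: odds = 1, log(1) = 0
print(odds(0.5), math.log(1))
```

The log transform is what turns the multiplicative odds scale into the additive scale on which the linear model a + b1*x1 + .. + bn*xn lives.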

Page 9:

Bankloan data

We wish to model the probability of a bank loan being paid back in full. When Risk = 1, the loan was in the end not repaid to the bank.

Y : Risk{1=yes, 0=no}

X : A number of factors that may affect Y

Age in years                     age

Level of education               ed

Years with current employer      employ

Years at current address         address

Household income in thousands    income

Debt-to-income ratio (x100)      debtinc

Credit card debt in thousands    creddebt

Other debt in thousands          othdebt

Page 10:

Make groups via visual binning
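Visual Binning is an SPSS dialog; an equivalent grouping step can be sketched in Python with numpy (the cut points below are made up for illustration, in SPSS they would come from the Visual Binning dialog):

```python
import numpy as np

ages = np.array([19, 24, 31, 38, 45, 52, 59])

# Hypothetical cut points for the age variable.
cut_points = [30, 40, 50]

# np.digitize assigns each age to a bin:
# 0 = under 30, 1 = 30-39, 2 = 40-49, 3 = 50 and over
age_group = np.digitize(ages, cut_points)
print(age_group)  # [0 0 1 1 2 3 3]
```

Binning a continuous predictor like this makes it easy to tabulate the observed risk per group, as the next slide does.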

Page 11:

The odds of risk decrease with higher values of age

Page 12:

Dependency of Risk on Age

Page 13:

AND IN FORMULAS

We read: Log Odds = log(P(Risk=1)/P(Risk=0)) = Constant + B * age = 1,250 - 0,055 * age

For age=20: LogOdds = 1,25 - 1,10 = 0,15

For age=30: LogOdds = 1,25 - 1,65 = -0,40

For age=40: LogOdds = 1,25 - 2,20 = -0,95

If age = 22,7 then LogOdds = 0

According to the model:

For age=20: Odds = exp(0,15) = 1,16

For age=30: Odds = exp(-0,40) = 0,67

For age=40: Odds = exp(-0,95) = 0,39

Probability:

For age=20: P(Y=1) = 0,54

For age=30: P(Y=1) = 0,40

For age=40: P(Y=1) = 0,28

We conclude that Age can be used as a predictor for Risk (as Sig.-p < 0,05)
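The numbers on this slide follow mechanically from the fitted equation; a sketch that reproduces them (the coefficients 1,250 and -0,055 are taken from the slide):

```python
import math

def log_odds(age):
    # Fitted model from the slide: LogOdds = 1,250 - 0,055 * age
    return 1.250 - 0.055 * age

def probability(age):
    # P(Y=1) = odds / (1 + odds), with odds = exp(log odds)
    o = math.exp(log_odds(age))
    return o / (1 + o)

for age in (20, 30, 40):
    print(age,
          round(log_odds(age), 2),
          round(math.exp(log_odds(age)), 2),
          round(probability(age), 2))
```

Each row prints the log odds, odds, and probability for one age, matching the three columns of calculations above.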

Page 14:

Usage of dialog in IBM SPSS:

Page 15:

Specify categorical predictors

Page 16:

Output (2) from the initial stage (all main effects)

The -2 log likelihood is the leading fit statistic.

Page 17:

Output of predictors and effect on Risk

Page 18:

What predictors can we use in the model to estimate "Risk"?

If the Sig.-p < 0,05 for a predictor, we may conclude that this predictor has an effect on the dependent variable (a significant effect).

If the Sig.-p > 0,05 for a predictor, we may conclude that we are uncertain whether this predictor has an effect on the dependent variable.

Watch out for pitfalls (remove a variable that has no effect, and re-estimate).

Page 19:

Modelling

Using Backwards (LR), at each step we re-estimate the model, leaving out a non-significant predictor:

After this step, only the variables age, employ, debtinc, and creddebt are significant

Note that correlations between predictors may affect the order of inclusion in the model (employ and address are highly correlated)
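SPSS's Backward (LR) drops predictors using likelihood-ratio tests; the sketch below is not that exact algorithm, but it illustrates the same loop on made-up data, using Wald z statistics from a plain-numpy Newton-Raphson fit (all variable names and the 1,96 cutoff are illustrative):

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Fit logistic regression by Newton-Raphson; return coefficients and Wald z."""
    Xd = np.column_stack([np.ones(len(y)), X])   # prepend an intercept column
    b = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xd @ b))
        H = Xd.T @ (Xd * (p * (1 - p))[:, None])  # observed information matrix
        b = b + np.linalg.solve(H, Xd.T @ (y - p))
    p = 1 / (1 + np.exp(-Xd @ b))
    cov = np.linalg.inv(Xd.T @ (Xd * (p * (1 - p))[:, None]))
    se = np.sqrt(np.diag(cov))
    return b, b / se                              # Wald z = coefficient / std. error

# Synthetic data: only x1 and x2 truly drive the outcome; x3 and x4 are noise.
rng = np.random.default_rng(1)
n = 500
data = {name: rng.normal(size=n) for name in ["x1", "x2", "x3", "x4"]}
logit = 1.0 * data["x1"] - 1.0 * data["x2"]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(float)

# Backward elimination: repeatedly drop the weakest non-significant predictor.
kept = ["x1", "x2", "x3", "x4"]
while True:
    X = np.column_stack([data[name] for name in kept])
    _, z = fit_logistic(X, y)
    z_pred = np.abs(z[1:])                        # skip the intercept
    weakest = int(np.argmin(z_pred))
    if z_pred[weakest] >= 1.96 or len(kept) == 1:
        break                                     # all remaining predictors significant
    kept.pop(weakest)

print(kept)  # the informative predictors x1 and x2 should survive
```

Because each step refits the model, correlated predictors (like employ and address on the slide) can trade places: removing one can make the other significant.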

Page 20:

After Backward Deletion

Page 21:

Interpretation of the model

If the coefficient of a predictor is < 0, the odds decrease for larger values.

Large coefficients (positive or negative) are more important (they go with large Wald statistics and small Sig.-p values).

Here, people with

1. a short period at the current employer (Change), and

2. high credit card debts (Expenders), and

3. high debt-to-income ratios (Have Fun), and

4. low ages (Young)

show high risk.

Page 22:

Classification on the data in the model:

If we adjust the cut value away from the default 0,5, the counts in the Predicted Yes column change: a lower cut value classifies more observations as Yes, a higher cut value fewer.

Page 23:

Model expression

The model is:

Log(P(Risk=1)/P(Risk=0)) = -0,133 (constant)

- 0,213 * employ (range 0 to 50)

+ 0,483 * creddebt (range 0 to 36)

+ 0,102 * debtinc (range 0 to 40)

- 0,040 * age (range 18 to 60)

If this expression is > 0, the probability is > 0,5
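Plugging values into the slide's expression shows the decision rule at work; the two applicants below are hypothetical examples, not cases from the data set:

```python
import math

def risk_logit(employ, creddebt, debtinc, age):
    # Model expression from the slide (coefficients as reported there).
    return -0.133 - 0.213 * employ + 0.483 * creddebt + 0.102 * debtinc - 0.040 * age

def risk_probability(**kw):
    # Logistic transform: expression > 0 is equivalent to probability > 0,5.
    return 1 / (1 + math.exp(-risk_logit(**kw)))

# Hypothetical applicants (values made up for illustration):
young_spender = dict(employ=2, creddebt=3.0, debtinc=20, age=25)
settled = dict(employ=20, creddebt=0.5, debtinc=5, age=45)

for name, person in [("young_spender", young_spender), ("settled", settled)]:
    p = risk_probability(**person)
    print(name, round(p, 3), "Risk" if p > 0.5 else "No Risk")
```

The first applicant's expression is positive (probability above 0,5, classified as Risk); the second's is strongly negative (classified as No Risk).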

Page 24:

Prediction

Prediction is rather good (102 out of 133)

We make use of the model and apply it to the remaining observations that were not included in the model.

65 + 15 were formally classified as "No Risk"

65 + 16 are selected by the model as "No Risk"

We are able to change the cut-off value of 0,5

Page 25:

Comparison of Classification Trees and Logistic Regression

If the number of variables is high, the result of LR is still simple; CT output becomes large and complex.

CT finds interactions (segments) with the highest P. With LR, segments are determined with high probability.

Page 26:

Questions

Jan Smit

[email protected]

+31 416 378 125

http://www.statsconsult.nl/
