Stepwise Logistic Regression - Lecture for Students /Faculty of Mathematics and Informatics

Post on 14-Jun-2015


© Experian Limited 2007. All rights reserved. Experian and the marks used herein are service marks or registered trademarks of Experian Limited. Other product and company names mentioned herein may be the trademarks of their respective owners. No part of this copyrighted work may be reproduced, modified, or distributed in any form or manner without the prior written permission of Experian Limited.Confidential and proprietary.

Stepwise Logistic Regression

Lecture for FMI Students, 27.05.2010

Alexander Efremov


Agenda

Introduction
  • Applications of the Logistic Regression
  • System Identification & Stepwise Regression

Part I. Logistic Regression Model Development
  • Logistic Model
  • Maximum Likelihood Estimator
  • Potential Problems
  • Model Analysis and Validation

Part II. Stepwise Logistic Regression (SWR)
  • Basic Idea
  • SWR Algorithm
  • Potential Problems

Summary


Introduction: Applications of the Logistic Regression

Medicine – diagnostics, modeling of disease growth, treatment effect

Psychology – learning process modeling, evaluation of psychological tests

Economics – risk analysis, investigation of country debt, occupational choices

Marketing – product consumption, effect of retailers' actions

Criminology – risk factors for committing a criminal act

Sociology – employment, graduation, vote analysis

Ecology – modeling population growth

Linguistics – language changes

Chemistry – reaction models

Media – news effects, copycat reaction

Finance – credit scoring, fraud detection

Physics, Biology, etc.

The Logistic Model


Introduction: System Under Investigation

Individuals (raw data) => System => Model


Introduction: System Identification Stages


Part I. Logistic Regression Model Development: Logistic Model

[Figure: two panels, linear relation and logistic relation]


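Part I. Logistic Regression Model Development: Logistic Model

Logistic Relation – General Form

$$\hat{y}_k = \frac{e^{M_k}}{1 + e^{M_k}} = \frac{1}{1 + e^{-M_k}}$$

"Linear" Logistic Regression Model

$$M_k = \theta_0 + \theta_1 x_{1,k} + \dots + \theta_n x_{n,k}$$

$$\hat{y}_k = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_{1,k} + \dots + \theta_n x_{n,k})}}$$

$$\ln\frac{\hat{y}_k}{1 - \hat{y}_k} = \theta_0 + \theta_1 x_{1,k} + \dots + \theta_n x_{n,k}$$

Notation

• $k$ – index of the current individual, $k = 1, \dots, N$
• $N$ – number of observations
• $y_k$ – dependent variable (probability of good)
• $\hat{y}_k$ – model output (predicted probability of good)
• $\theta_0$ – intercept
• $\theta_i$ – the $(i+1)$-th model parameter
• $x_{i,k}$ – the $i$-th independent variable, $i = 1, \dots, n$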
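The logistic relation and its inverse (the log-odds) can be checked numerically. The following is a minimal sketch in plain Python; the function names and the parameter and observation values are illustrative assumptions, not taken from the lecture.

```python
import math

def linear_predictor(theta, x):
    """M_k = theta_0 + theta_1*x_1 + ... + theta_n*x_n for one individual."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

def logistic(m):
    """Logistic relation 1 / (1 + exp(-m)), arranged to avoid overflow."""
    if m >= 0:
        return 1.0 / (1.0 + math.exp(-m))
    z = math.exp(m)
    return z / (1.0 + z)

def logit(p):
    """Inverse of the logistic relation: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

theta = [-1.0, 0.5, 2.0]   # illustrative parameters (assumption)
x_k = [2.0, 0.25]          # illustrative observation (assumption)
m_k = linear_predictor(theta, x_k)
y_hat_k = logistic(m_k)    # predicted probability for this individual
```

Applying `logit` to `y_hat_k` recovers `m_k` up to rounding, which is the sense in which the model is linear in the log-odds.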


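Part I. Logistic Regression Model Development: Logistic Model

Notation

• Parameters vector: $\theta = [\theta_0 \ \theta_1 \ \dots \ \theta_n]^T \in \mathbb{R}^{n+1}$
• Regression vector: $\varphi_k = [1 \ x_{1,k} \ \dots \ x_{n,k}]^T \in \mathbb{R}^{n+1}$
• Logistic model:

$$\hat{y}_k = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_{1,k} + \dots + \theta_n x_{n,k})}} = \frac{1}{1 + e^{-\varphi_k^T \theta}}$$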


Part I. Logistic Regression Model Development: Residual

The Residual

$$y_k = \frac{1}{1 + e^{-\varphi_k^T \theta}} + e_k = \hat{y}_k + e_k$$

$$e_k = y_k - \hat{y}_k = \begin{cases} 1 - \hat{y}_k, & \text{for } y_k = 1 \\ -\hat{y}_k, & \text{for } y_k = 0 \end{cases}$$

Sources of Uncertainty

• Unavailable significant factors
• Simplified relations
• Time-varying performance
• Database errors
• Fraud


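Part I. Logistic Regression Model Development: Maximum Likelihood Estimator

Cost Function

• Model output: $\hat{y}_k = P(y_k = 1 \mid \varphi_k)$
• Likelihood contribution: $l_{\theta,k} = \hat{y}_k^{\,y_k} (1 - \hat{y}_k)^{1 - y_k}$
• Likelihood function: $L_\theta = \prod_{k=1}^{N} l_{\theta,k}$
• Log-likelihood function: $\ln L_\theta = \sum_{k=1}^{N} \left( y_k \ln \hat{y}_k + (1 - y_k) \ln(1 - \hat{y}_k) \right)$

Maximum Likelihood Criterion

$$\max_\theta \ln L_\theta \iff \min_\theta \, (-2 \ln L_\theta)$$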
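The log-likelihood sum is straightforward to evaluate once the model outputs are known. Below is a small sketch in plain Python; the clipping constant `eps` is an added safeguard against `log(0)` and is an assumption, as are the toy outcomes and predictions.

```python
import math

def log_likelihood(y, y_hat, eps=1e-12):
    """ln L = sum_k [ y_k ln(y_hat_k) + (1 - y_k) ln(1 - y_hat_k) ]."""
    total = 0.0
    for yk, pk in zip(y, y_hat):
        pk = min(max(pk, eps), 1.0 - eps)  # keep probabilities away from 0 and 1
        total += yk * math.log(pk) + (1 - yk) * math.log(1.0 - pk)
    return total

y = [1, 0, 1, 1]               # toy observed outcomes (assumption)
y_hat = [0.9, 0.2, 0.6, 0.5]   # toy model outputs (assumption)
ln_L = log_likelihood(y, y_hat)
cost = -2.0 * ln_L             # the quantity minimised by the ML criterion
```

Because each likelihood contribution is at most 1, `ln_L` is never positive and `cost` is never negative; a better fit moves `cost` toward zero.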


Part I. Logistic Regression Model Development: Maximum Likelihood Estimator

[Figure: cost function (−2 ln L) for a real-life case]


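Part I. Logistic Regression Model Development: Maximum Likelihood Estimator

Estimates Update

$$\hat\theta^{(i+1)} = \hat\theta^{(i)} + \Delta\theta^{(i)}, \qquad \Delta\theta^{(i)} = \ ?$$

Taylor Series Expansion

$$f^{(i)}_{\hat\theta + \Delta\theta} = f^{(i)}_{\hat\theta} + g^{(i)T} \Delta\theta^{(i)} + \tfrac{1}{2} \Delta\theta^{(i)T} H^{(i)} \Delta\theta^{(i)} + O\!\left(\|\Delta\theta^{(i)}\|^3\right)$$

• Cost function: $f^{(i)}_{\hat\theta} = -\ln L^{(i)}_{\hat\theta}$
• Gradient: $g^{(i)} = \nabla f^{(i)}_{\hat\theta}$
• Hessian: $H^{(i)} = \nabla^2 f^{(i)}_{\hat\theta}$

Cost Function Models

• Linear model: $M^{(i)}(\Delta\theta^{(i)}) = f^{(i)}_{\hat\theta} + g^{(i)T} \Delta\theta^{(i)}$
• Quadratic model: $M^{(i)}(\Delta\theta^{(i)}) = f^{(i)}_{\hat\theta} + g^{(i)T} \Delta\theta^{(i)} + \tfrac{1}{2} \Delta\theta^{(i)T} H^{(i)} \Delta\theta^{(i)}$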
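A short worked step shows how the quadratic model answers the question of what the update should be. It assumes that the Hessian $H^{(i)}$ is positive definite, which the slide does not state. Setting the derivative of the quadratic model with respect to the step to zero gives the Newton direction:

```latex
\frac{\partial M^{(i)}}{\partial \Delta\theta^{(i)}}
  = g^{(i)} + H^{(i)} \Delta\theta^{(i)} = 0
\quad\Longrightarrow\quad
\Delta\theta^{(i)} = -\left(H^{(i)}\right)^{-1} g^{(i)}
```

The linear model has no finite minimiser, so it only supplies a descent direction, $-g^{(i)}$; a step length $\alpha > 0$ then has to be chosen separately, giving $\Delta\theta^{(i)} = -\alpha\, g^{(i)}$.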


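Part I. Logistic Regression Model Development: Maximum Likelihood Estimator

Gradient

$$g = \left[ \frac{\partial f}{\partial \theta_0} \ \ \frac{\partial f}{\partial \theta_1} \ \ \dots \ \ \frac{\partial f}{\partial \theta_n} \right]^T \in \mathbb{R}^{n+1}$$

Hessian

$$H = \begin{bmatrix}
\dfrac{\partial^2 f}{\partial \theta_0^2} & \dfrac{\partial^2 f}{\partial \theta_0 \partial \theta_1} & \cdots & \dfrac{\partial^2 f}{\partial \theta_0 \partial \theta_n} \\
\dfrac{\partial^2 f}{\partial \theta_1 \partial \theta_0} & \dfrac{\partial^2 f}{\partial \theta_1^2} & \cdots & \dfrac{\partial^2 f}{\partial \theta_1 \partial \theta_n} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial \theta_n \partial \theta_0} & \dfrac{\partial^2 f}{\partial \theta_n \partial \theta_1} & \cdots & \dfrac{\partial^2 f}{\partial \theta_n^2}
\end{bmatrix} \in \mathbb{R}^{(n+1) \times (n+1)}$$

First-Order Methods (e.g. Steepest Descent): $\Delta\theta = -\alpha g$

Second-Order Methods (e.g. Newton-Raphson): $\Delta\theta = -\alpha H^{-1} g$

[Figure: two panels, one per method, showing iteration steps 1 and 2 from θ(0) toward θ* and θopt]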
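For the logistic model these derivatives have a closed form. As a sketch, assuming the cost $f = -\ln L_\theta$ and the model output $\hat{y}_k = 1/(1 + e^{-\varphi_k^T\theta})$ defined earlier in the lecture, differentiating the log-likelihood gives:

```latex
g = -\sum_{k=1}^{N} (y_k - \hat{y}_k)\, \varphi_k ,
\qquad
H = \sum_{k=1}^{N} \hat{y}_k (1 - \hat{y}_k)\, \varphi_k \varphi_k^T
```

These are standard results for the logistic log-likelihood rather than formulas shown on the slide.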


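Part I. Logistic Regression Model Development: Maximum Likelihood Estimator

• Steepest Descent: $\Delta\theta = -\alpha g$
• Newton-Raphson (NR): $\Delta\theta = -\alpha H^{-1} g$
• NR with Line Search: $\Delta\theta = -\alpha^* H^{-1} g$
• NR with Quadratic Interpolation: $\Delta\theta = -\alpha^* H^{-1} g$

[Figure: four panels, one per method, each showing iteration steps 1 and 2 from θ(0) toward θ* and θopt]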
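The Newton-Raphson update can be sketched in plain Python. This is a minimal illustration, not the lecture's implementation: it assumes the cost f = −ln L, a fixed step α = 1, a zero starting point θ(0), and the standard closed-form gradient and Hessian of the logistic log-likelihood. The names `solve` and `newton_raphson` and the toy data are hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def newton_raphson(X, y, alpha=1.0, tol=1e-10, max_iter=50):
    """Minimise f = -ln L with theta <- theta - alpha * H^{-1} g."""
    phi = [[1.0] + list(row) for row in X]   # regression vectors [1, x_1, ..., x_n]
    m = len(phi[0])
    theta = [0.0] * m                        # assumed starting point theta^(0) = 0
    for _ in range(max_iter):
        g = [0.0] * m
        H = [[0.0] * m for _ in range(m)]
        for p, yk in zip(phi, y):
            eta = sum(t * v for t, v in zip(theta, p))
            y_hat = 1.0 / (1.0 + math.exp(-eta))
            w = y_hat * (1.0 - y_hat)
            for a in range(m):
                g[a] -= (yk - y_hat) * p[a]          # g = -sum (y - y_hat) phi
                for b in range(m):
                    H[a][b] += w * p[a] * p[b]       # H = sum w phi phi^T
        step = solve(H, g)                           # H^{-1} g without explicit inversion
        theta = [t - alpha * s for t, s in zip(theta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return theta, g, H

# Toy, non-separable data (assumption): one independent variable.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 1, 0, 1, 0, 1, 1]
theta_hat, g_last, H_last = newton_raphson(X, y)
```

Solving the linear system instead of forming H⁻¹ explicitly is a common design choice. The returned gradient and Hessian are those evaluated just before the final update.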


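Part I. Logistic Regression Model Development: Potential Problems

Numerical Problems

• Matrix inversion in the update $\hat\theta^{(i+1)} = \hat\theta^{(i)} - \alpha H^{-1} g$, hence SVD, EVD, QR, etc.

Local Minima

[Figure: cost function −2 ln L]

Model Overfitting

[Figure: y_k and two model outputs, y_{1,k} and y_{2,k}, plotted against k]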


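Part I. Logistic Regression Model Development: Frequently Used Statistics for Model Analysis

Individual Estimate Measures

• Standard error: $\sigma_{\hat\theta_i} = \sqrt{[\operatorname{diag}(H^{-1})]_i}$
• Wald statistic: $W_i = \left( \dfrac{\hat\theta_i}{\sigma_{\hat\theta_i}} \right)^2 = \dfrac{\hat\theta_i^2}{\sigma^2_{\hat\theta_i}} \sim \chi^2_1$
• p-value: $\Pr(\chi^2_1 > W_i)$

[Figure: p-value shown as the tail of the χ² distribution beyond W_i]

Overall Model Measures

• Coefficient of determination (R²)
  • generalized R²: $R^2 = 1 - e^{\frac{2}{N} \left( \ln L_{\hat\theta_0} - \ln L_{\hat\theta} \right)}$
  • generalized max-rescaled R²: $R_m^2 = \dfrac{R^2}{R_s^2}$, with $R_s^2 = 1 - e^{\frac{2}{N} \ln L_{\hat\theta_0}}$
• Cost function: $f^{(i)}_{\hat\theta} = -2 \ln L^{(i)}_{\hat\theta}$

Here $L_{\hat\theta_0}$ denotes the likelihood of the intercept-only model.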
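The Wald statistic and its p-value need only the estimate and its standard error. The sketch below uses the identity Pr(χ²₁ > w) = erfc(√(w/2)), which holds because a χ² variable with one degree of freedom is the square of a standard normal variable. The function name and the numeric inputs are illustrative assumptions.

```python
import math

def wald_test(theta_i, se_i):
    """Return (W_i, p-value) for one estimate and its standard error."""
    w = (theta_i / se_i) ** 2
    p_value = math.erfc(math.sqrt(w / 2.0))   # Pr(chi2 with 1 d.o.f. > w)
    return w, p_value

# Illustrative numbers (assumption): estimate 0.8 with standard error 0.3.
w_i, p_i = wald_test(0.8, 0.3)
```

A small p-value suggests that the estimate is unlikely to differ from zero by chance alone, which is the basis on which a variable is kept in the model.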


Part I. Logistic Regression Model Development: Frequently Used Statistics for Model Analysis

Modified criteria

• Akaike Information Criterion (AIC): $AIC_{\hat\theta} = -2 \ln L_{\hat\theta} + 2n$
• Schwarz Criterion (SC): $SC_{\hat\theta} = -2 \ln L_{\hat\theta} + n \ln(N - 1)$
• Minimum Description Length (MDL), Final Prediction Error (FPE), etc.

[Figure: AIC and −2 ln L curves]

Model Validation

• Data split into development and validation samples
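Penalised criteria of this kind add a complexity term to −2 ln L, so a larger model must improve the fit enough to pay for its extra parameters. The sketch below uses the widely quoted textbook penalties, 2p for AIC and p·ln N for SC, with p the number of estimated parameters. This is an assumption and may not match the slide's penalty terms exactly. The function names and numbers are illustrative.

```python
import math

def aic(ln_l, n_params):
    """Akaike Information Criterion: -2 ln L + 2 p (textbook form)."""
    return -2.0 * ln_l + 2.0 * n_params

def schwarz(ln_l, n_params, n_obs):
    """Schwarz Criterion: -2 ln L + p ln N (textbook form)."""
    return -2.0 * ln_l + n_params * math.log(n_obs)

# Illustrative comparison (assumption): a 3-parameter and a 6-parameter model
# fitted to the same 500 observations.
small = {"ln_l": -250.0, "p": 3}
large = {"ln_l": -248.5, "p": 6}
prefer_small_by_aic = aic(small["ln_l"], small["p"]) < aic(large["ln_l"], large["p"])
prefer_small_by_sc = (schwarz(small["ln_l"], small["p"], 500)
                      < schwarz(large["ln_l"], large["p"], 500))
```

In this toy comparison the larger model lowers −2 ln L by only 3 while adding three parameters, so both criteria favour the smaller model.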


Part II. Stepwise Logistic Regression: Basic Idea

$x_o$, $x_e$ – sets of all variables outside / entered in the model

$x_{oi}$, $x_{ei}$ – the most significant variable in $x_o$ / the least significant variable in $x_e$

SLE – Significance Level to Enter

SLS – Significance Level to Stay

[Figure: SWR diagram]

Part II. Stepwise Logistic Regression: Basic Idea

[Figure sequence: Available information; Initialization (1); Forward Selection (1, 2); Forward Selection (1, 2, 3); Backward Elimination (2, 3)]


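Part II. Stepwise Logistic Regression: Step 0. Initialization

Logistic model: $\hat{y}_k = \dfrac{1}{1 + e^{-\varphi_k^T \theta}}$

1. Intercept Model: $\theta \in \mathbb{R}$, $\varphi_k = 1$
2. Full model: $\theta \in \mathbb{R}^{n+1}$, $\varphi_k = [1 \ x_{1,k} \ \dots \ x_{n,k}]^T$
3. One Factor Model
   • Check for Enter
     • Score Chi-Square for all potential models: $S_i = g_i^T H_i^{-1} g_i$
     • Maximum Score Chi-Square: $\ell_1 = \arg\max_i S_i$
     • p-value & threshold: $\text{p-value}_{\ell_1} < \text{SLE}$
   • Model Determination (Optimization): $\theta \in \mathbb{R}^2$, $\varphi_k = [1 \ x_{\ell_1,k}]^T$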


Part II. Stepwise Logistic Regression: Step 1. Forward Selection

1. Check for Enter
   • Score Chi-Square of all potential models: $S_i = g_i^T H_i^{-1} g_i$
   • Maximum Score Chi-Square: $\ell_i = \arg\max_i S_i$
   • p-value & threshold: $\text{p-value}_{\ell_i} < \text{SLE}$
2. Model Determination (Optimization): $\theta \in \mathbb{R}^{i+1}$, $\varphi_k = [1 \ x_{\ell_1,k} \ \dots \ x_{\ell_i,k}]^T$
3. Statistics for Model Analysis
   • Individual Estimate Measures
     • standard error
     • Wald statistic & p-value


Part II. Stepwise Logistic Regression: Step 1. Forward Selection

3. Statistics for Model Analysis (part 2)
   • Overall Model Measures
     • Coefficients of determination
     • Cost function
   • Modified criteria
     • Akaike Information Criterion (AIC)
     • Schwarz Criterion (SC)


Part II. Stepwise Logistic Regression: Step 2. Backward Elimination

1. Check for Leave
   • Wald statistic & p-value of all potential models
   • p-value & threshold: $\max_i \text{p-value}_{\ell_i} > \text{SLS}$
2. Model Determination (Optimization): $\theta \in \mathbb{R}^{i}$, $\varphi_k = [1 \ x_{\ell_1,k} \ \dots \ x_{\ell_{j-1},k} \ x_{\ell_{j+1},k} \ \dots \ x_{\ell_i,k}]^T$
3. Statistics for Model Analysis
   • Individual Estimate Measures
     • standard error
     • Wald statistic & p-value


Part II. Stepwise Logistic Regression: Step 2. Backward Elimination

3. Statistics for Model Analysis (part 2)
   • Overall Model Measures
     • Coefficients of determination
     • Cost function
   • Modified criteria
     • Akaike Information Criterion (AIC)
     • Schwarz Criterion (SC)
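The alternation of forward selection and backward elimination can be sketched as a control loop. This is only an outline of the flow, under several assumptions: the two p-value functions are hypothetical callables supplied by the caller (in practice they would come from score and Wald tests on fitted models), the most significant candidate is chosen by smallest p-value, the default thresholds of 0.05 are arbitrary, and the loop stops if the variable that was just entered is removed again.

```python
def stepwise_select(candidates, entry_pvalue, stay_pvalues,
                    sle=0.05, sls=0.05, max_steps=100):
    """Alternate forward selection and backward elimination.

    entry_pvalue(entered, var) -> p-value for adding `var` to the model
    stay_pvalues(entered)      -> {var: p-value} for variables in the model
    Both callables are hypothetical placeholders.
    """
    entered, out = [], list(candidates)
    for _ in range(max_steps):
        if not out:
            break
        # Forward selection: most significant variable outside the model.
        p_in = {v: entry_pvalue(entered, v) for v in out}
        best = min(p_in, key=p_in.get)
        if p_in[best] >= sle:
            break                      # nothing qualifies to enter
        entered.append(best)
        out.remove(best)
        # Backward elimination: drop the least significant variable
        # while its p-value exceeds the stay threshold.
        dropped_newcomer = False
        while entered:
            p_stay = stay_pvalues(entered)
            worst = max(p_stay, key=p_stay.get)
            if p_stay[worst] <= sls:
                break
            entered.remove(worst)
            out.append(worst)
            dropped_newcomer = dropped_newcomer or worst == best
        if dropped_newcomer:
            break                      # avoid cycling on the same variable
    return entered
```

Passing the tests as callables keeps the selection logic separate from model fitting, so the same loop can be exercised with toy p-values before it is wired to a real estimator.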


Part II. Stepwise Logistic Regression: Potential Problems in the Stepwise Regression

Local Minima & Initial Conditions

Numerical Problems /SVD, EVD, QR, etc./

Model Overfitting


Summary

Introduction
  • Applications of the Logistic Regression
  • System Identification & Stepwise Regression

Part I. Logistic Regression Model Development
  • Logistic Model
  • Maximum Likelihood Estimator
  • Potential Problems
  • Model Analysis and Validation

Part II. Stepwise Logistic Regression (SWR)
  • Basic Idea
  • SWR Algorithm
  • Potential Problems

Summary


Thank You!

http://anp.tu-sofia.bg/aefremov/index.htm