Michael Friendly - YorkU Math and Stats · SPIDA 2004 2 Michael Friendly. Review of Linear Models...

Review of Linear Models and Model Building Strategies
Michael Friendly, York University, <[email protected]>
SPIDA, June 2004

[Title-slide figures: "X1 * X2 Interaction" response surface, Y = 2*X1 + -1*X2 + 0.20*X1*X2; "SLID: wgsal42c, Wages and salaries, 1994", boxplots of Wages and salaries - 1994 (Std.) by Power (-1/X, -1/Sqrt, Log, Sqrt, Raw); "Box-Cox Power Transform for Salary", Root Mean Squared Error vs. Box-Cox Power (λ)]

Lecture Outline
• Introduction
  - General Linear Models: Overview
  - SAS macros for statistical graphics
• Exploring and transforming data
  - Transformations to symmetry
  - Transformations to linearity - resistant lines, Box-Cox
  - Dealing with heteroscedasticity
• Fitting and understanding linear models
  - Fitting linear models with SAS
  - Model diagnosis: Leverage and Influence
  - Visualizing influence: Partial regression plots
  - Model selection

Resources
• SAS macro programs, from SAS System for Statistical Graphics (Friendly, 1991):
    http://www.math.yorku.ca/SCS/sssg/
    http://www.math.yorku.ca/SCS/sasmac/
• SAS macro programs, from Visualizing Categorical Data (Friendly, 2000):
    http://www.math.yorku.ca/SCS/vcd/
• Workshop notes: http://www.math.yorku.ca/SCS/spida/lm/

General Linear Models: Overview
Quantitative response: Linear models attempt to describe, predict or explain a quantitative response variable (y) from one or more predictor / explanatory variables (xs).
One predictor: simple linear regression:
• Q: How does y change as x changes?
• M: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
One-way ANOVA:
• Q: How does the mean of y change over levels of factor A?
• M: $y_{ij} = \mu + \alpha_i + \epsilon_{ij}$
[Figures: Prestige score vs. Education (years), with the point Medical_tech labelled; Prestige score by Job Type (Blue collar, White collar, Professional)]


General Linear Models: Overview
Two predictors: multiple linear regression:
• Q: How does y change as x1 and x2 change?
• M: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$
[Figures: "Multiple regression response surface", Y = 2*X1 + -1*X2, shown from two viewpoints; axes X1 and X2 (-10 to 10), Y (-50 to 50)]
Two-way ANOVA:
• Q: How does the mean of y change over levels of factors A and B?
• M: $y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}$

Homogeneity of regression:
• Q: Is the relation of y on x the same for all levels of factor A?
• M: $y_{ijk} = \beta_0 + \alpha_i + \beta_j x_{ij} + (\alpha\beta)_{ij}(x_{ij}) + \epsilon_{ijk}$
ANCOVA:
• Q: How does the mean of y change over levels of A, controlling (adjusting) for y on x?
• M: $y_{ijk} = \beta_0 + \alpha_i + \beta x_{ij} + \epsilon_{ijk}$
[Figures: two panels of Prestige score vs. Education (years), with separate lines by Job Type (Blue collar, White collar, Professional)]

General Linear Models: Overview
Response surface models:
• Q: Is the relation of y to x1 and x2 linear?
• M: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_{11} x_{i1}^2 + \beta_2 x_{i2} + \beta_{22} x_{i2}^2 + \epsilon_i$
Models with interactions:
• Q: Is the relation of y to x1 the same for all x2?
• M: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12} x_{i1} x_{i2} + \epsilon_i$
[Figures: "Quadratic response surface", Y = 2*X1 + -1*X2 + 0.5*X1*X1 + -0.2*X2*X2; "X1 * X2 Interaction", Y = 2*X1 + -1*X2 + 0.20*X1*X2; axes X1 and X2 (-10 to 10), Y (-50 to 50)]

In each case, we can represent the model in the same form:
$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i $$
    response = wtd. sum of predictors + residual
    data = explained (partial summary) + unexplained
where the xs can be:
• Quantitative regressors: age, income, education
• Transformed regressors: $\sqrt{\text{age}}$, log(income)
• Polynomial regressors: age$^2$, age$^3$, ...
• Categorical predictors: treatment, sex, coded as "dummy" (0/1) variables
• Interaction regressors: treatment × age, sex × age
• Any combinations of the above ⇒ the General Linear Model
"Linear model" → linear in the parameters, $\beta_1, \beta_2, \beta_3, \ldots$, e.g.,
$$ y_i = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{age}^2 + \beta_3 \log(\text{income}) + \beta_4\,(\text{sex} = \text{'F'}) + \epsilon_i $$
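To illustrate how the different kinds of regressors become columns of one design matrix, here is a minimal Python sketch (the slides themselves use SAS; Python is used here only for illustration). The function name `design_row` and the three records are hypothetical; the columns follow the example model with age, age squared, log(income) and a 0/1 dummy for sex = 'F'.

```python
import math

def design_row(age, income, sex):
    """One row of X for y = b0 + b1*age + b2*age^2 + b3*log(income) + b4*(sex == 'F')."""
    return [
        1.0,                         # intercept column
        float(age),                  # quantitative regressor
        float(age) ** 2,             # polynomial regressor
        math.log(income),            # transformed regressor
        1.0 if sex == "F" else 0.0,  # dummy (0/1) coding of a categorical predictor
    ]

# Hypothetical records: (age, income, sex)
records = [(25, 30000, "F"), (40, 52000, "M"), (58, 41000, "F")]
X = [design_row(*r) for r in records]
for row in X:
    print(row)
```

Every column is a known function of the data, so the model stays linear in the parameters even though it is not linear in age or income.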

General Linear Models: Overview
All of these can be represented in matrix form,
$$ y = X\beta + \epsilon \qquad (1) $$
or,
$$ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =
   \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\
                   1 & x_{21} & x_{22} & \cdots & x_{2p} \\
                   \vdots & \vdots & \vdots & & \vdots \\
                   1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}
   \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix} +
   \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \qquad (2) $$
In all cases,
• Parameter estimates: $\hat\beta = (X^T X)^{-1} X^T y$
• Residuals = estimated errors = $e = y - \hat y = y - X\hat\beta$
• Residual variance: MSE ≡ $\widehat{\mathrm{Var}}(\epsilon) = (e^T e)/(n - p - 1)$
• Standard errors: $\widehat{\mathrm{Var}}(\hat\beta) = \mathrm{MSE}\,(X^T X)^{-1}$
• Parameter tests: $H_0: \beta_i = 0 \Rightarrow t = \hat\beta_i / \sqrt{\widehat{\mathrm{Var}}(\hat\beta_i)} \sim t(n - p - 1)$

General Linear Models: Assumptions
In the GLM, for validity of inference, we must make some assumptions (the "Holy Trinity"):
• Independence: all errors, $\epsilon_i, \epsilon_{i'}$, are statistically independent: $\mathrm{Cov}(\epsilon_i, \epsilon_{i'}) = 0$
  Key difference between ordinary GLMs vs. Mixed models, HLMs, repeated/longitudinal models
• Constant error variance: $\mathrm{Var}(\epsilon_i) \equiv \mathrm{Var}(y_i \mid x_i) = \sigma^2$ = constant
  Key difference between ordinary GLMs vs. logistic regression, Poisson regression and the Generalized Linear Model
• Normality of errors: The errors, $\epsilon_i$, have a normal distribution
  $\epsilon_i \sim \mathrm{NID}\ N(0, \sigma^2)$,   $\epsilon \sim N(0, \sigma^2 I)$
In addition, we implicitly assume:
• Model specification: the correct xs have been included, each in the correct form
• Fixed xs: the predictor variables are measured w/o error (rarely true!)
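The matrix formulas for the parameter estimates, residuals, MSE, standard errors and t statistics can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the helper names (`transpose`, `matmul`, `inverse`, `ols`) and the small data set are hypothetical, and a real analysis would use PROC REG or PROC GLM rather than inverting $X^T X$ by hand.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def ols(X, y):
    """Return (beta_hat, residuals, MSE, standard errors, t statistics)."""
    n, k = len(X), len(X[0])              # k = p + 1 columns, including the intercept
    Xt = transpose(X)
    XtX_inv = inverse(matmul(Xt, X))
    beta = [row[0] for row in matmul(XtX_inv, matmul(Xt, [[v] for v in y]))]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    e = [yi - fi for yi, fi in zip(y, fitted)]
    mse = sum(r * r for r in e) / (n - k)  # n - p - 1 degrees of freedom
    se = [math.sqrt(mse * XtX_inv[j][j]) for j in range(k)]
    t = [b / s for b, s in zip(beta, se)]
    return beta, e, mse, se, t

# Hypothetical data, roughly y = 1 + 2x plus small errors
x = [1, 2, 3, 4, 5, 6]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
X = [[1.0, xi] for xi in x]
beta, e, mse, se, t = ols(X, y)
print("beta:", beta)
print("MSE:", mse, "SE:", se, "t:", t)
```

The t statistics would be referred to a t distribution with n - p - 1 degrees of freedom, as in the parameter tests above.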

Sidebar: Using SAS macros
SAS macros are high-level, general programs consisting of a series of DATA steps and PROC steps.
Keyword arguments substitute your data names, variable names, and options for the named macro parameters.
Use as:
    %macname(data=dataset, var=variables, ...);
e.g.,
    %boxplot(data=nations, var=imr, class=region, id=nation);
Most arguments have default values (e.g., data=_last_)
All SSSG and VCD macros have internal and/or online documentation,
    http://www.math.yorku.ca/SCS/sssg/
    http://www.math.yorku.ca/SCS/sasmac/
    http://www.math.yorku.ca/SCS/vcd/
Macros can be installed in directories automatically searched by SAS. Put the following options statement in your AUTOEXEC.SAS file:
    options sasautos=('c:\sasuser\macros' sasautos);

E.g., the SYMBOX macro is defined with the following arguments:
symbox.sas ···
    %macro symbox(
        data=_last_,              /* name of input data set                */
        var=,                     /* name(s) of the variable(s) to examine */
        id=,                      /* name of ID variable                   */
        out=symout,               /* name of output data set               */
        orient=V,                 /* orientation of boxplots: H or V       */
        powers=-1 -0.5 0 .5 1,    /* list of powers to consider            */
        name=symbox               /* name for graph in graphics catalog    */
        );
Typical use:
    %symbox(data=baseball,
            var=Salary Runs,          /* analysis variables */
            id=name,                  /* player ID variable */
            powers=-1 -.5 0 .5 1 2);

SSSG Macros & programs
Macros available at http://www.math.yorku.ca/SCS/sssg/, .../SCS/sasmac/

Univariate displays
    BOXPLOT    Box-and-whisker plots
    DATACHK    Basic data screening for numeric variables
    NQPLOT     Normal QQ plot
    SYMBOX     Boxplots for transformations to symmetry
    SYMPLOT    Diagnostic plots for transformations to symmetry
Bivariate displays
    CONTOUR    Plot elliptical contours for X, Y data
    LOWESS     Locally weighted scatterplot smoother
    RESLINE    Resistant line for bivariate data
    SUNPLOT    Sunflower plot for X-Y data
Multivariate displays
    BIPLOT     Generalized biplot of observations and variables
    CORRGRAM   Draw a correlogram
    OUTLIER    Robust multivariate outlier detection
    SCATMAT    Scatterplot matrix
    STARS      Star plot for multivariate data
Linear and Generalized Linear Models
    ADDVAR     Added variable plots for logistic regression
    BOXCOX     Power transformations by Box-Cox method (PROC REG)
    BOXGLM     Power transformations by Box-Cox method (PROC GLM)
    BOXTID     Power transformations by Box-Tidwell method
    CPPLOT     Plots of Mallows' C(p) and related statistics for model selection
    DUMMY      Construct dummy variables for regression models
    HALFNORM   Half-normal plots for generalized linear models
    INFLGLIM   Influence plots for generalized linear models
    INFLOGIS   Influence plots for logistic regression
    INFLPLOT   Influence plot for regression models
    INTERACT   Create interaction variables
    MEANPLOT   Plot means for factorial designs
    PARTIAL    Partial residual and partial regression plots
    ROBUST     Robust fitting for linear models (REG, GLM, LOGISTIC) via IRLS
    RSQDELTA   Compute R-square change and F-statistics in regression
    SPRDPLOT   Spread-Level plot to find transformation to equalize variances

Part 2: Exploring and transforming data
• Transformations to symmetry
• Transformations to linearity - resistant lines, Box-Cox
• Dealing with heteroscedasticity

Transformations to symmetry
Transformations have several uses in data analysis, including:
• making a distribution more symmetric.
• equalizing variability (spreads) across groups.
• making the relationship between two variables linear.
These goals often coincide: a transformation that achieves one goal will often help for another (but not always).
Some tools (Friendly, 1991):
• Understanding the ladder of powers.
• SYMBOX macro - boxplots of data transformed to various powers.
• SYMPLOT macro - various plots designed to assess symmetry.
• POWER plot: line with slope b ⇒ $y \to y^p$, where p = 1 - b (rounded to 0.5).
• BOXCOX macro - for regression model, transform $y \to y^p$ to minimize MSE (or maximum likelihood); influence plot shows impact of observations on choice of power (Box and Cox, 1964).
• BOXGLM macro - for GLM (anova/regression), transform $y \to y^p$ to minimize MSE (or max. likelihood)
• BOXTID macro - for regression, transform $x_i \to x_i^p$ (Box and Tidwell, 1962).

Transformations - Ladder of Powers
Power transformations are of the form $x \to x^p$.
A useful family of transformations is the ladder of powers (Tukey, 1977), defined as $x \to t_p(x)$,
$$ t_p(x) = \begin{cases} \dfrac{x^p - 1}{p}, & p \neq 0 \\ \log_{10} x, & p = 0 \end{cases} \qquad (3) $$
Key ideas:
• log(x) plays the role of $x^0$ in the family: halfway between p = -1/2 ($-1/\sqrt{x}$) and p = 1/2 ($\sqrt{x}$)
• 1/p → keeps order of x the same for p < 0, e.g., p = -1 → -1/x

For simplicity:
• usually use only simple integer and half-integer powers (sometimes, p = 1/3 → $\sqrt[3]{x}$)
• scale the values to keep results simple.

    Power   Transformation     Re-expression
      3     Cube               $x^3/100$
      2     Square             $x^2/10$
      1     NONE (Raw)         $x$
     1/2    Square root        $\sqrt{x}$
      0     Log                $\log_{10} x$
    -1/2    Reciprocal root    $-10/\sqrt{x}$
     -1     Reciprocal         $-100/x$
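The ladder-of-powers family in equation (3) can be sketched as a small Python function (Python is used for illustration only; the function name `t_p` mirrors the notation above and the sample values are hypothetical). It follows the definition as given, using log10 for p = 0, and shows that the order of the data is preserved for every power, including negative ones.

```python
import math

def t_p(x, p):
    """Ladder-of-powers transform for x > 0: (x**p - 1)/p if p != 0, log10(x) if p == 0."""
    if x <= 0:
        raise ValueError("the ladder of powers requires x > 0; add a constant first")
    if p == 0:
        return math.log10(x)
    return (x ** p - 1.0) / p

data = [1.0, 4.0, 9.0, 100.0]   # hypothetical positive values, already sorted
for p in (-1, -0.5, 0, 0.5, 1, 2):
    transformed = [t_p(x, p) for x in data]
    # Dividing by p keeps the transformed values in the same order as the raw ones
    print(p, [round(v, 3) for v in transformed], transformed == sorted(transformed))
```

Lower powers pull in the largest value (100) more and more strongly, which is what makes them useful for right-skewed data.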

Ladder of Powers - Properties
• Preserve the order of data values. Larger data values on the original scale will be larger on the transformed scale. (That's why negative powers have their sign reversed.)
• They change the spacing of the data values. Powers p < 1, such as $\sqrt{x}$ and log x, compress values in the upper tail of the distribution relative to low values; powers p > 1, such as $x^2$, have the opposite effect, expanding the spacing of values in the upper end relative to the lower end.
• Shape of the distribution changes systematically with p. If $\sqrt{x}$ pulls in the upper tail, log x will do so more strongly, and negative powers will be stronger still.
• Requires all x > 0. If some values are negative, add a constant first, i.e., $x \to t_p(x + c)$
• Has an effect only if the range of x values is moderately large.

Ladder of Powers - Example
SLID (Ontario subset) - Wages and Salaries
SYMBOX macro - transforms a variable to a list of powers, shows standardized scores using the BOXPLOT macro
    title 'SLID: wgsal42c, Wages and salaries, 1994';
    %symbox(data=slid.pontario,
            var=wgsal42c,                /* variable       */
            powers=-1 -0.5 0 0.5 1);     /* list of powers */
[Figure: "SLID: wgsal42c, Wages and salaries, 1994"; boxplots of Wages and salaries - 1994 (Std.) for each Power: -1/X, -1/Sqrt, Log, Sqrt, Raw]

[Figure: "SLID: wgsal42c, Wages and salaries, 1994"; boxplots of Wages and salaries - 1994 (Std.) for each Power: -1/X, -1/Sqrt, Log, Sqrt, Raw]
wgsal42c → log(wgsal42c) looks best.

Ladder of Powers - Example
SLID (Ontario subset) - Hourly wage
SYMBOX macro - transforms a variable to a list of powers, shows standardized scores using the BOXPLOT macro
    title 'SLID: cmphw28c, Comp. hourly wage, 1994';
    %symbox(data=slid.pontario, var=cmphw28c,
            powers=-1 -0.5 0 0.5 1);
[Figure: "SLID: cmphw28c, Comp. hourly wage, 1994"; boxplots of Comp. hrly wage all jobs - 1994 (Std.) for each Power: -1/X, -1/Sqrt, Log, Sqrt, Raw]

[Figure: "SLID: cmphw28c, Comp. hourly wage, 1994"; boxplots of Comp. hrly wage all jobs - 1994 (Std.) for each Power: -1/X, -1/Sqrt, Log, Sqrt, Raw]
cmphw28c → $\sqrt{\text{cmphw28c}}$ or log(cmphw28c) looks OK.
See http://www.math.yorku.ca/SCS/sasmac/symbox.html

Transformations to linearity
Brain weight and body weight of mammals:
• Marginal boxplots show that both variables are highly skewed
• Most points bunched up at origin
• Relation is strongly non-linear
• Log transform removes both problems
[Figures: "Brain weight and body weight of mammals"; Brain weight vs. Body weight on the raw scales, and log10(Brain weight) vs. log10(Body weight)]

Transformations to linearity
If y is a response ("dependent") and x is a predictor, we often want to fit
    y = f(x) + residual
Generally we prefer a "simple" f(x), like a linear function, y = a + bx + residual.
If the relation between y and x is substantially non-linear, we have two choices:
• Bend the model: Try fitting a quadratic, cubic, or other polynomial (easy: linear in parameters), or else a non-linear model, e.g., y = a exp(bx) (harder).
• Unbend the data: Transform either y → y′, or x → x′ (or both), so that the relation is linear,
    y′ = a + bx′ + residual
Ladder of powers and Tukey's "arrow rule" indicate which direction to go.
A "ratio of slopes" table pinpoints good power transformations.

Tukey's arrow rule and the double ladder of powers:
• Draw an arrow in the direction of the "bulge".
• The arrow points in the direction to move along the ladder of powers for x or y (or both).
[Figure: four curve shapes (a)-(d) with bulge arrows, drawn against the double ladder of powers; y ladder: ..., log, sqrt, raw, Y2, Y3, ...; x ladder: ..., log, sqrt, X2, X3, ...]

Transformations to linearity
Resistant lines and the ratio of slopes table (Tukey, 1977):
• Least squares regression can give misleading results with highly skewed data or with outliers
• A resistant line often does better with ill-behaved data
• Summary values: medians of thirds, dividing by X-values (but neither end-third can cover more than 1/2 the range)

    Summary Values
               X           Y        n
    Low        0.122       2.500   21
    Mid       10.000      80.996   39
    High    4600.627    5157.515    2

Ratio of slopes
The curvature of the data can be measured by the ratio of slopes
$$ r = \frac{\text{upper slope}}{\text{lower slope}} = \frac{(y_H - y_M)/(x_H - x_M)}{(y_M - y_L)/(x_M - x_L)} $$
e.g.,
               X           Y       half-slope    ratio
    High    4600.627    5157.515     1.1058
    Mid       10.000      80.996                 0.1391
    Low        0.122       2.500     7.9465
[Figure: three panels labelled r < 1, r = 1 and r > 1, each plotting Y against X with Low, Middle and High summary points]
A linear relation ⇒ r ≈ 1 (or log r ≈ 0)
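As a check on the arithmetic, the half-slopes and their ratio can be computed directly from the three summary points. This is a minimal Python sketch for illustration (the function name `half_slope_ratio` is hypothetical); because the summary values are only printed to three decimals, the results agree with the tabled values up to rounding.

```python
def half_slope_ratio(low, mid, high):
    """Each argument is an (x, y) summary point; returns (upper slope, lower slope, ratio)."""
    (xl, yl), (xm, ym), (xh, yh) = low, mid, high
    upper = (yh - ym) / (xh - xm)   # slope between the Mid and High summary points
    lower = (ym - yl) / (xm - xl)   # slope between the Low and Mid summary points
    return upper, lower, upper / lower

# Summary values from the brain weight / body weight example
upper, lower, r = half_slope_ratio((0.122, 2.500), (10.000, 80.996), (4600.627, 5157.515))
print("upper:", upper, "lower:", lower, "ratio:", r)
```

A ratio this far below 1 says the upper part of the relation is much flatter than the lower part, so the raw scales are strongly non-linear.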

The effect of any transformation, $x \to x^p$, $y \to y^q$, can be judged by the effect it has on the ratio of slopes,
$$ r(p, q) = \frac{(y_H^q - y_M^q)/(x_H^p - x_M^p)}{(y_M^q - y_L^q)/(x_M^p - x_L^p)} $$
The resline macro calculates the ratio of slopes for a set of powers of x and of y
    %resline(data=brains, x=bodywt, y=brainwt, id=mammal);

For this data, values of r ≈ 1 tend to run along the diagonal; log-log is the best combination

    ----- Ratio of Slopes table ------
    Rows are powers of X, columns are powers of Y
              -1.0     -0.5      log      sqrt       raw         2.0
    -1.0     2.544   15.127   96.908   687.070   5247.745   329241.7
    -0.5     0.265    1.575   10.089    71.527    546.314   34275.54
    log      0.023    0.134    0.858     6.085     46.477   2915.947
    sqrt     0.001    0.008    0.052     0.368      2.813    176.504
    raw      0.000    0.000    0.003     0.018      0.139      8.731
    2.0      0.000    0.000    0.000     0.000      0.000      0.019

    ------- 5 Best powers -------
    Power of X   Power of Y   Slope Ratio   log Ratio
    log          log             0.858       -0.066
    -0.5         -0.5            1.575        0.197
    -1.0         -1.0            2.544        0.405
    sqrt         sqrt            0.368       -0.434
    sqrt         raw             2.813        0.449

See http://www.math.yorku.ca/SCS/sasmac/resline.html
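The ratio-of-slopes table can be reproduced, to rounding, from the three summary points alone. The following Python sketch is an illustration and not the resline macro itself; the names `re_express` and `slope_ratio` are hypothetical, and the Low, Mid and High points are the summary values for the brain weight and body weight data. Power 0 stands for the log, as on the ladder of powers.

```python
import math

# Summary points (x, y) for Low, Mid and High from the brain/body weight example
LOW, MID, HIGH = (0.122, 2.500), (10.000, 80.996), (4600.627, 5157.515)

def re_express(v, p):
    """Ladder re-expression: log10(v) when p == 0, otherwise v**p."""
    return math.log10(v) if p == 0 else v ** p

def slope_ratio(p, q, low=LOW, mid=MID, high=HIGH):
    """r(p, q): ratio of upper to lower half-slope after x -> x^p, y -> y^q."""
    xl, xm, xh = (re_express(pt[0], p) for pt in (low, mid, high))
    yl, ym, yh = (re_express(pt[1], q) for pt in (low, mid, high))
    return ((yh - ym) / (xh - xm)) / ((ym - yl) / (xm - xl))

powers = [-1.0, -0.5, 0, 0.5, 1.0, 2.0]
table = {(p, q): slope_ratio(p, q) for p in powers for q in powers}

# The best combination has a ratio closest to 1, i.e. log ratio closest to 0
best = min(table, key=lambda pq: abs(math.log10(table[pq])))
print("best (power of X, power of Y):", best, "ratio:", round(table[best], 3))
```

Omitting the sign reversal for negative powers does not matter here, because it would change the sign of both half-slopes and leave their ratio unchanged.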

Transformations to linearity
Infant mortality rate and per-capita income
• Arrow points toward lower powers of x and/or y
• Ratio of slopes suggests log x, log y
[Figures: "IMR vs. Per Capita Income", Infant Mortality Rate vs. Per Capita Income; "IMR data: log-log fit", log Infant Mortality vs. log Income]

[Figure: "IMR data: Residuals from log-log fit", Residual vs. log Income; labelled points: Taiwan, Zambia, Papua New Guinea, Lebanon, Saudi Arabia, Libya]

Box-Cox Transformations
Another way to select an "optimal" transformation of y in regression is to add a parameter for the power to the model,
$$ y^{(\lambda)} = X\beta + \epsilon $$
where λ is the power of y in (the 'ladder')
$$ y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases} $$
• Box and Cox (1964) proposed a maximum likelihood procedure to estimate the power (λ) along with the regression coefficients (β).
• This is equivalent to minimizing $\sqrt{\mathrm{MSE}}$ over choices of λ. ⇒ fit the model for a range of λ (-2 to +2, say)
• The maximum likelihood method also provides a 95% confidence interval for λ.
• Plot of partial t or F for each regressor vs. λ → sensitivity to power.

Baseball data: predicting Salary from Years, RBIc, HITSc.
• CI(λ) includes λ = 0 → log(Salary)
• Effects plot shows t statistic for each regressor
The boxcox macro provides the RMSE, EFFECTS, and INFL plots:
basecox.sas
    title 'Box-Cox transformation for Baseball salary';
    %include data(baseball);
    %boxcox(data=baseball,
            id=name,                   /* player ID  */
            resp=Salary,               /* response   */
            model=Years HITSc RBIc,    /* predictors */
            gplot=RMSE EFFECT INFL);
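The idea of fitting the model over a grid of λ and picking the smallest RMSE can be sketched in Python as follows. This is an illustration under stated assumptions, not the boxcox macro: the data are hypothetical and deterministic, there is a single predictor, and the transform is divided by a power of the geometric mean of y, a scaling that is not shown on the slide but is needed for RMSE values at different λ to be comparable.

```python
import math

def boxcox_normalized(y, lam):
    """Box-Cox transform scaled by the geometric mean so RMSE is comparable across lambda."""
    gm = math.exp(sum(math.log(v) for v in y) / len(y))
    if lam == 0:
        return [gm * math.log(v) for v in y]
    return [(v ** lam - 1.0) / (lam * gm ** (lam - 1.0)) for v in y]

def rmse_simple_regression(x, z):
    """RMSE from the least squares line of z on a single predictor x."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b = sum((a - mx) * (c - mz) for a, c in zip(x, z)) / sxx
    a0 = mz - b * mx
    sse = sum((c - a0 - b * a) ** 2 for a, c in zip(x, z))
    return math.sqrt(sse / (n - 2))

# Hypothetical data: log(y) is linear in x apart from a small deterministic wiggle
x = [float(i) for i in range(1, 21)]
y = [math.exp(0.2 + 0.15 * xi + 0.03 * math.sin(3 * xi)) for xi in x]

lambdas = [k / 2 for k in range(-4, 5)]   # -2.0, -1.5, ..., 2.0
profile = {lam: rmse_simple_regression(x, boxcox_normalized(y, lam)) for lam in lambdas}
best = min(profile, key=profile.get)
for lam in lambdas:
    print(f"lambda={lam:5.1f}  RMSE={profile[lam]:.4f}")
print("best lambda on this grid:", best)
```

Plotting `profile` against λ gives the kind of RMSE curve that the macro produces with gplot=RMSE.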

[Figures: "Box-Cox Power Transform for Salary", Root Mean Squared Error vs. Box-Cox Power (λ); "t-values for Model Effects on Salary", t-value vs. Box-Cox Power (λ) for Years, HITSc and RBIc]
See http://www.math.yorku.ca/SCS/sasmac/boxcox.html

Box-Cox Transformations: GLMs
For grouping (CLASS) predictors, we can do the same analysis, using PROC GLM.
The BOXGLM macro handles models with CLASS variables
pontboxglm.sas
    title 'SLID: cmphw28c, Hourly wages, 1994';
    %boxglm(data=slid.pontario,
            resp=cmphw28c,
            model=YrSch18c eAge26c Sex21 MoTn2g15,
            class=sex21 motn2g15,
            lopower=-1.6,
            gplot=RMSE EFFECT);
Wages: cmphw28c → log(cmphw28c)
See http://www.math.yorku.ca/SCS/sasmac/boxglm.html

[Figures: "Box-Cox Power Transform for cmphw28c", Root Mean Squared Error vs. Box-Cox Power (λ); "F-values for Model Effects on cmphw28c", F-value vs. Box-Cox Power (λ) for YrSch18c, eAge26c, Sex21 and MoTn2g15]

Box-Cox: Score test and influence plot
• A score test is based on the slope of the log L function at λ = 1 (slope ≈ 0 ↔ at maximum)
• For Box-Cox, this can be formulated as the t statistic for a constructed variable, g,
  $$ g_i = y_i \left( \log\frac{y_i}{\dot y} - 1 \right) $$
  where $\dot y$ is the geometric mean of the $y_i$.
• Fit the model $y = X\beta + \phi g + \epsilon$. Test $H_0: \phi = 0$ (↔ λ = 1).
• Another estimate of λ is $1 - \hat\phi$.
• A partial regression plot for g shows the influence of individual observations on the choice of the transformation.
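The constructed-variable idea can be sketched in Python for the case of one predictor. This is an illustration with hypothetical, deterministic data, not the boxcox macro. It uses the fact that the coefficient of g in the regression of y on x and g equals the slope in the partial regression plot, that is, the slope of the residuals of y on x against the residuals of g on x.

```python
import math

def residuals_on(x, v):
    """Residuals of v after a least squares line (with intercept) on x."""
    n = len(x)
    mx, mv = sum(x) / n, sum(v) / n
    b = sum((a - mx) * (c - mv) for a, c in zip(x, v)) / sum((a - mx) ** 2 for a in x)
    return [c - mv - b * (a - mx) for a, c in zip(x, v)]

def constructed_variable(y):
    """g_i = y_i * (log(y_i / gm) - 1), with gm the geometric mean of y."""
    gm = math.exp(sum(math.log(v) for v in y) / len(y))
    return [v * (math.log(v / gm) - 1.0) for v in y]

# Hypothetical data for which log(y) is nearly linear in x (so lambda near 0 is appropriate)
x = [float(i) for i in range(1, 21)]
y = [math.exp(1.0 + 0.08 * xi + 0.02 * math.sin(3 * xi)) for xi in x]

g = constructed_variable(y)
ry, rg = residuals_on(x, y), residuals_on(x, g)   # coordinates of the partial regression plot
phi = sum(a * b for a, b in zip(rg, ry)) / sum(a * a for a in rg)
print("phi_hat:", phi, " one-step estimate of lambda:", 1 - phi)
```

The value 1 - phi_hat is only a one-step approximation to the maximum likelihood estimate, so it points toward the log here without being exactly 0; plotting `ry` against `rg` shows which observations drive the slope.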

Baseball data: predicting Salary from Years, RBIc, HITSc.
• The influence plot shows that a few players are strongly determining the choice of power, but they are not out of line with the rest.
• The slope ($\hat\phi$) again leads to the choice λ = 0 ⇒ log y
Plot produced by the BOXCOX macro (with GPLOT=INFL):
[Figure: "Partial Regression Influence plot for Box-Cox power", Partial Salary vs. Partial Constructed Variable; Slope: 0.939, Power: 0; labelled points: Murray, Rice, Schmidt, Smith]

Transformations of predictors
In any correlational analysis (e.g., regression, factor analysis) we can get a simple overview of the relations by
• Plotting all pairs of variables together (scatmat macro)
• Drawing a quadratic regression curve for each pair
    %scatmat(..., interp=rq).
  "curves" will be straight when the relations are linear. (lowess fits are better, but more computationally intensive.)

e.g., Canadian occupational prestige: % women, income, education
[Figure: scatterplot matrix of Prestige (14.8 to 87.2), Women (0 to 97.51), Educ (6.38 to 15.97) and Income (611 to 25879)]
→ Prestige non-linear w.r.t. Educ and Income

Box-Tidwell Transformations
Box and Tidwell (1962) suggested a model to determine transformations of the Xs,
$$ y = \beta_0 + \beta_1 x_1^{\gamma_1} + \cdots + \beta_k x_k^{\gamma_k} + \epsilon $$
Parameters of this model, $\beta_0, \beta_1, \ldots, \beta_k, \gamma_1, \ldots, \gamma_k$, can be estimated by:
1. Regress y on $x_1, \ldots, x_k$ → $b_0, b_1, \ldots, b_k$.
2. Create constructed variables, $x_1 \log x_1, \ldots, x_k \log x_k$.
3. Regress y on $x_1, \ldots, x_k, x_1 \log x_1, \ldots, x_k \log x_k$ → $b'_0, b'_1, \ldots, b'_k, g_1, \ldots, g_k$
4. Estimate of the power $\gamma_i$ is given by $\hat\gamma_i = 1 + g_i / b_i$
5. Repeat steps 3, 4 until the $\hat\gamma_i$ converge (gives MLE).
The constructed variables, $x_i \log x_i$, can be used to test the need for a transformation of $x_i$:
• Test $H_0: \gamma_i = 1$ from test of coefficient of $x_i \log x_i = 0$.
• Partial regression plots for the constructed variables help to assess the leverage and influence on the decision to transform an x variable.
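The steps above can be sketched in Python for a single positive predictor. This is an illustration, not the BOXTID macro: the function names and the noise-free data are hypothetical, and the way the repetition is carried out (re-expressing x with the current power and multiplying that power by 1 + g/b) is one common formulation, assumed here rather than taken from the slide. The coefficient g of the constructed variable is obtained from residuals, which gives the same value as the joint regression in step 3.

```python
import math

def fit_line(x, v):
    """Least squares intercept and slope of v on x."""
    n = len(x)
    mx, mv = sum(x) / n, sum(v) / n
    b = sum((a - mx) * (c - mv) for a, c in zip(x, v)) / sum((a - mx) ** 2 for a in x)
    return mv - b * mx, b

def residuals_on(x, v):
    a0, b = fit_line(x, v)
    return [c - a0 - b * a for a, c in zip(x, v)]

def box_tidwell_power(x, y, tol=1e-6, max_iter=50):
    """Estimate gamma in y = b0 + b1 * x**gamma + error for one positive predictor x."""
    gamma = 1.0
    for _ in range(max_iter):
        z = [v ** gamma for v in x]               # current re-expression of x
        _, b = fit_line(z, y)                     # step 1: slope of y on z
        c = [v * math.log(v) for v in z]          # step 2: constructed variable z*log(z)
        ry, rc = residuals_on(z, y), residuals_on(z, c)
        g = sum(p * q for p, q in zip(rc, ry)) / sum(p * p for p in rc)   # step 3
        new_gamma = gamma * (1.0 + g / b)         # step 4, applied to the current power
        if abs(new_gamma - gamma) < tol:          # step 5: stop when the power settles
            return new_gamma
        gamma = new_gamma
    return gamma

# Hypothetical noise-free data with true power 0.5
x = [1.0 + 0.25 * i for i in range(17)]           # 1.0, 1.25, ..., 5.0
y = [3.0 + 2.0 * math.sqrt(v) for v in x]
gamma_hat = box_tidwell_power(x, y)
print("estimated power:", gamma_hat)
```

With real data the iterations can oscillate before settling, as the Income power does in the prestige example, so a cap on the number of iterations is worth keeping.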

Box-Tidwell transformations: Example
Canadian Occupational Prestige - find powers for Educ and Income
The BOXTID macro carries out this procedure:
    %boxtid(data=prestige,
            yvar=Prestige,
            id=job,
            xvar=Women Educ Income,    /* vars in model  */
            xtrans=Educ Income,        /* vars to xform  */
            round=.5,                  /* round powers   */
            out=boxtid);               /* output dataset */
Printed results show the iteration history...

    Iteration History: Transformation Powers
    Iteration      EDUC     INCOME   Criterion
        1        2.2551    -0.9132      1.9132
        2        2.3790     0.8273      1.9059
        3        2.3593    -0.6834      1.8261
        4        2.3221     0.4444      1.6503
       ...
       13        2.2109    -0.0426      0.0005

... and (score) tests for power transformations

    Score tests for power transformations
                Power    StdErr   Score Z   Prob>|Z|
    EDUC       2.2109    4.9114    2.4097     0.0160
    INCOME    -0.0426    0.0000   -5.2625     0.0000

Powers are rounded to the nearest 0.5: Educ → Educ$^2$, Income → log Income.

Page 12: Michael Friendly - YorkU Math and Stats ·  SPIDA 2004 2 Michael Friendly. Review of Linear Models and Model Building Strategies General Linear Models: Overview

Box-Tidwell transformations: Example

The BOXTID macro creates the transformed variables for you (e.g., t_income).
Plot with the LOWESS macro, adding linear regression lines:

    %lowess(data=boxtid, x=t_educ,   y=prestige, id=job,
            f=.667, interp=rl);
    %lowess(data=boxtid, x=t_income, y=prestige, id=job,
            f=.667, interp=rl);

Plots of Prestige vs. Educ^2 and log(Income) show that both variables are now approx. linearly related to Prestige.

Box-Tidwell transformations: Example

Partial regression plots for the transformed variables show that several observations are influential for the choice of power for Income.

[Figure, two panels: Partial Prestige vs. Partial Constructed Variable (Educ), BT power: 2; Partial Prestige vs. Partial Constructed Variable (Income), BT power: 0. Labeled points in both panels: General_managers, Vocational_couns, Ministers, Physicians, Veterinarians, Farmers]

Dealing with heteroscedasticity

Classical linear models (ANOVA, regression) assume constant residual variance:

    $y = X\beta + \varepsilon, \quad \mathrm{Var}(\varepsilon) = \mathrm{Var}(y \mid X) = \sigma^2 = \text{constant}$

Diagnosis:
ANOVA: examine variability (IQR, std. dev.) of residuals by groups
• Plot means ± 1 std. error (meanplot macro)
• Boxplots of residuals vs. predicted (boxplot macro)
• Spread vs. level plots: plot log(IQR) vs. log(Med) (sprdplot macro)
Regression: examine variability (IQR, std. dev.) of residuals by x or ŷ
• Divide x or ŷ into groups (e.g., deciles), then plot as for ANOVA
• Spread vs. level plots: plot log(|e_i|) vs. log(x)

Treating the disease:
Fix the data: variance stabilizing transformation, y → y^p
Fix the model:
• WLS estimation (weights, $w_i \sim 1/\sigma_i^2$)
• Use a generalized linear model

[Figure, two panels: Prestige score vs. Squared Educ; Prestige score vs. Log Income]

The lowest two occupations on log(Income) should be looked at more closely.

[Figure, two panels: "Survival times of animals: Means", Survival time (hrs) vs. Treatment (A to D) by POISON (1 to 3); "Survival times of animals: Residuals vs. Pred", RESID vs. Predict]

Both plots show greater variance associated with longer survival time.
Why should we not be surprised?

Dealing with heteroscedasticity: ANOVA

Survival time of animals: exposed to poison, then given treatment (Box and Cox, 1964)
Plot means ± 1 std. error (meanplot macro)
Boxplots of residuals vs. predicted (boxplot macro)
Trick: values of ŷ are constant within cells

animals.sas

    %meanplot(data=animals,
              class=poison treatmt,   /* factors  */
              response=time);         /* response */

    *-- Fit full 2-way model, get output data set;
    proc glm data=animals;
       class poison treatmt;
       model time = poison | treatmt;
       output out=results p=predict r=resid;

    *-- Boxplot of residuals vs. predicted;
    %boxplot(data=results, class=Predict, var=resid);

[Figure, two panels: "Spread - Level plot", log Spread vs. log Median time, cells labeled A1 to D3, Slope: 2.00, Power: -1.0; "Meanplot of 1/Time", 1/TIME vs. Treatment (A to D) by POISON (1 to 3)]

The plot suggests transforming Time → 1/Time.
1/Time also reduces the apparent interaction of Poison * Treatment.

Dealing with heteroscedasticity: ANOVA

Spread vs. level plots (the sprdplot macro)
Plot log(spread) vs. log(level), e.g., log(IQR) vs. log(Median)
If a linear relation exists, with slope b, transform y → y^p, with p = 1 − b.

··· animals.sas

    %sprdplot(data=animals,
              class=poison treatmt,
              var=time);              /* creates t_time */

    *-- Plot means of transformed variables;
    %meanplot(data=animals,
              class=poison treatmt,
              response=t_time);
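The rule p = 1 − b can be checked on a small made-up example. The sketch below is not the sprdplot macro; it only assumes that spread is measured by the IQR and level by the median within each cell, and the helper names are hypothetical. The artificial cells are built so that spread grows as the square of the level, so the fitted slope should be 2 and the suggested power −1 (a reciprocal, as for 1/Time).

```python
import math
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def spread_level_power(groups):
    """Return (slope b, suggested power 1 - b) from log(IQR) vs. log(median)."""
    log_level, log_spread = [], []
    for values in groups:
        q1, _, q3 = statistics.quantiles(values, n=4)
        log_level.append(math.log10(statistics.median(values)))
        log_spread.append(math.log10(q3 - q1))
    b = ols_slope(log_level, log_spread)
    return b, 1 - b

# Hypothetical cells whose spread grows as the square of their level
pattern = [-1.0, -0.5, 0.0, 0.5, 1.0]
groups = [[m + 0.1 * m * m * d for d in pattern] for m in (1.0, 2.0, 3.0, 4.0)]
slope, power = spread_level_power(groups)
```

In practice the estimated power would be rounded to a convenient value (such as −1, 0 or 0.5) before transforming.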

Use Spread vs. level plot on grouped x

··· baseba.sas

    %sprdplot(data=grouped,
              class=decile, var=Salary);
    %boxplot(data=grouped,
             class=decile, var=logsal, id=name);

[Figure, two panels: spread-level plot of log Spread vs. log Median Salary, points labeled 222 to 300, Slope: 0.97, Power: 0.0; boxplots of log Salary vs. Batting Average Decile (0 to 9), labeled points Kennedy, Schmidt, Smith, Sierra, Pasqua, Sax, Clark]

log Salary is again indicated

Dealing with heteroscedasticity: Regression

Divide an x variable into ordered groups (e.g., deciles)

baseba.sas ···

    proc rank data=baseball out=grouped groups=10;
       var batavgc;
       ranks decile;

[Figure, two panels: Salary (in 1000$) vs. Career Batting Average, labeled points Schofield, Kennedy, Schmidt, Virgil, Sundberg, Brunansky, Smith, Murphy, Winfield; Salary (1000$) vs. Median Batting Average]
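To make the grouping step concrete, here is a rough Python analogue of dividing a variable into ten ordered groups. It assumes the grouping rule floor(rank × groups / (n + 1)), which is my understanding of how PROC RANK assigns GROUPS= values; the function `rank_groups`, the simplified tie handling, and the batting averages are all made up for illustration.

```python
import math

def rank_groups(values, groups=10):
    """Assign each value to an ordered group 0..groups-1 via floor(rank * groups / (n + 1))."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    result = [0] * n
    for rank, i in enumerate(order, start=1):   # ranks 1..n, ties broken by position
        result[i] = math.floor(rank * groups / (n + 1))
    return result

# Made-up career batting averages (x 1000), all distinct
batting = [265, 301, 222, 248, 287, 254, 237, 280, 259, 274,
           310, 199, 243, 268, 291, 231, 252, 277, 262, 284]
decile = rank_groups(batting)
```

The groups are numbered from 0, which agrees with the decile axis (0 to 9) in the boxplot figure.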

Dealing with heteroscedasticity: Complex models

Fit model, get fitted values (ŷ) in output data set
Divide into ordered groups based on fitted value
Spread-level plot of log(IQR) vs. log(Median)
e.g., SLID, predicting Total Wages and Salaries

pontwages.sas

    proc glm data=pontario;
       class sex21;
       model ttwgs28c = sex21
             eage26c|eage26c       /* Age, Age^2 */
             yrsch18c|yrsch18c     /* Years of schooling & ^2 */
             vismn15;              /* Visible minority? */
       output out=stats r=residual p=fitted;
    run;
    proc rank data=stats out=grouped groups=10;
       var fitted;
       ranks decile;

    %sprdplot(data=grouped, var=ttwgs28c, class=decile);

Dealing with heteroscedasticity: Regression

Spread vs. level plots: plot log(|e_i|/σ) vs. log(x)
If linear, with slope b, transform y → y^p, with p = 1 − b.

[Figure: log |RSTUDENT| vs. log(X), Slope: 1.19, Power: 0]

Artificial data, generated so that σ ∼ x: Power = 0 → analyze log(y)

Part 3: Fitting and understanding linear models

Fitting linear models with SAS
Model diagnosis: Leverage and Influence
Visualizing influence: Partial regression plots
Model selection

[Figure: "SLID: Model for Total wages and salaries", spread-level plot of log IQR, ttwgs28c vs. log Median, ttwgs28c, points labeled 0 to 9, Slope: 0.56, Power: 0.5]

→ Analysis of √wages

Fitting linear models in SAS: PROC GLM

CLASS statement for discrete predictors → dummy (0/1) variables

    proc glm data=...;
       class A B C;
       model y = A;               /* one-way ANOVA */
       model y = A B C;           /* 3-way, main effects only */
       model y = A | B | C @2;    /* 3-way, all two-way terms */
       model y = A | B | C;       /* full 3-way ANOVA */

Nested effects: "B within A" → B(A)

    proc glm data=...;
       class prov district school;
       model reading = prov district(prov) school(district prov);

"Mixed" effects: discrete and continuous predictors

    proc glm data=...;
       class A B C;
       model y = A X;             /* one-way ANCOVA */
       model y = A|B X;           /* two-way ANCOVA */
       model y = A X(A);          /* separate slopes model */
       model y = A | X;           /* test equal slopes, A*X */

Fitting linear models in SAS: PROC GLM

PROC GLM
One (or more) quantitative response variable(s)
Multiple response variables → multivariate analyses or repeated measures

GLM model syntax: regression effects (covariates)

    proc glm data=...;
       model y = X1;                        /* simple linear regression */
       model y = X1 X2 X3;                  /* multiple linear regression */
       model y = X1-X5;                     /* multiple linear regression */
       model y = wages--education;          /* multiple linear regression */
       model y = X1 X1*X1 X1*X1*X1;         /* polynomial regression */
       model y = X1 X2 X1*X2;               /* interaction model */
       model y = X1 X2 X1*X1 X2*X2 X1*X2;   /* response surface */

Bar notation: A | B | C → A B C A*B A*C B*C A*B*C

    proc glm data=...;
       *-- same, using '|' notation;
       model y = X1 | X1 | X1;              /* polynomial regression */
       model y = X1 | X2;                   /* interaction model */
       model y = X1 | X1 | X2 | X2 @2;      /* response surface */
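The bar notation is just shorthand for "all main effects and all crossings", optionally capped at a maximum order with @. A small Python sketch of that expansion for distinct factors is given below; `expand_bar` is a hypothetical helper, it lists terms by order as on the slide (SAS itself may order the terms differently), and it does not try to handle repeated variables such as X1 | X1.

```python
from itertools import combinations

def expand_bar(factors, max_order=None):
    """Expand A|B|C into main effects and crossed terms, optionally capped like @2."""
    top = len(factors) if max_order is None else min(max_order, len(factors))
    terms = []
    for order in range(1, top + 1):
        terms += ["*".join(combo) for combo in combinations(factors, order)]
    return terms

full = expand_bar(["A", "B", "C"])                  # A | B | C
two_way = expand_bar(["A", "B", "C"], max_order=2)  # A | B | C @2
```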

Fitting linear models in SAS: PROC REG

PROC REG
One (or more) quantitative response variable(s)
+ Extensive facilities for regression diagnostics
+ Model selection methods: stepwise, forward, backward
+ PLOT statement → plots of any data or computed variables

    proc reg data=...;
       model y = X1 X2 X3 /       /* MRA, influence stats */
             influence partial;
       plot nqq. * r.;            /* Normal QQ plot */
       model y = X1-X5 /          /* MRA, model selection */
             selection=stepwise sle=0.10;

Fitting linear models in SAS: PROC GLM

REPEATED statement for repeated measures analysis: univariate and multivariate tests

    proc glm data=...;
       class A B;
       model t1-t4 = A | B / nouni;     /* 2 Between, 1 Within-S factor */
       repeated trials 4 polynomial;    /* classical univar. analysis */
       manova h = A | B;                /* MANOVA tests */

Mixed and random effects models

    proc glm data=...;
       class person age sex;
       random person;                   /* person random, nested w/in sex */
       model y = age|sex age|person(sex);
       test h=sex e=person(sex);                /* specify error terms */
       test h=age age*sex e=age*person(sex);    /* specify error terms */

Handled better in PROC MIXED

Linear models in SAS: Other procedures

PROC RSREG: Response surface models
Automatically generates all squared terms, $x_1^2, x_2^2, \ldots$, and interaction effects, $x_1 x_2, x_1 x_3, \ldots$
→ simple way to test for (quadratic) non-linearity and interactions.

PROC SURVEYREG: Regression for sample survey data
Handles complex survey designs: stratification, clustering, unequal weighting

PROC LIFEREG: Linear models for failure-time data
Response can be left, right or interval censored
More general error distributions (extreme value, exponential, ...)

PROC TRANSREG: Linear models with variable transformations
Quantitative variables: splines, response surface, powers, ranks, ...
Discrete variables: dummy (CLASS), optimal category scores, ...

Fitting linear models in SAS: PROC REG

− no CLASS statement: must create dummy variables (DUMMY macro)
− no | notation: must create interaction terms (INTERACT macro)

    data test;
       input x y group $ sex $ @@;
    cards;
     5 10 A M    8 12 A F    9 13 A M
    10 18 B M   16 19 B M   10 16 B F
    15 21 C M   13 19 C F   15 20 C M
    ;
    *-- Dummy variables for Sex and Group;
    %dummy(data=test, var=sex group, prefix=Sex_ Gp_);
    *-- Interaction of X*Sex;
    %interact(data=test, v1=x, v2=Sex_F, names=XSex);
    proc print noobs; run;

Produces:

     x     y    group    sex    SEX_F    GP_A    GP_B    XSex
     5    10      A       M       0        1       0       0
     8    12      A       F       1        1       0       8
     9    13      A       M       0        1       0       0
    10    18      B       M       0        0       1       0
    16    19      B       M       0        0       1       0
    10    16      B       F       1        0       1      10
    15    21      C       M       0        0       0       0
    13    19      C       F       1        0       0      13
    15    20      C       M       0        0       0       0
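For readers who want to see what the two macros compute, the same table can be rebuilt in a few lines of Python. This is a sketch of the coding scheme only (the last level of each factor is the reference category, and XSex is the product of x and the Sex dummy); the helper `dummies` is made up and says nothing about how the DUMMY and INTERACT macros are implemented.

```python
rows = [(5, 10, "A", "M"), (8, 12, "A", "F"), (9, 13, "A", "M"),
        (10, 18, "B", "M"), (16, 19, "B", "M"), (10, 16, "B", "F"),
        (15, 21, "C", "M"), (13, 19, "C", "F"), (15, 20, "C", "M")]

def dummies(value, levels, prefix):
    """0/1 indicators for all but the last level (the reference category)."""
    return {prefix + level: int(value == level) for level in levels[:-1]}

table = []
for x, y, group, sex in rows:
    row = {"x": x, "y": y, "group": group, "sex": sex}
    row.update(dummies(sex, ["F", "M"], "SEX_"))        # SEX_F
    row.update(dummies(group, ["A", "B", "C"], "GP_"))  # GP_A, GP_B
    row["XSex"] = x * row["SEX_F"]                      # interaction of x with the Sex dummy
    table.append(row)
```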

Unusual data: Leverage and Influence

"Unusual" observations can have dramatic effects on least squares estimates in linear models
Three archetypal cases:
• Typical X (low leverage), bad fit
• Unusual X (high leverage), good fit
• Unusual X (high leverage), bad fit
Influential observations: unusual in both X and Y.
Heuristic formula:

    Influence = X leverage × Y residual

Linear models in SAS: Other procedures

PROC LOGISTIC: Logistic Regression
Logit and probit models for binary response data
Models for ordinal discrete responses

PROC GENMOD: Generalized linear models
Classical linear models, logistic and probit models (binary data), log-linear models, ...
Analysis of correlated data via Generalized Estimating Equations (GEE)

PROC MIXED: Mixed models
Generalizes standard GLM to provide for correlated errors and nonconstant variance
Provides for modelling both response means (fixed effects) and variance-covariance parameters (random effects)
Common scenarios: clustered/hierarchical data, and repeated/longitudinal data

Dramatic example: Davis' data on reported and measured weight of men and women

[Figure: "Self-Reports of Height and Weight", Reported weight in Kg vs. Measured weight in Kg, by Sex of subject (F, M)]

[Figure, four panels of y vs. x: Original data; O-: Low leverage, Outlier; -L: High leverage, good fit; OL: High leverage, Outlier]

Detecting outliers: Studentized Residuals

Ordinary residuals: $e_i = y_i - \hat{y}_i$, not useful because:
Even if the errors, $\varepsilon_i$, have constant variance (as assumed), the residuals do not: the variance of $e_i$ varies inversely with leverage,

    $\mathrm{Var}(e_i) = \sigma^2 (1 - h_i)$

Outliers on Y pull the regression line (surface) toward them

Studentized residuals: standardized residual (RSTUDENT) calculated for $y_i$ deleting observation i. Using subscript $(-i)$ to mean deleting i,

    $\mathrm{RSTUDENT} \equiv e_i^\star = \dfrac{e_i}{s_{(-i)} \sqrt{1 - h_i}}$

Gives a test for the "mean-shift" outlier model, $E(y_i \mid X) \ne E(y_{(-i)} \mid X)$; under $H_0$ (no shift), $e_i^\star \sim t(n - p - 2)$
• → $|e_i^\star| > t_{1-\alpha/2}(n - p - 2)$ significant a priori
• → $|e_i^\star| > t_{1-\alpha/(2n)}(n - p - 2)$ significant a posteriori (Bonferroni)

Measuring Leverage

Leverage: measured by "hat values," $h_i$, so-called because fitted values can be expressed as $\hat{y} = H y$
For simple linear regression, $h_i \sim (x_i - \bar{x})^2$
For p predictors, $h_i \sim$ squared distance of $x_i$ from the centroid, $\bar{x}$ (Mahalanobis squared distance)
All hat values range from 1/n to 1, and their average is $\bar{h} = (p + 1)/n$.
→ observations with $h_i > 2\bar{h}$ (or $h_i > 3\bar{h}$ in small samples) are typically considered "high leverage" points
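For simple linear regression the hat values have the closed form h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², so the cutoff rule is easy to try out by hand. The sketch below uses made-up x values with one extreme point; `hat_values` is an illustrative helper, not a SAS facility.

```python
def hat_values(x):
    """Hat values for simple linear regression: 1/n + (x_i - mean)^2 / sum of squares."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1 / n + (xi - mean) ** 2 / sxx for xi in x]

x = [1, 2, 3, 4, 20]                 # made-up predictor with one extreme value
h = hat_values(x)
p = 1                                # one predictor
h_bar = (p + 1) / len(x)             # average hat value, (p + 1) / n
high_leverage = [xi for xi, hi in zip(x, h) if hi > 2 * h_bar]
```

The hat values sum to p + 1 = 2, each lies between 1/n and 1, and only the extreme point exceeds twice the average.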

Influence diagnostics with SAS

PROC REG
influence option on the model statement gives printed values

inflplot macro
Fits model using PROC REG, influence statistics → output data set
Plots RSTUDENT vs. Hat value, bubble size ∼ Cook's D or DFFITS
Labels "noteworthy" observations: large RSTUDENT and/or Hat value
Shows nominal cutoffs for "unusual" values

Similar macros
inflogis macro: logistic regression (PROC LOGISTIC)
inflglim macro: generalized linear models (PROC GENMOD)

See: http://www.math.yorku.ca/SCS/sssg/inflplot.html

Influence = Leverage × Residual

Cook's D: scale-invariant (squared) measure of "distance" between $\hat\beta$ (all) and $\hat\beta_{(-i)}$ (deleting obs. i)

    $\mathrm{COOKD}_i \equiv D_i = \left( \dfrac{e_i^2}{(p + 1)\, s^2} \right) \times \dfrac{h_i}{(1 - h_i)^2}$

"Large" values: $D_i > 4/n$ [or $D_i > 4/(n - p - 1)$]

DFFITS: scaled measure of (signed) change in predicted value for $y_i$, deleting obs. i

    $\mathrm{DFFITS}_i = \dfrac{\hat{y}_i - \hat{y}_{(-i)}}{s_{(-i)} \sqrt{h_i}} = \left( \dfrac{e_i}{s_{(-i)}} \right) \times \dfrac{\sqrt{h_i}}{1 - h_i}$

"Large" values: $|\mathrm{DFFITS}_i| > 2 \sqrt{(p + 1)/n}$
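The deletion statistics do not require refitting the model n times: for least squares, the error sum of squares without observation i is SSE − e_i²/(1 − h_i). The following sketch applies this to simple linear regression (p = 1) on made-up data and computes RSTUDENT, Cook's D and DFFITS as defined above. The function names are hypothetical, and this is an illustration rather than what PROC REG or the inflplot macro does internally.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def influence(x, y):
    """RSTUDENT, Cook's D and DFFITS for simple linear regression (p = 1)."""
    n, p = len(x), 1
    a, b = fit_line(x, y)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    h = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    sse = sum(ei ** 2 for ei in e)
    s2 = sse / (n - p - 1)
    out = []
    for ei, hi in zip(e, h):
        s2_del = (sse - ei ** 2 / (1 - hi)) / (n - p - 2)   # error variance without obs. i
        rstudent = ei / math.sqrt(s2_del * (1 - hi))
        cookd = ei ** 2 / ((p + 1) * s2) * hi / (1 - hi) ** 2
        dffits = ei / math.sqrt(s2_del) * math.sqrt(hi) / (1 - hi)
        out.append({"h": hi, "rstudent": rstudent, "cookd": cookd, "dffits": dffits})
    return out

x = [1, 2, 3, 4, 5, 6, 12]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 6.0]     # made-up data; the last point is off the line
stats = influence(x, y)
n = len(x)
flagged = [i for i, s in enumerate(stats) if s["cookd"] > 4 / n]
```

Here the last observation combines high leverage with a poor fit, so it is flagged by the 4/n rule for Cook's D.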

[Figure: "Duncan data: Influence Plot" (bubble size: Cook's Distance), Studentized Residual vs. Leverage (Hat Value); labeled points: Minister, Reporter, RR Conductor, Contractor, RR Engineer]

Example: Duncan's Occupational Prestige Data

PROC REG step, with influence option

duncinfl2.sas ···

    %include data(duncan);
    proc reg data=duncan;
       model prestige = Income Educ / influence;
       id job;
    run;

inflplot macro:

··· duncinfl2.sas

    title 'Duncan data: Influence Plot';
    title2 "Bubble size: Cook's Distance";
    %inflplot(data=duncan,
              y=Prestige,       /* response          */
              x=Income Educ,    /* predictors        */
              id=job,           /* ID variable       */
              bubble=cookd      /* bubble ~ Cook's D */
              );

Partial regression plots

Problems
Correlated predictors: ordinary scatterplots cannot show the unique effects of one predictor, controlling for others
Joint influence: single deletion diagnostics cannot show whether sets of observations are jointly influential, or offset each other

Solution: Partial regression (added-variable) plots
For $x_k$, plot y | other xs vs. $x_k$ | other xs. (others ≡ $X[-k]$)

    $y \mid \text{others} \equiv y_k^\star = y - \hat{y}_{X[-k]}$
    $x_k \mid \text{others} \equiv x_k^\star = x_k - \hat{x}_{X[-k]}$

$y_k^\star$ = residuals from regression of y on $X[-k]$
$x_k^\star$ = residuals from regression of $x_k$ on $X[-k]$
→ unique relation of y to $x_k$, controlling/adjusting for all other xs.

Example: Duncan's Occupational Prestige Data

Influence on coefficients is substantial:

All n = 45 cases

                            Parameter Estimates
                                Parameter    Standard
    Variable     Label      DF   Estimate       Error   t Value   Pr > |t|
    Intercept    Intercept   1   -6.06466     4.27194     -1.42     0.1631
    income       Income      1    0.59873     0.11967      5.00     <.0001
    educ         Education   1    0.54583     0.09825      5.56     <.0001

Deleting Minister, RR Conductor, RR Engineer

                            Parameter Estimates
                                Parameter    Standard
    Variable     Label      DF   Estimate       Error   t Value   Pr > |t|
    Intercept    Intercept   1   -6.31736     3.67962     -1.72     0.0939
    income       Income      1    0.93066     0.15375      6.05     <.0001
    educ         Education   1    0.28464     0.12136      2.35     0.0242

Partial regression plots: Example

PROC REG step, with partial option → printer plots

duncan4.sas ···

    %include data(duncan);
    proc reg data=duncan;
       model prestige = Income Educ / partial;
       id job;
    run;

partial macro: high-res plots

··· duncan4.sas

    %partial(data=duncan,
             yvar=Prestige,      /* response              */
             xvar=Income Educ,   /* predictors            */
             id=job,             /* ID variable           */
             label=INFL          /* label influential pts */
             );

Partial regression plots: Properties

slope of $y_k^\star$ on $x_k^\star$ = $b_k$, the estimate of the (partial) regression coefficient, $\beta_k$, in the full model.
residuals from the regression line in this plot ≡ residuals for y in the full model, i.e., $y_k^\star = b_k x_k^\star + e$
simple correlation between $y_k^\star$ and $x_k^\star$ = partial correlation between y and $x_k$ with the other x variables partialled out or controlled.
plot shows partial leverage ($\sim (x_{ik}^\star)^2$) and influence
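The first property (the slope in the partial regression plot equals the full-model coefficient) can be verified numerically. The sketch below uses made-up data with two correlated predictors: it residualizes y and x1 on x2, regresses one set of residuals on the other, and compares the slope with the x1 coefficient obtained from the two-predictor normal equations. All function names and data are illustrative assumptions.

```python
def center(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def residuals(y, x):
    """Residuals from the simple least-squares regression of y on x."""
    yc, xc = center(y), center(x)
    b = dot(xc, yc) / dot(xc, xc)
    return [yi - b * xi for xi, yi in zip(xc, yc)]

def full_model_b1(y, x1, x2):
    """Coefficient of x1 in the regression of y on x1 and x2 (normal equations)."""
    yc, x1c, x2c = center(y), center(x1), center(x2)
    s11, s22, s12 = dot(x1c, x1c), dot(x2c, x2c), dot(x1c, x2c)
    return (dot(x1c, yc) * s22 - dot(x2c, yc) * s12) / (s11 * s22 - s12 ** 2)

# Made-up data with correlated predictors
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 15.8]

y_star = residuals(y, x2)                 # y | x2
x_star = residuals(x1, x2)                # x1 | x2
b1_partial = dot(x_star, y_star) / dot(x_star, x_star)
b1_full = full_model_b1(y, x1, x2)
```

The two slopes agree up to rounding error, which is why points that stand out in the partial plot show directly how they pull on the coefficient.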

[Figure: Partial prestige vs. Partial income; labeled points: Minister, Reporter, RR Conductor, Contractor, RR Engineer]

Minister and RR Conductor are jointly influential: they decrease the slope for Income

[Figure: Partial prestige vs. Partial educ; labeled points: Minister, Reporter, RR Conductor, Contractor, RR Engineer]

Minister and RR Conductor are jointly influential: they increase the slope for Education

3D view: Deleting 3 influential cases

[Figure: 3D plot of Prestige vs. Education and Income]

3D view:

[Figure: "Observations and Fitted Response Surface", 3D plot of Prestige vs. Education and Income; labeled points: Minister, Reporter, RR Conductor, Contractor, RR Engineer]

Model selection: Example

Response: Fuel consumption (per capita) in US states
Predictors:
    TAX       state tax rate on motor fuel
    DRIVERS   proportion of licensed drivers
    ROAD      length of federal highways
    INC       per capita personal income
    POP       population

fuelcp.sas

    %include data(fuel);
    %cpplot(data=fuel,
            yvar=fuel,                       /* response     */
            xvar=tax drivers road inc pop,   /* predictors   */
            gplot=CP F AIC,                  /* what to plot */
            plotchar=T D R I P               /* var labels   */
            );

Model selection for Linear Models

Opposing goals
Want $R^2$ large or MSE small → include many variables
Want parsimony, simplicity of interpretation, small cost of data collection → include few variables

Selection criteria
$R^2 = SS_{model} / SS_{total}$
• Cannot decrease as p increases
Mallows' $C_p = \dfrac{SSE_P}{MSE_{full}} - (n - 2P)$
• Measures total squared error (random error + bias) using P = p + 1 parameters out of m available in the full model
• $C_p$ is directly related to the incremental $F_p$ for testing predictors omitted from the model, $H_0: \beta_{p+1} = \cdots = \beta_m = 0$
• "Good" models (no bias): $C_p \approx P$ or $F_p \approx 1$
• Model comparisons: plot $C_p$ or $F_p$ vs. P (CPPLOT macro)
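A quick numeric check of the C_p formula: for the full model itself, SSE/MSE equals n − P, so C_p equals P exactly. The sketch below uses hypothetical sums of squares (n = 48 and the SSE values are made up) to show this, along with a smaller model whose C_p is well above its P, a sign of bias from omitted predictors.

```python
def mallows_cp(sse_P, mse_full, n, P):
    """Mallows' C_p = SSE_P / MSE_full - (n - 2P), with P = p + 1 parameters."""
    return sse_P / mse_full - (n - 2 * P)

# Hypothetical numbers: n cases, full model with 6 parameters
n, P_full = 48, 6
sse_full = 210.0
mse_full = sse_full / (n - P_full)

cp_full = mallows_cp(sse_full, mse_full, n, P_full)   # full model: C_p equals P
cp_sub = mallows_cp(260.0, mse_full, n, 4)            # a 4-parameter submodel
```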

CPPLOT macro: $F_p$ vs. p plot

[Figure: "Fuel Consumption across the US", F for Omitted Variables vs. Number of parameters in model (2 to 6); each point is labeled by the variables in that model (T, D, R, I, P)]

CPPLOT macro: $C_p$ vs. p plot

[Figure: "Fuel Consumption across the US", Mallows C(p) vs. Number of parameters in model (2 to 6); each point is labeled by the variables in that model (T, D, R, I, P)]

CPPLOT macro: AIC vs. p plot

[Figure: "Fuel Consumption across the US", Akaike's information criterion (400 to 460) vs. Number of parameters in model (2 to 6); each point is labeled by the variables in that model (T, D, R, I, P)]

Model selection for Linear Models

Parsimony measures
Combine badness of fit (SSE) with penalty for more parameters (p)
• AIC - Akaike's Information Criterion: AIC = n ln(SSE/n) + 2p
• BIC - Sawa's Bayesian Information Criterion: larger penalty
Smaller is better
No p-values: they are model comparison statistics, rather than test statistics

Candidate models
All criteria and model selection methods should be considered as nominating a few candidate models, to be explored more fully
Things to worry about:
• Influential observations: a few cases can determine choice of "best" model
• Model misspecification: nonlinear effects, omitted predictors
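The trade-off in AIC is easy to see numerically. In the sketch below (all SSE values and n = 48 are hypothetical), adding a parameter that does not reduce SSE raises AIC by exactly 2, while adding one that reduces SSE enough lowers it.

```python
import math

def aic(sse, n, p):
    """AIC = n ln(SSE/n) + 2p for a least-squares fit with p parameters."""
    return n * math.log(sse / n) + 2 * p

n = 48                                    # hypothetical sample size
small = aic(sse=260.0, n=n, p=4)          # hypothetical 4-parameter model
big_same_fit = aic(sse=260.0, n=n, p=5)   # extra parameter, no gain in fit
big_better = aic(sse=210.0, n=n, p=5)     # extra parameter, lower SSE
```

Only differences in AIC between models fitted to the same data are meaningful; the absolute value has no interpretation on its own.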

Model selection procedures

Automated selection
Forward selection
• Start with none
• At each step, add variable with largest incremental F-value, if Pr(F) < SLENTER
• Until: no remaining variable has Pr(F) < SLENTER
Backward elimination
• Start with all
• At each step, remove variable with smallest incremental F-value, if Pr(F) > SLSTAY
• Until: all remaining variables have Pr(F) < SLSTAY
Stepwise selection
• Start with two forward steps
• Alternate: forward step, backward step if Pr(F) > SLSTAY
• Until: none can be added or removed

Sound good, but all are dangerous!

Model selection procedures

All possible regressions
$2^p - 1$ of them, e.g., p = 10 ⇒ 1023 models
Rank "best" candidate models via $R^2$, $C_p$, AIC

    proc reg data=fuel;
       model fuel = tax drivers road inc pop /
             selection=rsquare best=3 cp aic;

R-Square Selection Method

    Number in
      Model     R-Square       C(p)       AIC    Variables in Model
        1         0.4886    27.2658    423.68    drivers
        1         0.2141    65.5021    444.30    pop
        1         0.2037    66.9641    444.93    tax
    --------------------------------------------------------------
        2         0.6175    11.2968    411.73    drivers inc
        2         0.5567    19.7727    418.82    tax drivers
        2         0.5382    22.3532    420.78    drivers pop
    --------------------------------------------------------------
        3         0.6749     5.3057    405.93    tax drivers inc
        3         0.6522     8.4600    409.17    tax drivers pop
        3         0.6249    12.2636    412.79    drivers road inc
    --------------------------------------------------------------
        4         0.6956     4.4172    404.77    tax drivers inc pop
        4         0.6787     6.7723    407.37    tax drivers road inc
        4         0.6687     8.1598    408.83    tax drivers road pop
    --------------------------------------------------------------
        5         0.6986     6.0000    406.30    tax drivers road inc pop

But, influence plot shows several influential cases:

[Figure: "Fuel Consumption: Influence Plot", Studentized Residual vs. Leverage (Hat Value); labeled points: CA, IL, NV, NY, RI, SD, TX, WY]

Model selection procedures

    proc reg data=fuel;
       model fuel = tax drivers road inc pop /
             selection=stepwise;

Summary Output:

    All variables left in the model are significant at the 0.1500 level.
    No other variable met the 0.1500 significance level for entry.

                    Summary of Stepwise Selection

            Variable    Partial      Model
    Step    Entered     R-Square    R-Square      C(p)    F Value    Pr > F
      1     drivers      0.4886      0.4886     27.265      43.94    <.0001
      2     inc          0.1290      0.6175     11.296      15.17    0.0003
      3     tax          0.0573      0.6749      5.305       7.76    0.0078
      4     pop          0.0207      0.6956      4.417       2.93    0.0942

All criteria and procedures agree on this as best or 2nd best
What's not to like?

Using log(Population density) = log(pop/area) alone: $R^2$ = .73 !

[Figure: "Fuel data: Considering population density", Fuel consumption (/person) vs. log Population Density; labeled points: CA, NV, NY, RI, SD, TX, WY]

Adding drivers: $R^2$ = .79, a simple, sensible, and interpretable model

Partial residual plots suggest model misspecification:
fuel was measured per capita, so why should pop matter?
→ California accounts for most of the effect of pop

[Figure, two panels: Partial fuel vs. Partial drivers; Partial fuel vs. Partial pop; labeled points: CA, IL, NV, NY, RI, TX, WY]

Why is fuel under-predicted for Wyoming, over-predicted for Rhode Island?
→ suggests to look at population density

References

Box, G. E. P. and Cox, D. R. An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B, 26:211–252, 1964.
Box, G. E. P. and Tidwell, P. W. Transformation of the independent variables. Technometrics, 4:531–550, 1962.
Emerson, J. D. and Stoto, M. A. Exploratory methods for choosing power transformations. Journal of the American Statistical Association, 77:103–108, 1982.
Friendly, M. SAS System for Statistical Graphics. SAS Institute, Cary, NC, 1st edition, 1991.
Friendly, M. Visualizing Categorical Data. SAS Institute, Cary, NC, 2000.
Tukey, J. W. Exploratory Data Analysis. Addison Wesley, Reading, MA, 1977.

Data analysis = Summarization + Exposure + Tools + Understanding

Effective data analysis requires Summarization and Exposure:
Summarization:
    Parameter estimates ($\beta_1, \beta_2, \cdots$)
    Hypothesis tests ($H_0: \beta_1 = 0$)
    Comparisons & differences (with standard errors!)
Exposure:
    Visualize data and predicted values
    Understand patterns and trends
    Detect anomalies

Effective data analysis requires Tools and Understanding:
Tools:
    Statistical models and methods for complex problems
    Computational and graphical methods must be available and easy to use
Understanding:
    What models can do, and what they can't
    What to change when things go wrong