Post on 22-Jul-2020
Chapter 9
Multiple Regression and Correlation
2014/5/15
9.1 Introduction
• One dependent variable $Y$ and $k$ independent variables $x_1, \ldots, x_k$
• $Y$: dependent variable, also called the response variable
• $x_1, \ldots, x_k$: independent variables, also called explanatory variables or predictor variables
9.2 Multiple Regression Model
• Model
$$y_j = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_k x_{kj} + e_j, \quad j = 1, \ldots, n$$
$$e_j \overset{iid}{\sim} N(0, \sigma^2)$$
(iid: independently and identically distributed)
• Interpreting the coefficients
e.g. 2 independent variables ($Y$: length of hospital stay, $x_1$: number of previous admissions, $x_2$: age)
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + e$$
• $\beta_0 = E(Y \mid x_1 = x_2 = 0)$: the expected value of $Y$ when $x_1$ and $x_2$ are both 0. Centering is needed for this to be meaningful.
• $\beta_1$: the increment of $E(Y)$ corresponding to a unit increase of $x_1$ when $x_2$ is fixed:
$$E[y \mid x_1 = a+1, x_2 = b] - E[y \mid x_1 = a, x_2 = b] = \{\beta_0 + \beta_1 (a+1) + \beta_2 b\} - \{\beta_0 + \beta_1 a + \beta_2 b\} = \beta_1$$
That is, $\beta_1$ is the effect of $x_1$ on $Y$ after controlling (adjusting) for the effect of $x_2$.
• $\beta_2$: likewise, the increment of $E(Y)$ corresponding to a unit increase of $x_2$ when $x_1$ is fixed.
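The interpretation of a slope as a difference of two expected values can be checked numerically. The sketch below is only an illustration: the coefficient values `B0`, `B1`, `B2` and the function name `expected_y` are invented here and do not come from the hospital-stay example.

```python
# Hypothetical coefficients for E(Y) = B0 + B1*x1 + B2*x2 (illustrative values only).
B0, B1, B2 = 2.0, 1.5, 0.1


def expected_y(x1, x2):
    """Mean response at (x1, x2) under the two-variable model."""
    return B0 + B1 * x1 + B2 * x2


# Raise x1 by one unit while x2 stays fixed at b: the change in E(Y) equals B1,
# whatever the starting values a and b are.
for a, b in [(0, 30), (3, 30), (3, 65)]:
    diff = expected_y(a + 1, b) - expected_y(a, b)
    print(f"a={a}, b={b}: E(Y) changes by {diff:.2f}")
```

The same difference taken in `x2` with `x1` fixed would return `B2`.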
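9.3 Estimating the Regression Coefficients
• Estimate $\beta_0, \beta_1, \beta_2$ by the values that minimize $L$:
$$L = \sum e_j^2 = \sum (y_j - \beta_0 - \beta_1 x_{1j} - \beta_2 x_{2j})^2$$
$$\frac{\partial L}{\partial \beta_0} = \frac{\partial L}{\partial \beta_1} = \frac{\partial L}{\partial \beta_2} = 0$$
• Normal equations
$$n b_0 + b_1 \sum x_{1j} + b_2 \sum x_{2j} = \sum y_j$$
$$b_0 \sum x_{1j} + b_1 \sum x_{1j}^2 + b_2 \sum x_{1j} x_{2j} = \sum x_{1j} y_j$$
$$b_0 \sum x_{2j} + b_1 \sum x_{1j} x_{2j} + b_2 \sum x_{2j}^2 = \sum x_{2j} y_j$$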
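For two predictors, the normal equations form a 3 x 3 linear system in $b_0, b_1, b_2$. A minimal sketch of building and solving that system follows; the toy data and the helper names `solve` and `dot` are assumptions made for illustration, and the data are constructed so that $y = 1 + 2x_1 + 3x_2$ holds exactly.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def dot(u, v):
    """Sum of products of two equal-length lists."""
    return sum(p * q for p, q in zip(u, v))


# Invented toy data: y = 1 + 2*x1 + 3*x2 with no error term.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]
n = len(y)

# Left-hand and right-hand sides of the three normal equations.
lhs = [
    [n, sum(x1), sum(x2)],
    [sum(x1), dot(x1, x1), dot(x1, x2)],
    [sum(x2), dot(x1, x2), dot(x2, x2)],
]
rhs = [sum(y), dot(x1, y), dot(x2, y)]

b0, b1, b2 = solve(lhs, rhs)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Because the toy response has no error, the solution recovers the generating coefficients up to rounding; with real data the same system yields the least-squares estimates.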
Homework
• Exercise 9.3.4
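9.4 Evaluating the Regression Model
• Multiple coefficient of determination
Total sum of squares = explained sum of squares + unexplained sum of squares
$$SST = SSR + SSE$$
$$R^2_{y.12\ldots k} = \frac{\sum (\hat{y}_j - \bar{y})^2}{\sum (y_j - \bar{y})^2} = \frac{SSR}{SST}$$
ANOVA Table
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$
$$H_A: \text{not } H_0$$
If V.R. $> F(k,\ n-k-1,\ \alpha)$, then reject $H_0$, where V.R. is the variance ratio $\dfrac{SSR/k}{SSE/(n-k-1)}$.
For each $i$: $b_i \sim N(\beta_i,\ c_{ii}\sigma^2)$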
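The sum-of-squares decomposition, $R^2$ and the variance ratio can be traced on a very small data set. The sketch below uses an invented 2 x 2 factorial design whose predictors are centered and orthogonal, an assumption chosen only so that the least-squares estimates reduce to simple ratios; it does not reproduce any table from the text.

```python
# Invented toy design: x1 and x2 are centered and mutually orthogonal,
# so b0 is the mean of y and each slope is sum(x*y) / sum(x*x).
x1 = [-1.0, -1.0, 1.0, 1.0]
x2 = [-1.0, 1.0, -1.0, 1.0]
y = [3.0, 5.0, 8.0, 12.0]
n, k = len(y), 2

y_bar = sum(y) / n
b0 = y_bar
b1 = sum(u * v for u, v in zip(x1, y)) / sum(u * u for u in x1)
b2 = sum(u * v for u, v in zip(x2, y)) / sum(u * u for u in x2)
fitted = [b0 + b1 * u + b2 * v for u, v in zip(x1, x2)]

sst = sum((v - y_bar) ** 2 for v in y)               # total sum of squares
ssr = sum((f - y_bar) ** 2 for f in fitted)          # explained by the regression
sse = sum((v - f) ** 2 for v, f in zip(y, fitted))   # unexplained (residual)

r_squared = ssr / sst
variance_ratio = (ssr / k) / (sse / (n - k - 1))     # V.R. = MSR / MSE

print(f"SST={sst:.2f}  SSR={ssr:.2f}  SSE={sse:.2f}")
print(f"R^2={r_squared:.4f}  V.R.={variance_ratio:.2f}")
```

The variance ratio would then be compared with the tabulated $F$ value on $k$ and $n-k-1$ degrees of freedom; the table lookup is left out of the sketch.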
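Matrix
$$\underset{n \times 1}{y} = \underset{n \times (k+1)}{X}\ \underset{(k+1) \times 1}{\beta} + \underset{n \times 1}{e}$$
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$
$$L = (y - X\beta)'(y - X\beta) = y'y - 2\beta'X'y + \beta'X'X\beta$$
$$\frac{\partial L}{\partial \beta} = -2X'y + 2X'X\beta = 0$$
$$(X'X)\beta = X'y$$
$$\hat{\beta} = (X'X)^{-1}X'y$$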
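$$\text{LSE: } \hat{\beta} = (X'X)^{-1}X'Y$$
$$X'X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & & \vdots \\ x_{k1} & x_{k2} & \cdots & x_{kn} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ 1 & x_{13} & x_{23} & \cdots & x_{k3} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{pmatrix} = \begin{pmatrix} n & \sum x_{1j} & \cdots & \sum x_{kj} \\ \sum x_{1j} & \sum x_{1j}^2 & \cdots & \sum x_{1j} x_{kj} \\ \vdots & \vdots & & \vdots \\ \sum x_{kj} & \sum x_{1j} x_{kj} & \cdots & \sum x_{kj}^2 \end{pmatrix}$$
$$X'Y = \begin{pmatrix} \sum y_j \\ \sum x_{1j} y_j \\ \vdots \\ \sum x_{kj} y_j \end{pmatrix}$$
$$\widehat{Var}(\hat{\beta}) = (X'X)^{-1}\hat{\sigma}^2$$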
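when $k = 2$
$$X'X = \begin{pmatrix} n & \sum x_{1j} & \sum x_{2j} \\ \sum x_{1j} & \sum x_{1j}^2 & \sum x_{1j} x_{2j} \\ \sum x_{2j} & \sum x_{1j} x_{2j} & \sum x_{2j}^2 \end{pmatrix}, \qquad (X'X)^{-1} = \begin{pmatrix} C_{00} & C_{01} & C_{02} \\ C_{01} & C_{11} & C_{12} \\ C_{02} & C_{12} & C_{22} \end{pmatrix}$$
$$\widehat{Var}(\hat{\beta}) = \begin{pmatrix} var(b_0) & cov(b_0, b_1) & cov(b_0, b_2) \\ cov(b_0, b_1) & var(b_1) & cov(b_1, b_2) \\ cov(b_0, b_2) & cov(b_1, b_2) & var(b_2) \end{pmatrix} = \hat{\sigma}^2 \begin{pmatrix} C_{00} & C_{01} & C_{02} \\ C_{01} & C_{11} & C_{12} \\ C_{02} & C_{12} & C_{22} \end{pmatrix}$$
$$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} = (X'X) \begin{pmatrix} C_{00} & C_{01} & C_{02} \\ C_{01} & C_{11} & C_{12} \\ C_{02} & C_{12} & C_{22} \end{pmatrix}$$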
Testing
Hypothesis: $H_0: \beta_i = 0$, $H_A: \beta_i \neq 0$
Test statistic: $t = \dfrac{b_i}{s_{b_i}}$, where $s_{b_i} = s_{y.12\ldots k}\sqrt{C_{ii}}$ is the standard error of $b_i$
If $|t| > t_{1-\alpha/2}(n-k-1)$, then reject $H_0$.
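The chain from $(X'X)^{-1}$ to standard errors and $t$ statistics can be sketched without any library. Everything below is an illustration under stated assumptions: the six data points are invented, the helper `inverse` is a plain Gauss-Jordan routine written for this sketch, and the critical value $t_{1-\alpha/2}(n-k-1)$ still has to be looked up in a table.

```python
import math


def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


# Invented toy data with k = 2 predictors (not taken from the text).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [9.5, 7.6, 19.2, 18.1, 28.7, 27.3]
n, k = len(y), 2

X = [[1.0, u, v] for u, v in zip(x1, x2)]  # design matrix with intercept column
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yj for r, yj in zip(X, y)) for i in range(3)]

C = inverse(XtX)                                         # C = (X'X)^{-1}
b = [sum(C[i][j] * Xty[j] for j in range(3)) for i in range(3)]
fitted = [sum(bi * xi for bi, xi in zip(b, r)) for r in X]
sse = sum((yj - f) ** 2 for yj, f in zip(y, fitted))
s2 = sse / (n - k - 1)                                   # residual mean square
se = [math.sqrt(s2 * C[i][i]) for i in range(3)]         # s_{b_i} = s * sqrt(C_ii)
t = [bi / sei for bi, sei in zip(b, se)]

for i in range(3):
    print(f"b{i} = {b[i]:.3f}  se = {se[i]:.3f}  t = {t[i]:.2f}")
```

Each `t[i]` is compared in absolute value with the tabulated $t$ quantile on $n-k-1$ degrees of freedom.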
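9.5 Using the Multiple Regression Equation
Application
• Estimating the mean of Y for a given X: confidence interval for the mean of the subpopulation of Y values at specified values of the $X_i$
• Predicting Y for a given X: prediction interval for a single Y value at specified values of the $X_i$
Confidence interval for the mean ($k = 2$):
$$\hat{y}_j \pm t_{1-\alpha/2,\ n-k-1}\ s_{y.12}\sqrt{\frac{1}{n} + c_{11} x_{1j}^2 + c_{22} x_{2j}^2 + 2 c_{12} x_{1j} x_{2j}}$$
Because the $1/n$ term appears separately, $x_{1j}$ and $x_{2j}$ in this formula are measured as deviations from their sample means, and $c_{11}, c_{22}, c_{12}$ are the elements of the inverse of the corresponding matrix of corrected sums of squares and cross products.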
9.6 Qualitative Independent Variables
• Variables
Quantitative (continuous): score, age
Qualitative (categorical): sex, race, job
• A qualitative variable enters the model through dummy variables (variables that take only the values 0 and 1). A qualitative variable with k categories needs k-1 dummy variables.
Examples of dummy variables
• Sex (male, female)
$x_1 = 1$ if male, 0 otherwise
• Residential area (urban, rural, suburban)
$x_2 = 1$ if urban, 0 otherwise
$x_3 = 1$ if rural, 0 otherwise
• Smoking status (current smoker, ex-smoker (<=5 years), ex-smoker (>5 years), non-smoker)
$x_4 = 1$ if smoker, 0 otherwise
$x_5 = 1$ if ex-smoker (<=5 years), 0 otherwise
$x_6 = 1$ if ex-smoker (>5 years), 0 otherwise
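The k-1 coding rule can be sketched as a small function that maps one category to its indicator values. The function name `dummy_code` and the choice of the non-smoker group as the reference category are assumptions made for illustration.

```python
def dummy_code(value, categories, reference):
    """Return k-1 indicator values (0/1) for one observation, omitting the reference category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories if c != reference]


smoking_levels = ["smoker", "ex-smoker (<=5 years)", "ex-smoker (>5 years)", "non-smoker"]

# Four categories give three dummy variables; the reference group is coded all zeros.
for status in smoking_levels:
    print(status, dummy_code(status, smoking_levels, reference="non-smoker"))
```

The reference category is the one every coefficient is compared against, so it is usually chosen to be a natural baseline group.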
Example 9.6.1
Data columns: case #, birth weight, gestation (weeks), smoking status of the mother
$Y$: birth weight (grams)
$x_1$: gestation (weeks)
$x_2$: smoking status of the mother, $x_2 = 1$ if smoker (S), 0 if nonsmoker (N)
Model 1: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
for nonsmoker: $E(Y) = \beta_0 + \beta_1 x_1$
for smoker: $E(Y) = (\beta_0 + \beta_2) + \beta_1 x_1$
Same slope, different intercept
* $\beta_2 = E(Y \mid X_1 = x_1, \text{smoker}) - E(Y \mid X_1 = x_1, \text{non-smoker})$
The expected difference in birth weight between babies of smoking mothers and babies of nonsmoking mothers at the same gestation, that is, for a given value of $x_1$. It is estimated by $\hat{\beta}_2$.
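$H_0: \beta_2 = 0$
$\hat{\beta}_2 = b_2 \approx -245$ grams
$T = \dfrac{b_2}{se(b_2)} = -5.83$; since $|T| = 5.83 > 2.0452$, reject $H_0$: significantly different.
* 95% confidence interval (CI) for $\beta_2$: $(-330.3975,\ -158.6825)$
[Figure: Model 1. Parallel fitted lines with common slope $\beta_1$; intercept $\beta_0$ for non-smokers and $\beta_0 + \beta_2$ for smokers.]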
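The figures quoted in this example can be cross-checked against each other, since a symmetric confidence interval is the estimate plus or minus the critical value times the standard error. The sketch below assumes that both confidence limits are negative (the minus signs appear to have been lost in the slide text) and that 2.0452 is the two-sided 5% critical value used for the interval.

```python
# Values quoted in the example; the negative signs on the limits are an assumption.
lower, upper = -330.3975, -158.6825
t_crit = 2.0452

estimate = (lower + upper) / 2        # the interval is centered on b2
half_width = (upper - lower) / 2      # equals t_crit * se(b2)
se = half_width / t_crit
t_stat = estimate / se

print(f"b2 = {estimate:.2f} g, se(b2) = {se:.2f}, T = {t_stat:.2f}")
```

The recovered estimate and test statistic agree with the rounded values of about -245 grams and -5.83 given in the example.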
Model 2: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
for nonsmoker: $E(Y) = \beta_0 + \beta_1 x_1$
for smoker: $E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1$
Different slope, different intercept
• If $\beta_3$ is significant -> the slopes differ between smokers and nonsmokers
• If $\beta_2$ is significant -> the intercepts differ -> not important without centering
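[Figure: Model 2. Birth weight (vertical axis) against gestation (horizontal axis), with separate lines for nonsmokers (intercept $\beta_0$, slope $\beta_1$) and smokers (intercept $\beta_0 + \beta_2$, slope $\beta_1 + \beta_3$); 38 weeks is marked on the gestation axis.]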
• Centering
Let $x_1^* = x_1 - 38$ (weeks). Then
$$E(Y) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2 + \beta_3 x_1^* x_2$$
for nonsmoker: $E(Y) = \beta_0 + \beta_1 x_1^*$
for smoker: $E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1^*$
$\beta_0 = E(Y \mid x_1 = 38)$ for nonsmokers: the expected value of $Y$ when $x_1 = 38$ (a meaningful parameter of interest).
$\beta_2 = E(Y \mid x_1 = 38, \text{smoker}) - E(Y \mid x_1 = 38, \text{non-smoker})$: the difference in expected value between smokers and nonsmokers at $x_1 = 38$, not at $x_1 = 0$.
* Lesson: after centering a continuous variable, the intercept is the expected value at a chosen value of $x$ rather than at $x = 0$, so it becomes more meaningful.
* Another effect of centering: it weakens the multicollinearity among the $x$ variables.
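Centering only reparameterizes the model, which can be verified by substituting $x_1 = x_1^* + 38$. In the sketch below the raw-scale coefficients `b0` to `b3` are invented for illustration and are not estimates from the birth-weight data.

```python
# Hypothetical raw-scale coefficients for E(Y) = b0 + b1*x1 + b2*x2 + b3*x1*x2
# (x1 = gestation in weeks, x2 = 1 for smokers, 0 for nonsmokers).
b0, b1, b2, b3 = -2000.0, 140.0, 300.0, -15.0
CENTER = 38  # weeks

# Coefficients of the centered model, obtained by substituting x1 = x1c + CENTER.
c0 = b0 + b1 * CENTER   # E(Y | x1 = 38, nonsmoker)
c1 = b1                 # slope for nonsmokers is unchanged
c2 = b2 + b3 * CENTER   # smoker minus nonsmoker difference at 38 weeks
c3 = b3                 # difference in slopes is unchanged


def mean_raw(x1, x2):
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def mean_centered(x1, x2):
    x1c = x1 - CENTER
    return c0 + c1 * x1c + c2 * x2 + c3 * x1c * x2


print("intercept after centering:", c0)
print("smoker effect at 38 weeks:", c2)
print("same fitted mean at 40 weeks, smoker:", mean_raw(40, 1), mean_centered(40, 1))
```

Both parameterizations give identical fitted means; only the meaning of the intercept and of the smoking coefficient changes.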
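• Example 9.6.2
Data columns: effect, age, trt
Model for Example 9.6.2
$Y$: treatment effect
$X_1$: age (quantitative)
Treatment method (A, B, C) (qualitative):
$X_2 = 1$ if trt = A, 0 otherwise
$X_3 = 1$ if trt = B, 0 otherwise
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + e$$
for trt = A: $E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1$
for trt = B: $E(Y) = (\beta_0 + \beta_3) + (\beta_1 + \beta_5) x_1$
for trt = C: $E(Y) = \beta_0 + \beta_1 x_1$
$\beta_0, \beta_1$: intercept and slope for the reference cell C
$\beta_2$: difference of intercepts (A - C), $= 0$?
$\beta_3$: difference of intercepts (B - C), $= 0$?
$\beta_4$: difference of slopes (A - C), $= 0$?
$\beta_5$: difference of slopes (B - C), $= 0$?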
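How the six coefficients combine into one line per treatment can be sketched as follows. The coefficient values in `b` are invented for illustration and are not the estimates produced by the SAS program; the function name `line_for` is likewise hypothetical.

```python
# Hypothetical values of (beta0, ..., beta5), for illustration only.
b = [10.0, 1.0, 40.0, 20.0, -0.5, -0.25]


def line_for(treatment):
    """Return (intercept, slope) of E(Y) against age for treatment 'A', 'B' or 'C'."""
    x2 = 1 if treatment == "A" else 0   # dummy for treatment A
    x3 = 1 if treatment == "B" else 0   # dummy for treatment B (C is the reference)
    intercept = b[0] + b[2] * x2 + b[3] * x3
    slope = b[1] + b[4] * x2 + b[5] * x3
    return intercept, slope


for trt in "ABC":
    intercept, slope = line_for(trt)
    print(f"trt {trt}: E(Y) = {intercept} + {slope} * age")
```

Treatment C returns the bare `beta0` and `beta1`, which is why the other four coefficients read as differences from C.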
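Example 9.6.2: mreg.sas
/* File mreg.sas
   Multiple regression for Table 9.6.3 */
data reg;
  input effect age method $;
  x1 = age;               /* quantitative predictor */
  x2 = (method = 'A');    /* 1 if treatment A, 0 otherwise */
  x3 = (method = 'B');    /* 1 if treatment B, 0 otherwise; C is the reference */
  x12 = x1 * x2;          /* age-by-treatment interaction terms */
  x13 = x1 * x3;
cards;
56 21 A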
41 23 B
40 30 B
28 19 C
55 28 A
25 23 C
46 33 B
71 67 C
48 42 B
63 33 A
52 33 A
62 56 C
50 45 C
45 43 B
58 38 A
46 37 C
58 43 B
34 27 C
65 43 A
55 45 B
57 48 B
59 47 C
64 48 A
61 53 A
62 58 B
36 29 C
69 53 A
47 29 B
73 58 A
64 66 B
60 67 B
62 63 A
71 59 C
62 51 C
70 67 A
71 63 C
;
run;
proc reg data=reg;
  model effect = x1 x2 x3 x12 x13;
  output out=d p=pred;
  id age method;
run;
proc sort data=d; by method; run;
proc gplot data=d;
  plot effect*age=method / legend;
  symbol1 v='A' i=r c=c2 l=1;
  symbol2 v='B' i=r c=c2 l=2;
  symbol3 v='C' i=r c=c2 l=3;
run;
Homework
• Exercise 9.6.3
9.7 Multiple Correlation Model
$$y_j = \beta_0 + \beta_1 x_{1j} + \cdots + \beta_k x_{kj} + e_j$$
• Both the $x_i$ and $y$ are random variables
• The $x_i$ and $y$ follow a multivariate normal distribution
• The degree of correlation between $Y$ and the $X_i$ is measured by the multiple correlation coefficient
Example 9.7.1
Data columns: serum cholesterol, weight, SBP
$Y$: serum cholesterol, $X_1$: weight, $X_2$: SBP
Multiple correlation coefficient
$$R^2_{y.12} = \frac{SSR}{SST} = \frac{817.876}{1{,}099.669} = .7437$$
$$R_{y.12} = \sqrt{.7437} = .86$$
$$H_0: \rho_{y.12} = 0$$
$$F = \frac{R^2_{y.12\ldots k}}{1 - R^2_{y.12\ldots k}} \cdot \frac{n-k-1}{k} \sim F(k,\ n-k-1)$$
$F = 11.61 > 11.04$, so $p < 0.005$
=> There is a significant linear association between serum cholesterol and (SBP and weight).
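The numbers in this example can be reproduced with a few lines of arithmetic. The sketch takes SSR = 817.876 and SST = 1,099.669 as read from the example, and assumes n = 11 (the number of data rows in mcorr.sas) and k = 2.

```python
import math

ssr, sst = 817.876, 1099.669   # sums of squares as read from the example
n, k = 11, 2                   # assumed: 11 observations, 2 predictors

r2 = ssr / sst                               # multiple coefficient of determination
r = math.sqrt(r2)                            # multiple correlation coefficient
f = (r2 / (1 - r2)) * ((n - k - 1) / k)      # F statistic on (k, n-k-1) df

print(f"R^2 = {r2:.4f}, R = {r:.2f}, F = {f:.2f}")
```

The F value is then compared with the tabulated critical value quoted in the example (11.04).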
Example 9.7.1: mcorr.sas
/* file mcorr.sas
SAS example for Table 9.7.1
*/
data mcorr;
input chol weight sbp;
cards;
162.2 51.0 108
158.0 52.9 111
157.0 56.0 115
155.0 56.5 116
156.0 58.0 117
154.1 60.1 120
169.1 58.0 124
181.0 61.0 127
174.9 59.4 122
180.2 56.1 121
174.0 61.2 125
;run;
proc plot;
plot chol*weight chol*sbp
sbp*weight;
run;
proc corr;
var chol weight sbp ;
run;
proc corr;
var chol weight ;
partial sbp ;
run;
proc corr;
var chol sbp ;
partial weight ;
run;
proc corr;
var weight sbp ;
partial chol ;
run;
Partial Correlation Coefficient
• Examines the relationship between two variables after controlling for the effects of the other variables
• Linear association after controlling for other covariates
e.g. $r_{y1.2}$: the partial correlation coefficient between $Y$ and $X_1$ when $X_2$ is held constant (fixed at the same value)
Test:
$$H_0: \rho_{y1.2\ldots k} = 0, \qquad t = r_{y1.2\ldots k}\sqrt{\frac{n-k-1}{1 - r^2_{y1.2\ldots k}}}$$
For the correlation between weight and SBP given serum cholesterol:
$$H_0: \rho_{12.y} = 0, \qquad t = .948\sqrt{\frac{11-2-1}{1 - .948^2}} = 8.425$$
∴ We may conclude that there is a significant linear association between SBP and weight when serum cholesterol is held constant (that is, after adjusting for the effect of cholesterol).
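The test statistic for the partial correlation can be recomputed from the quoted value .948 with n = 11 and k = 2. The helper `partial_corr` below uses the standard first-order formula that builds a partial correlation from three pairwise correlations; it is included as an illustration, and the pairwise values passed to it in the last line are invented rather than taken from the data.

```python
import math


def partial_corr(r_ab, r_ac, r_bc):
    """First-order partial correlation between a and b, controlling for c."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


# t statistic from the example: r_{12.y} = .948, n = 11, k = 2.
r_partial = 0.948
n, k = 11, 2
t = r_partial * math.sqrt((n - k - 1) / (1 - r_partial ** 2))
print(f"t = {t:.3f}")

# Illustrative call with invented pairwise correlations.
print(round(partial_corr(0.60, 0.30, 0.40), 4))
```

In the SAS program the same quantity is requested with the `partial` statement of `proc corr`.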
Homework
• Exercise 9.7.1
Pairwise plots in R
## pair.r
## put histograms on the diagonal
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5))
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
## put (absolute) correlations on the upper panels,
## with size proportional to the correlations.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste(prefix, txt, sep = "")
    if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
?swiss
summary(swiss)
cor(swiss)
pairs(swiss)
pairs(swiss, lower.panel=panel.smooth, upper.panel=panel.cor)
pairs(swiss, lower.panel=panel.smooth, upper.panel=panel.cor, diag.panel=panel.hist)
9.8 Variable (Model) Selection
• Forward selection
• Backward elimination
• Stepwise selection
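To make the idea of forward selection concrete, the sketch below adds one variable at a time, choosing the candidate that reduces the residual sum of squares most and stopping when its partial F falls below a threshold. It is a simplified illustration, not the SAS procedure: the `f_enter` threshold of 4.0 is an arbitrary choice, the data are invented, and the helper names are hypothetical. SAS's `selection=stepwise` works with significance levels for entry and for staying, and can also remove variables that were entered earlier.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def sse(columns, y):
    """Residual sum of squares from regressing y on an intercept plus the given columns."""
    X = [[1.0] + [col[j] for col in columns] for j in range(len(y))]
    p = len(X[0])
    xtx = [[sum(r[i] * r[q] for r in X) for q in range(p)] for i in range(p)]
    xty = [sum(r[i] * yj for r, yj in zip(X, y)) for i in range(p)]
    b = solve(xtx, xty)
    return sum((yj - sum(bi * xi for bi, xi in zip(b, r))) ** 2 for r, yj in zip(X, y))


def forward_select(data, y, f_enter=4.0):
    """Enter variables one at a time while the best candidate's partial F >= f_enter."""
    chosen, remaining = [], list(data)
    current = sse([], y)
    while remaining:
        trial = {name: sse([data[v] for v in chosen + [name]], y) for name in remaining}
        best = min(trial, key=trial.get)
        df = len(y) - len(chosen) - 2          # residual df after adding `best`
        if df <= 0 or trial[best] <= 0:
            break
        f = (current - trial[best]) / (trial[best] / df)
        if f < f_enter:
            break
        chosen.append(best)
        remaining.remove(best)
        current = trial[best]
    return chosen


# Invented data: y depends on x1 and x2 plus small noise; x3 is unrelated.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
    "x3": [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0],
}
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, -0.15, 0.05]
y = [2 * a + 3 * b + e for a, b, e in zip(data["x1"], data["x2"], noise)]

print(forward_select(data, y))
```

Backward elimination runs the same comparison in reverse, starting from the full model and dropping the variable with the smallest partial F.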
Mod18.sas
/* file : mod18.sas
   Multiple regression model with stepwise selection */
filename electric 'd:\myweb\int\electric.dat';
data peak;
  infile electric;
  input housize 1-3 income 6-11 aircapac 14-16 applindx 19-23
        family 26-28 peak 31-35;
  label housize  = 'House Size'
        income   = 'Family Income'
        aircapac = 'Air Conditioning Capacity'
        applindx = 'Appliance Index'
        family   = 'Number of Family Members'
        peak     = 'Peak Hour Electric Load';
run;
/* Stepwise selection */
proc reg data=peak;
  model peak = housize income aircapac applindx family
        / selection=stepwise;
  title 'Multiple Regression Model with stepwise selection';
run;
/* Best 2 models of each size by R-square, with Cp, adjusted R-square and MSE */
proc reg data=peak outest=est;
  model peak = housize income aircapac applindx family
        / selection=rsquare cp adjrsq mse best=2;
  title 'Multiple Regression Model with R-square selection';
run;
proc print data=est;
  title 'Actual Coefficients, etc.';
run;
/* Plot Cp (symbol C) and the number of parameters p (symbol *) against model size */
proc plot data=est;
  plot _cp_*_in_='C' _p_*_in_='*' / overlay
       vaxis=0 to 25 by 5 haxis=1 to 5
       hpos=40 vpos=30;
  title;
run;
Homework
• Review exercise 10, using SAS