
Computer exercise 3: solutions
Anna Lindgren
15 April 2019

Exercise 3: U.S. county demographic information

cdi <- read.delim("../data/CDI.txt")
cdi$region <- factor(cdi$region, levels = c(1, 2, 3, 4),
                     labels = c("Northeast", "Midwest", "South", "West"))
cdi$phys1000 <- 1000 * cdi$phys / cdi$popul
cdi$crm1000 <- 1000 * cdi$crimes / cdi$popul

(a)

As seen in Figure 1, the number of physicians per 1000 inhabitants is heavily right-skewed: it varies widely upwards but not downwards. Its logarithm is much more symmetric.

(b)

Fit all eight models, including the empty one with only the intercept.

### See the R-code for Lecture 5 where I did this:
mod.1 <- lm(log(phys1000) ~ 1, data = cdi)
mod.2 <- lm(log(phys1000) ~ percapitaincome, data = cdi)
mod.3 <- lm(log(phys1000) ~ crm1000, data = cdi)
mod.4 <- lm(log(phys1000) ~ pop65plus, data = cdi)
mod.5 <- lm(log(phys1000) ~ percapitaincome + crm1000, data = cdi)
mod.6 <- lm(log(phys1000) ~ percapitaincome + pop65plus, data = cdi)
mod.7 <- lm(log(phys1000) ~ crm1000 + pop65plus, data = cdi)
mod.8 <- lm(log(phys1000) ~ percapitaincome + crm1000 + pop65plus, data = cdi)

sum.1 <- summary(mod.1)
sum.2 <- summary(mod.2)
sum.3 <- summary(mod.3)
sum.4 <- summary(mod.4)
sum.5 <- summary(mod.5)
sum.6 <- summary(mod.6)
sum.7 <- summary(mod.7)
sum.8 <- summary(mod.8)

sum.8
#>
#> Call:
#> lm(formula = log(phys1000) ~ percapitaincome + crm1000 + pop65plus,
#>     data = cdi)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -1.68142 -0.28720 -0.02991  0.28371  2.29373
#>
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)
#> (Intercept)     -1.262e+00  1.308e-01  -9.648  < 2e-16 ***
#> percapitaincome  6.484e-05  5.258e-06  12.333  < 2e-16 ***
#> crm1000          8.197e-03  7.826e-04  10.474  < 2e-16 ***
#> pop65plus        1.417e-02  5.340e-03   2.654  0.00825 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4457 on 436 degrees of freedom
#> Multiple R-squared:  0.3623, Adjusted R-squared:  0.3579
#> F-statistic: 82.55 on 3 and 436 DF,  p-value: < 2.2e-16

Note that all three covariates are significant, which is good.

Collect the adjusted R² and BIC values of all eight models:

crits <- data.frame(
  nr = seq(1, 8),
  name = c("intercept", "percapinc", "crm1000", "pop65",
           "inc_crm", "inc_65", "crm_65", "inc_crm_65"),
  p = c(0, 1, 1, 1, 2, 2, 2, 3),
  r2adj = c(sum.1$adj.r.squared, sum.2$adj.r.squared,
            sum.3$adj.r.squared, sum.4$adj.r.squared,
            sum.5$adj.r.squared, sum.6$adj.r.squared,
            sum.7$adj.r.squared, sum.8$adj.r.squared),
  bic = AIC(mod.1, mod.2, mod.3, mod.4, mod.5, mod.6, mod.7,
            mod.8, k = log(nrow(cdi)))[, 2])

crits
#>   nr       name p       r2adj      bic
#> 1  1  intercept 0 0.000000000 743.6230
#> 2  2  percapinc 1 0.194245926 653.6766
#> 3  3    crm1000 1 0.126177615 689.3600
#> 4  4      pop65 1 0.004744857 746.6137
#> 5  5    inc_crm 2 0.348991910 564.9248
#> 6  6     inc_65 2 0.198131208 656.6309
#> 7  7     crm_65 2 0.135851083 689.5430
#> 8  8 inc_crm_65 3 0.357872189 563.9603

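The model with all three explanatory variables is the best since it has the highest adjusted R² and the lowest BIC, although its BIC (563.96) is only marginally lower than that of the model with income and crime (564.92).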
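The bic column is computed with AIC() and the penalty k = log(n), which is the BIC. The following is a minimal sketch of that identity, run on R's built-in mtcars data rather than the CDI data; the model mpg ~ wt + hp and the names fit.0, fit.1 and tab are illustrative assumptions only.

```r
# Illustration only: built-in mtcars data, not the CDI data.
fit.0 <- lm(mpg ~ 1, data = mtcars)
fit.1 <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)

# AIC() with penalty k = log(n) gives the BIC.
bic.via.aic <- AIC(fit.1, k = log(n))
bic.direct <- BIC(fit.1)
print(all.equal(bic.via.aic, bic.direct))

# With several models, AIC() returns a data frame with columns "df" and "AIC",
# which is why the criterion is extracted as the second column with [, 2].
tab <- AIC(fit.0, fit.1, k = log(n))
print(names(tab))
print(tab[, 2])
```

Calling BIC(mod.1, ..., mod.8)[, 2] should therefore give the same column as the AIC() call with k = log(nrow(cdi)).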

(c)

v <- influence(mod.8)$hat
limit.v <- 2 * (3 + 1) / nrow(cdi)
cdi[v > 0.16, ]
#>   id county state area   popul pop1834 pop65plus phys beds crimes higrads
#> 6  6  Kings    NY   71 2300664    28.3      12.4 4861 8942 680966    63.7
#>   bachelors poors unemployed percapitaincome totalincome    region
#> 6      16.6  19.5        9.5           16803       38658 Northeast
#>   phys1000  crm1000
#> 6 2.112868 295.9867
I.Kings <- 6

As seen in Figure 2, several counties have a high leverage, in particular Kings County, New York. This is due to its unusually high crime rate, as seen in Figure 2(c).
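The limit 2(p + 1)/n used for limit.v is twice the average leverage, since the leverages of a model with p covariates and an intercept sum to p + 1. A minimal sketch of this, on the built-in mtcars data with an illustrative model (not the CDI model):

```r
# Illustration only: built-in mtcars data, not the CDI data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

v <- hatvalues(fit)               # same values as influence(fit)$hat
p <- length(coef(fit)) - 1        # number of covariates, here 2
limit.v <- 2 * (p + 1) / nrow(mtcars)

# The leverages sum to p + 1, so (p + 1) / n is the average leverage.
print(sum(v))

# Observations with more than twice the average leverage.
print(names(v)[v > limit.v])
```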


(d)

r <- rstudent(mod.8)
pred.8 <- predict(mod.8)
cdi[r == max(r), ]
#>      id  county state area  popul pop1834 pop65plus phys beds crimes
#> 418 418 Olmsted    MN  653 106470    29.3        10 1814 1437   4310
#>     higrads bachelors poors unemployed percapitaincome totalincome  region
#> 418      88      29.5   4.5        3.3           20515        2184 Midwest
#>     phys1000  crm1000
#> 418 17.03766 40.48089
I.Olmsted <- 418

As seen in Figure 3(a)-(d), most of the studentized residuals lie within ±2. They seem to have constant variance and no non-linear trends. However, there are two counties with large residuals: Kings County, NY has a large negative residual (fewer physicians than expected), while Olmsted County, Minnesota has a large positive residual (more physicians than expected).
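The studentized residuals from rstudent() scale each residual by the leave-one-out sigma estimate and the leverage, r_i = e_i / (s_(i) sqrt(1 - v_i)). A minimal sketch checking this by hand, on the built-in mtcars data with an illustrative model (an assumption, not the CDI model):

```r
# Illustration only: built-in mtcars data, not the CDI data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

e <- residuals(fit)            # ordinary residuals
h <- hatvalues(fit)            # leverages
s.i <- influence(fit)$sigma    # leave-one-out sigma estimates

# Studentized residual: residual divided by its leave-one-out standard error.
r.manual <- e / (s.i * sqrt(1 - h))
print(all.equal(unname(r.manual), unname(rstudent(fit))))

# Share of observations outside the +/- 2 reference lines.
print(mean(abs(rstudent(fit)) > 2))
```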

(e)

s.i <- influence(mod.8)$sigma

As seen in Figure 4, Olmsted, MN, which had a large residual, produced the largest decrease in s(i) when left out, with Kings County, NY as a close second.
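Each element of influence(...)$sigma is the residual standard error obtained when that observation is left out. A minimal sketch confirming this by an explicit refit, on the built-in mtcars data with an illustrative model; the choice i = 1 is arbitrary.

```r
# Illustration only: built-in mtcars data, not the CDI data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
s.i <- influence(fit)$sigma

# Refit without observation i and compare the residual standard errors.
i <- 1
fit.minus.i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
s.refit <- summary(fit.minus.i)$sigma
print(all.equal(unname(s.i[i]), s.refit))

# The observation whose removal lowers the sigma estimate the most.
print(names(which.min(s.i)))
```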

(f)

limit.cook <- c(1, 4 / nrow(cdi))
D <- cooks.distance(mod.8)

As seen in Figure 5, Kings County has had a large influence on the β-estimates. Olmsted has not had an alarming influence.
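Cook's distance combines the residual and the leverage of each observation, D_i = e_i² v_i / ((p + 1) s² (1 - v_i)²). A minimal sketch checking this against cooks.distance(), on the built-in mtcars data with an illustrative model (an assumption, not the CDI model):

```r
# Illustration only: built-in mtcars data, not the CDI data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

e <- residuals(fit)           # ordinary residuals
h <- hatvalues(fit)           # leverages
s <- summary(fit)$sigma       # residual standard error of the full fit
k <- length(coef(fit))        # number of parameters, p + 1

# Cook's distance computed by hand.
D.manual <- e^2 * h / (k * s^2 * (1 - h)^2)
print(all.equal(unname(D.manual), unname(cooks.distance(fit))))

# The two reference lines used in the exercise: 1 and 4 / n.
limit.cook <- c(1, 4 / nrow(mtcars))
print(names(D.manual)[D.manual > limit.cook[2]])
```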

(g)

dfb <- dfbetas(mod.8)
summary(dfb)
#>   (Intercept)          percapitaincome        crm1000
#>  Min.   :-0.2327468   Min.   :-1.841e-01   Min.   :-1.9517594
#>  1st Qu.:-0.0169952   1st Qu.:-1.502e-02   1st Qu.:-0.0130113
#>  Median :-0.0013843   Median : 2.287e-04   Median : 0.0005067
#>  Mean   : 0.0004313   Mean   :-4.916e-05   Mean   :-0.0008209
#>  3rd Qu.: 0.0098904   3rd Qu.: 1.441e-02   3rd Qu.: 0.0165190
#>  Max.   : 0.7465022   Max.   : 1.862e-01   Max.   : 0.3863246
#>    pop65plus
#>  Min.   :-0.3306552
#>  1st Qu.:-0.0099315
#>  Median : 0.0011489
#>  Mean   :-0.0002691
#>  3rd Qu.: 0.0158057
#>  Max.   : 0.2165531
limit.dfb <- c(-1, -2 / sqrt(nrow(cdi)), 0, 2 / sqrt(nrow(cdi)), 1)

As seen in Figure 6, Kings County has had a huge influence on the β-estimate for crm1000 (c), as might be expected, and a quite large influence on the intercept (a). Olmsted has not had a huge influence on any of the parameters.
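The size of Kings County's dfbetas for crm1000 can be roughly reproduced from the two printed summaries of mod.8: the one for all counties, and the one in the "(h) plots" section where Kings County is left out. The sketch below uses the standard definition, the change in the estimate divided by s(i) times the square root of the matching diagonal element of (X'X)^-1. It assumes that the minimum of -1.95 in the summary of dfb belongs to Kings County, and it uses the rounded printed values, so the agreement is only approximate.

```r
# Values copied (rounded as printed) from the two summaries of mod.8.
b.full  <- 8.197e-03   # crm1000 estimate, all counties
se.full <- 7.826e-04   # its standard error, all counties
s.full  <- 0.4457      # residual standard error, all counties
b.loo   <- 9.696e-03   # crm1000 estimate, Kings County left out
s.loo   <- 0.4373      # residual standard error, Kings County left out

# se.full / s.full recovers the square root of the crm1000 diagonal
# element of (X'X)^-1 from the full fit.
xtx.inv.sqrt <- se.full / s.full

# dfbetas: change in the estimate, scaled by the leave-one-out sigma.
dfbetas.crm <- (b.full - b.loo) / (s.loo * xtx.inv.sqrt)
print(round(dfbetas.crm, 2))   # close to the minimum of -1.95 in summary(dfb)
```

The negative sign reflects that the crm1000 estimate increases when Kings County is removed.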

(h)

When Kings County is taken out, the largest model is still the best, and no county has a huge leverage or a large influence on the parameter estimates. However, Olmsted still has a large residual.
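Removing row 6 shifts the row position of every later county down by one, which is why the "(h) plots" code sets I.Olmsted to 417 instead of 418. A minimal sketch of looking the position up by id rather than hard-coding it; the toy data frame and the names toy, toy.reduced, i.drop and i.new are hypothetical.

```r
# Hypothetical toy data standing in for the county table.
toy <- data.frame(id = 1:5, county = c("A", "B", "C", "D", "E"))

i.drop <- 2                    # row to remove (Kings County is row 6 in cdi)
id.keep <- 4                   # id of the row to keep track of

toy.reduced <- toy[-i.drop, ]

# Row position of the tracked id in the reduced data: one less than before.
i.new <- which(toy.reduced$id == id.keep)
print(i.new)
```

Applied to the exercise, which(cdi$id == 418) evaluated after the removal would be expected to give 417.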

Plots

# Figure 1.
figcaption = "Physicians or log physicians. Kings County, NY in red, Olmsted, MN in blue."
par(mfrow = c(3, 2))

with(cdi, plot(phys1000 ~ percapitaincome,
               main = "(a) Physicians vs per capita income"))
with(cdi[I.Kings, ], points(percapitaincome, phys1000, col = "red", pch = 19))
with(cdi[I.Olmsted, ], points(percapitaincome, phys1000, col = "blue", pch = 8))

with(cdi, plot(phys1000 ~ percapitaincome, log = "y",
               main = "(b) log Physicians vs per capita income"))
with(cdi[I.Kings, ], points(percapitaincome, phys1000, col = "red", pch = 19))
with(cdi[I.Olmsted, ], points(percapitaincome, phys1000, col = "blue", pch = 8))

with(cdi, plot(phys1000 ~ crm1000,
               main = "(c) Physicians vs crime"))
with(cdi[I.Kings, ], points(crm1000, phys1000, col = "red", pch = 19))
with(cdi[I.Olmsted, ], points(crm1000, phys1000, col = "blue", pch = 8))

with(cdi, plot(phys1000 ~ crm1000, log = "y",
               main = "(d) log Physicians vs crime"))
with(cdi[I.Kings, ], points(crm1000, phys1000, col = "red", pch = 19))
with(cdi[I.Olmsted, ], points(crm1000, phys1000, col = "blue", pch = 8))

with(cdi, plot(phys1000 ~ pop65plus,
               main = "(e) Physicians vs 65+ population"))
with(cdi[I.Kings, ], points(pop65plus, phys1000, col = "red", pch = 19))
with(cdi[I.Olmsted, ], points(pop65plus, phys1000, col = "blue", pch = 8))

with(cdi, plot(phys1000 ~ pop65plus, log = "y",
               main = "(f) log Physicians vs 65+ population"))
with(cdi[I.Kings, ], points(pop65plus, phys1000, col = "red", pch = 19))
with(cdi[I.Olmsted, ], points(pop65plus, phys1000, col = "blue", pch = 8))

# Figure 2.
figcaption <- "Leverage of the full model. Kings County, NY in red, Olmsted, MN in blue."
par(mfrow = c(2, 2))

with(cdi, plot(id, v, main = "(a) Leverage against id"))
points(I.Kings, v[I.Kings], col = "red", pch = 19)
points(I.Olmsted, v[I.Olmsted], col = "blue", pch = 8)
abline(h = limit.v)

with(cdi, plot(v ~ percapitaincome,
               main = "(b) Leverage against per capita income"))
with(cdi, points(percapitaincome[I.Kings], v[I.Kings], col = "red", pch = 19))
with(cdi, points(percapitaincome[I.Olmsted], v[I.Olmsted], col = "blue", pch = 8))


Figure 1: Physicians or log physicians. Kings County, NY in red, Olmsted, MN in blue.
[Plot omitted. Six panels with phys1000 on the y-axis: (a) Physicians vs per capita income, (b) log Physicians vs per capita income, (c) Physicians vs crime, (d) log Physicians vs crime, (e) Physicians vs 65+ population, (f) log Physicians vs 65+ population. The x-axes are percapitaincome, crm1000 and pop65plus.]


abline(h = limit.v)

with(cdi, plot(v ~ crm1000, main = "(c) Leverage against crime per 1000"))
with(cdi, points(crm1000[I.Kings], v[I.Kings], col = "red", pch = 19))
with(cdi, points(crm1000[I.Olmsted], v[I.Olmsted], col = "blue", pch = 8))
abline(h = limit.v)

with(cdi, plot(v ~ pop65plus, main = "(d) Leverage against population 65+"))
with(cdi, points(pop65plus[I.Kings], v[I.Kings], col = "red", pch = 19))
with(cdi, points(pop65plus[I.Olmsted], v[I.Olmsted], col = "blue", pch = 8))
abline(h = limit.v)

# Figure 3.
figcaption = "Studentized residuals. Kings County, NY in red, Olmsted, MN in blue"
par(mfrow = c(2, 2))

plot(r ~ pred.8, main = "(a) Studentized residuals against predicted values")
points(pred.8[I.Kings], r[I.Kings], col = "red", pch = 19)
points(pred.8[I.Olmsted], r[I.Olmsted], col = "blue", pch = 8)
abline(h = c(-2, 0, 2), lty = 2)

with(cdi,
     plot(r ~ percapitaincome,
          main = "(b) Studentized residuals against per capita income"))
with(cdi,
     points(percapitaincome[I.Kings], r[I.Kings], col = "red", pch = 19))
with(cdi,
     points(percapitaincome[I.Olmsted], r[I.Olmsted], col = "blue", pch = 8))
abline(h = c(-2, 0, 2), lty = 2)

with(cdi,
     plot(r ~ crm1000,
          main = "(c) Studentized residuals against crime per 1000"))
with(cdi,
     points(crm1000[I.Kings], r[I.Kings], col = "red", pch = 19))
with(cdi,
     points(crm1000[I.Olmsted], r[I.Olmsted], col = "blue", pch = 8))
abline(h = c(-2, 0, 2), lty = 2)

with(cdi,
     plot(r ~ pop65plus,
          main = "(d) Studentized residuals against population 65+"))
with(cdi,
     points(pop65plus[I.Kings], r[I.Kings], col = "red", pch = 19))
with(cdi,
     points(pop65plus[I.Olmsted], r[I.Olmsted], col = "blue", pch = 8))
abline(h = c(-2, 0, 2), lty = 2)

# Figure 4.
figcaption <- "Observations' effect on the sigma estimate. Kings county, NY in red, Olmsted, MN in blue"

with(cdi, plot(s.i ~ id, main = "Leave-one-out sigma-estimates"))
with(cdi, points(id[I.Kings], s.i[I.Kings], col = "red", pch = 19))
with(cdi, points(id[I.Olmsted], s.i[I.Olmsted], col = "blue", pch = 8))

# Figure 5
figcaption = "Cook's distance. Kings county, NY in red, Olmsted, MN in blue"


Figure 2: Leverage of the full model. Kings County, NY in red, Olmsted, MN in blue.
[Plot omitted. Four panels with the leverage v on the y-axis: (a) Leverage against id, (b) Leverage against per capita income, (c) Leverage against crime per 1000, (d) Leverage against population 65+.]


Figure 3: Studentized residuals. Kings County, NY in red, Olmsted, MN in blue
[Plot omitted. Four panels with the studentized residuals r on the y-axis: (a) against predicted values (pred.8), (b) against per capita income, (c) against crime per 1000, (d) against population 65+.]


Figure 4: Observations' effect on the sigma estimate. Kings county, NY in red, Olmsted, MN in blue
[Plot omitted. One panel, "Leave-one-out sigma-estimates", with s.i on the y-axis against id.]

par(mfrow = c(2, 2))

with(cdi, plot(D ~ percapitaincome, ylim = c(0, 1),
               main = "(a) Cook's distance against per capita income"))
with(cdi, points(percapitaincome[I.Kings], D[I.Kings], col = "red", pch = 19))
with(cdi, points(percapitaincome[I.Olmsted], D[I.Olmsted], col = "blue", pch = 8))
abline(h = limit.cook)

with(cdi, plot(D ~ crm1000, ylim = c(0, 1),
               main = "(b) Cook's distance against crime per 1000"))
with(cdi, points(crm1000[I.Kings], D[I.Kings], col = "red", pch = 19))
with(cdi, points(crm1000[I.Olmsted], D[I.Olmsted], col = "blue", pch = 8))
abline(h = limit.cook)

with(cdi, plot(D ~ pop65plus, ylim = c(0, 1),
               main = "(c) Cook's distance against population 65+"))
with(cdi, points(pop65plus[I.Kings], D[I.Kings], col = "red", pch = 19))
with(cdi, points(pop65plus[I.Olmsted], D[I.Olmsted], col = "blue", pch = 8))
abline(h = limit.cook)

# Figure 6
figcaption = "DFbetas. Kings county, NY in red, Olmsted, MN in blue"
par(mfrow = c(2, 2))

with(cdi, plot(dfb[, 1] ~ id, main = "(a) dfbeta for the intercept",
               ylim = c(-1, 1), ylab = "dfbeta_0"))
with(cdi, points(id[I.Kings], dfb[I.Kings, 1], pch = 19, col = "red"))
with(cdi, points(id[I.Olmsted], dfb[I.Olmsted, 1], pch = 8, col = "blue"))
abline(h = limit.dfb)

with(cdi, plot(dfb[, 2] ~ id, main = "(b) dfbeta for per capita income",
               ylim = c(-1, 1), ylab = "dfbeta_percapitaincome"))
with(cdi, points(id[I.Kings], dfb[I.Kings, 2], pch = 19, col = "red"))
with(cdi, points(id[I.Olmsted], dfb[I.Olmsted, 2], pch = 8, col = "blue"))


Figure 5: Cook's distance. Kings county, NY in red, Olmsted, MN in blue
[Plot omitted. Three panels with Cook's distance D on the y-axis: (a) against per capita income, (b) against crime per 1000, (c) against population 65+.]

abline(h = limit.dfb)

with(cdi, plot(dfb[, 3] ~ id, main = "(c) dfbeta for crime per 1000",
               ylim = c(-2, 1), ylab = "dfbeta_crm1000"))
with(cdi, points(id[I.Kings], dfb[I.Kings, 3], pch = 19, col = "red"))
with(cdi, points(id[I.Olmsted], dfb[I.Olmsted, 3], pch = 8, col = "blue"))
abline(h = limit.dfb)

with(cdi, plot(dfb[, 4] ~ id, main = "(d) dfbeta for population 65+",
               ylim = c(-1, 1), ylab = "dfbeta_pop65plus"))
with(cdi, points(id[I.Kings], dfb[I.Kings, 4], pch = 19, col = "red"))
with(cdi, points(id[I.Olmsted], dfb[I.Olmsted, 4], pch = 8, col = "blue"))
abline(h = limit.dfb)

(h) plots

cdi <- cdi[-I.Kings, ]
I.Olmsted <- 417

### (a) ###
par(mfrow = c(3, 2))
with(cdi, plot(phys1000 ~ percapitaincome,
               main = "(a) Physicians vs per capita income"))
with(cdi[I.Olmsted, ], points(percapitaincome, phys1000, col = "blue", pch = 8))

with(cdi, plot(phys1000 ~ percapitaincome, log = "y",
               main = "(b) log Physicians vs per capita income"))


Figure 6: DFbetas. Kings county, NY in red, Olmsted, MN in blue
[Plot omitted. Four panels against id: (a) dfbeta for the intercept (dfbeta_0), (b) dfbeta for per capita income (dfbeta_percapitaincome), (c) dfbeta for crime per 1000 (dfbeta_crm1000), (d) dfbeta for population 65+ (dfbeta_pop65plus).]


with(cdi[I.Olmsted, ], points(percapitaincome, phys1000, col = "blue", pch = 8))

with(cdi, plot(phys1000 ~ crm1000,
               main = "(c) Physicians vs crime"))
with(cdi[I.Olmsted, ], points(crm1000, phys1000, col = "blue", pch = 8))

with(cdi, plot(phys1000 ~ crm1000, log = "y",
               main = "(d) log Physicians vs crime"))
with(cdi[I.Olmsted, ], points(crm1000, phys1000, col = "blue", pch = 8))

with(cdi, plot(phys1000 ~ pop65plus,
               main = "(e) Physicians vs 65+ population"))
with(cdi[I.Olmsted, ], points(pop65plus, phys1000, col = "blue", pch = 8))

with(cdi, plot(phys1000 ~ pop65plus, log = "y",
               main = "(f) log Physicians vs 65+ population"))
with(cdi[I.Olmsted, ], points(pop65plus, phys1000, col = "blue", pch = 8))

### (b) ###
mod.1 <- lm(log(phys1000) ~ 1, data = cdi)
mod.2 <- lm(log(phys1000) ~ percapitaincome, data = cdi)
mod.3 <- lm(log(phys1000) ~ crm1000, data = cdi)
mod.4 <- lm(log(phys1000) ~ pop65plus, data = cdi)
mod.5 <- lm(log(phys1000) ~ percapitaincome + crm1000, data = cdi)
mod.6 <- lm(log(phys1000) ~ percapitaincome + pop65plus, data = cdi)
mod.7 <- lm(log(phys1000) ~ crm1000 + pop65plus, data = cdi)
mod.8 <- lm(log(phys1000) ~ percapitaincome + crm1000 + pop65plus, data = cdi)
sum.1 <- summary(mod.1)
sum.2 <- summary(mod.2)
sum.3 <- summary(mod.3)
sum.4 <- summary(mod.4)
sum.5 <- summary(mod.5)
sum.6 <- summary(mod.6)
sum.7 <- summary(mod.7)
sum.8 <- summary(mod.8)
sum.8
#>
#> Call:
#> lm(formula = log(phys1000) ~ percapitaincome + crm1000 + pop65plus,
#>     data = cdi)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -1.04707 -0.27211 -0.03999  0.26980  2.31530
#>
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)
#> (Intercept)     -1.358e+00  1.303e-01 -10.420  < 2e-16 ***
#> percapitaincome  6.515e-05  5.159e-06  12.627  < 2e-16 ***
#> crm1000          9.696e-03  8.453e-04  11.470  < 2e-16 ***
#> pop65plus        1.492e-02  5.242e-03   2.846  0.00464 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4373 on 435 degrees of freedom
#> Multiple R-squared:  0.3874, Adjusted R-squared:  0.3832
#> F-statistic: 91.71 on 3 and 435 DF,  p-value: < 2.2e-16


Figure 7: Data without Kings, NY
[Plot omitted. Six panels with phys1000 on the y-axis: (a) Physicians vs per capita income, (b) log Physicians vs per capita income, (c) Physicians vs crime, (d) log Physicians vs crime, (e) Physicians vs 65+ population, (f) log Physicians vs 65+ population.]


crits <- data.frame(
  nr = c(1:8),
  name = c("intercept", "percapinc", "crm1000", "pop65",
           "inc_crm", "inc_65", "crm_65", "inc_crm_65"),
  p = c(0, 1, 1, 1, 2, 2, 2, 3),
  r2adj = c(sum.1$adj.r.squared, sum.2$adj.r.squared,
            sum.3$adj.r.squared, sum.4$adj.r.squared,
            sum.5$adj.r.squared, sum.6$adj.r.squared,
            sum.7$adj.r.squared, sum.8$adj.r.squared),
  bic = AIC(mod.1, mod.2, mod.3, mod.4, mod.5, mod.6, mod.7,
            mod.8, k = log(nrow(cdi)))[, 2])

crits
#>   nr       name p       r2adj      bic
#> 1  1  intercept 0 0.000000000 742.8673
#> 2  2  percapinc 1 0.194625169 652.9279
#> 3  3    crm1000 1 0.148216097 677.5229
#> 4  4      pop65 1 0.004734591 745.8649
#> 5  5    inc_crm 2 0.373171824 547.9777
#> 6  6     inc_65 2 0.198497249 655.8909
#> 7  7     crm_65 2 0.159063330 676.9752
#> 8  8 inc_crm_65 3 0.383211552 545.9660

### (c) ###
par(mfrow = c(2, 2))
v <- influence(mod.8)$hat
limit.v <- 2 * (3 + 1) / nrow(cdi)
with(cdi, plot(v ~ id, main = "(a) Leverage against id"))
points(cdi$id[I.Olmsted], v[I.Olmsted], col = "blue", pch = 8)
abline(h = limit.v)

with(cdi, plot(v ~ percapitaincome, main = "(b) Leverage against per capita income"))
with(cdi, points(percapitaincome[I.Olmsted], v[I.Olmsted], col = "blue", pch = 8))
abline(h = limit.v)

with(cdi, plot(v ~ crm1000, main = "(c) Leverage against crime per 1000"))
with(cdi, points(crm1000[I.Olmsted], v[I.Olmsted], col = "blue", pch = 8))
abline(h = limit.v)

with(cdi, plot(v ~ pop65plus, main = "(d) Leverage against population 65+"))
with(cdi, points(pop65plus[I.Olmsted], v[I.Olmsted], col = "blue", pch = 8))
abline(h = limit.v)

### (d) ###
par(mfrow = c(2, 2))
r <- rstudent(mod.8)
pred.8 <- predict(mod.8)

plot(r ~ pred.8, main = "(a) Studentized residuals against predicted values")
with(cdi, points(pred.8[I.Olmsted], r[I.Olmsted], col = "blue", pch = 8))
abline(h = c(-2, 0, 2), lty = 2)

with(cdi, plot(r ~ percapitaincome,
               main = "(b) Studentized residuals against per capita income"))
with(cdi, points(percapitaincome[I.Olmsted], r[I.Olmsted], col = "blue", pch = 8))
abline(h = c(-2, 0, 2), lty = 2)

with(cdi, plot(r ~ crm1000,
               main = "(c) Studentized residuals against crime per 1000"))
with(cdi, points(crm1000[I.Olmsted], r[I.Olmsted], col = "blue", pch = 8))


Figure 8: Leverage without Kings, NY
[Plot omitted. Four panels with the leverage v on the y-axis: (a) Leverage against id, (b) Leverage against per capita income, (c) Leverage against crime per 1000, (d) Leverage against population 65+.]


abline(h = c(-2, 0, 2), lty = 2)

with(cdi, plot(r ~ pop65plus,
               main = "(d) Studentized residuals against population 65+"))
with(cdi, points(pop65plus[I.Olmsted], r[I.Olmsted], col = "blue", pch = 8))
abline(h = c(-2, 0, 2), lty = 2)

### (e) ###
s.i <- influence(mod.8)$sigma

with(cdi, plot(s.i ~ id, main = "Leave-one-out sigma-estimates"))
with(cdi, points(id[I.Olmsted], s.i[I.Olmsted], col = "blue", pch = 8))

### (f) ###
limit.cook <- c(1, 4 / nrow(cdi))
D <- cooks.distance(mod.8)

# D is plotted against id here, so the title says so.
with(cdi, plot(D ~ id, ylim = c(0, 1),
               main = "Cook's distance against id"))
with(cdi, points(id[I.Olmsted], D[I.Olmsted], col = "blue", pch = 8))
abline(h = limit.cook)

### (g) ###
par(mfrow = c(2, 2))
dfb <- dfbetas(mod.8)
limit.dfb <- c(-1, -2 / sqrt(nrow(cdi)), 0, 2 / sqrt(nrow(cdi)), 1)

with(cdi, plot(dfb[, 1] ~ id, main = "(a) dfbeta for the intercept",
               ylim = c(-1, 1), ylab = "dfbeta_0"))
with(cdi, points(id[I.Olmsted], dfb[I.Olmsted, 1], pch = 8, col = "blue"))
abline(h = limit.dfb)

with(cdi, plot(dfb[, 2] ~ id, main = "(b) dfbeta for per capita income",
               ylim = c(-1, 1), ylab = "dfbeta_percapitaincome"))
with(cdi, points(id[I.Olmsted], dfb[I.Olmsted, 2], pch = 8, col = "blue"))
abline(h = limit.dfb)

with(cdi, plot(dfb[, 3] ~ id, main = "(c) dfbeta for crime per 1000",
               ylim = c(-1, 1), ylab = "dfbeta_crm1000"))
with(cdi, points(id[I.Olmsted], dfb[I.Olmsted, 3], pch = 8, col = "blue"))
abline(h = limit.dfb)

with(cdi, plot(dfb[, 4] ~ id, main = "(d) dfbeta for population 65+",
               ylim = c(-1, 1), ylab = "dfbeta_pop65plus"))
with(cdi, points(id[I.Olmsted], dfb[I.Olmsted, 4], pch = 8, col = "blue"))
abline(h = limit.dfb)


Figure 9: residuals without Kings, NY
[Plot omitted. Four panels with the studentized residuals r on the y-axis: (a) against predicted values (pred.8), (b) against per capita income, (c) against crime per 1000, (d) against population 65+.]


Figure 10: sigma without Kings, NY
[Plot omitted. One panel, "Leave-one-out sigma-estimates", with s.i on the y-axis against id.]

Figure 11: Cook's distance without Kings, NY
[Plot omitted. One panel with Cook's distance D on the y-axis against id.]


Figure 12: dfbetas without Kings, NY
[Plot omitted. Four panels against id: (a) dfbeta for the intercept (dfbeta_0), (b) dfbeta for per capita income (dfbeta_percapitaincome), (c) dfbeta for crime per 1000 (dfbeta_crm1000), (d) dfbeta for population 65+ (dfbeta_pop65plus).]
