Predicting Hospital Productivity from Capacity Metrics - A Linear Regression Model


Madeleine Organ

ACMS 30600: Prof. Huebner

April 27, 2016

Disclaimer: This paper was the final project for my Statistical Methods and Data

Analysis course. Its purpose is to demonstrate understanding of rigorous statistical

methods, not robust research. Its findings are therefore necessarily limited and should

be viewed only in this context.

I. INTRODUCTION

In this paper, I analyze capacity metrics of hospitals in the top 200 Hospital Referral Regions (HRRs) across America. I have constructed a linear regression model predicting medical discharges per 1,000 Medicare enrollees (collapsed over gender) in 2012 from four capacity or labor metrics: hosp.phys measures the hospital-based physician rate (per 100,000 residents); hosp.rn measures the hospital-based registered nurse rate (per 1,000 residents); hosp.emp measures the hospital employee rate (per 1,000 residents); finally, beds measures the acute care hospital bed rate (per 1,000 residents). The different population bases of these rates do not affect the fit or the significance tests of the regression; the slope parameter for hosp.phys is simply one hundredth the size it would have been had hosp.phys been measured per 1,000 residents. It is only in comparing the magnitudes of the slope parameters that the slope for hosp.phys would need to be multiplied by 100 to put it on the same per-1,000 basis as the slopes for hosp.rn, hosp.emp, and beds. In addition, data for physician rates were collected in 2011 while the other metrics were collected in 2012; while not ideal, this should not greatly disturb the model, since it is reasonable to assume these rates stay fairly similar, on average, from one year to the next. A summary of the variable names is provided in Table 1, below.

Table 1: Variable Names and Descriptions for Model 1

Variable    Description
hosp.phys   Hospital-Based Physicians per 100,000 Residents (collapsed over specialty), 2011
hosp.rn     FTE Hospital-Based Registered Nurses per 1,000 Residents, 2012
hosp.emp    FTE Hospital Employees per 1,000 Residents, 2012
beds        Acute Care Hospital Beds per 1,000 Residents, 2012
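The effect of the per-100,000 versus per-1,000 scaling on a slope can be checked with a small sketch. The data here are simulated purely for illustration (they are not the Dartmouth Atlas values), and the names phys_per_1k and phys_per_100k are hypothetical.

```r
# Simulated data only: these numbers are NOT from the Dartmouth Atlas.
set.seed(1)
n <- 200
phys_per_1k   <- runif(n, 0.1, 0.5)     # hypothetical physician rate per 1,000
phys_per_100k <- 100 * phys_per_1k      # the same quantity expressed per 100,000
y <- 200 - 50 * phys_per_1k + rnorm(n, sd = 5)

fit_1k   <- lm(y ~ phys_per_1k)
fit_100k <- lm(y ~ phys_per_100k)

b_1k   <- unname(coef(fit_1k)[2])
b_100k <- unname(coef(fit_100k)[2])
ratio  <- b_1k / b_100k                 # slope per 1,000 is 100 times the slope per 100,000

# The t statistic (and hence the p-value) is unchanged by the rescaling.
t_1k   <- summary(fit_1k)$coefficients[2, "t value"]
t_100k <- summary(fit_100k)$coefficients[2, "t value"]
c(ratio = ratio, t_1k = t_1k, t_100k = t_100k)
```

Rescaling a predictor changes only the units of its slope, so fit and significance are unaffected; the factor of 100 matters only when slope magnitudes are compared across predictors.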

The discharges response variable essentially measures the productive capacity of the

hospital; a higher-producing hospital will discharge more patients. Because the discharges

variable is a rate as well, the population of the HRR does not confound the analysis. The goal of

this analysis is to determine the extent to which these four variables – rates of hospital

physicians, nurses, general employees, and beds – affect the overall productive capacity of the

hospital, as seen in their discharge rates. Significant results would give hospital strategists and

administrators insight into ways to target resources that will have the biggest impact on their

production. Healthcare decisions in general have major implications for the surrounding

populations, not only for the direct delivery of care but also because of the vast sums of money

spent on healthcare each year by individuals and governments. Studying healthcare data sets like

this one can provide insight into the most meaningful types of spending and indicate where

spending could be reduced, if necessary.

II. DATA

The data used in this report were obtained from the Dartmouth Atlas of Health Care [1]. This online database houses Medicare data from hospitals, grouped locally, regionally, and nationally and indexed by hospital and affiliated physicians. The Medicare population as used in this dataset includes individuals between the ages of 65 and 99 not enrolled in a risk-bearing health maintenance organization. The insurance claims databases come from a federal agency, the Centers for Medicare and Medicaid Services (CMS), that collects data for every person and provider using Medicare health insurance. In addition, some data in the Atlas were obtained from the U.S. Census, the American Hospital Association (AHA), the American Medical Association (AMA), and the National Center for Health Statistics. Topics included in this database include Medicare reimbursement data, demographics of the Medicare population of each region, interaction with/use of the health care system (contact days, number of clinicians seen, etc.), cancer screening, prescription drug use, quality/effective care, hospital use data, medical discharge data, and hospital and physician capacity, among others. The topics used in this project were mainly hospital use and capacity. The Dartmouth Atlas project has collected data for 20 years, and all Atlas reports and publications are available on its web site. The Atlas uses small area analysis, a population-based methodology, to focus on the experience of the population living in a defined geographic area or using a specific hospital (alternative methodologies consider only the number of procedures performed at a hospital, without correcting for size of population served). The data used in this analysis are organized by Hospital Referral Region (HRR); each represents a regional health care market and contains at least one hospital that regularly performs cardiovascular surgery and neurosurgery. There are 306 designated HRRs, and each has a minimum population size of 120,000; this analysis uses the top 200.

[1] http://www.dartmouthatlas.org/data/table.aspx?ind=186&tf=34&ch=32&loc=&loct=3&addn=ind-140_ch-6_tf-32,ind-139_tf-34,ind-135_tf-34,ind-138_tf-34&fmt=221

III. REGRESSION ANALYSIS

Exploratory data analysis via scatter plots indicates a positive correlation exists between

discharges and hospital nurses, discharges and general hospital employees, and discharges and

beds, as shown in Figure 1 (a) – (c) below:

Figure 1(a). Scatter Plot of Hospital Nurses vs. Discharges (2012)

Figure 1(b). Scatter Plot of Hospital Employees vs. Discharges (2012)

Figure 1(c). Scatter Plot of Beds vs. Discharges (2012)

As indicated in Table 2, below, the strongest correlation is between hospital nurses and

general hospital employees (correlation = 0.7682). This indicates that there is a possibility for a

severe multicollinearity problem; however, VIF analysis indicated that no such severe problem

exists (see Appendix, part C). Practically, it is reasonable that nurses and general hospital

employees would be correlated, as a large portion of hospital employees are nurses.

Table 2. Correlation Matrix for Predictor Variables

            hosp.phys     hosp.rn    hosp.emp        beds
hosp.phys   1.0000000  -0.2967407  -0.2433095  -0.4530859
hosp.rn    -0.2967407   1.0000000   0.7682780   0.6540995
hosp.emp   -0.2433095   0.7682780   1.0000000   0.5511914
beds       -0.4530859   0.6540995   0.5511914   1.0000000
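As a cross-check on the VIF analysis, the variance inflation factors can be sketched directly from the correlation matrix in Table 2, since each VIF is the corresponding diagonal element of the inverse correlation matrix. The auxiliary R^2 values below are the ones hard-coded in the appendix code; the object names are illustrative.

```r
# Correlation matrix copied from Table 2.
vars <- c("hosp.phys", "hosp.rn", "hosp.emp", "beds")
R <- matrix(c( 1.0000000, -0.2967407, -0.2433095, -0.4530859,
              -0.2967407,  1.0000000,  0.7682780,  0.6540995,
              -0.2433095,  0.7682780,  1.0000000,  0.5511914,
              -0.4530859,  0.6540995,  0.5511914,  1.0000000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# VIF_j is the j-th diagonal element of the inverse correlation matrix.
vif_from_cor <- diag(solve(R))

# Same quantity from the auxiliary-regression R^2 values in Appendix, Part A.
r2_aux <- c(hosp.phys = 0.2054, hosp.rn = 0.6667, hosp.emp = 0.5944, beds = 0.5602)
vif_from_r2 <- 1 / (1 - r2_aux)

round(rbind(vif_from_cor, vif_from_r2), 3)
```

Both routes should agree up to rounding, and every value sits well below the conventional cutoff of 10 used in this paper.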

A first regression model (mod1) is fitted using all four predictor variables (hospital physicians, hospital nurses, general hospital employees, and beds) to predict the response variable of discharges. The multiple coefficient of determination (R^2) for mod1 is 0.3686; the adjusted R^2 is 0.3557. A summary of the model is below:

> summary(mod1)

Call:
lm(formula = discharges ~ hosp.phys + hosp.rn + beds + hosp.emp)

Residuals:
    Min      1Q  Median      3Q     Max
-76.666 -20.942  -0.593  20.254  76.463

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  124.584     26.243   4.747 3.98e-06 ***
hosp.phys     -1.041      0.756  -1.377   0.1702
hosp.rn       19.861      4.959   4.005 8.80e-05 ***
beds          24.406      5.798   4.210 3.90e-05 ***
hosp.emp      -1.845      1.063  -1.735   0.0843 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.32 on 195 degrees of freedom
Multiple R-squared:  0.3686,    Adjusted R-squared:  0.3557
F-statistic: 28.46 on 4 and 195 DF,  p-value: < 2.2e-16

A second, reduced model was constructed to test the significance of the hospital physicians variable, since it was not individually significant in the summary of the initial model above. A partial F test (ANOVA) was conducted comparing the following hypotheses:

H0: $\beta_{hosp.phys} = 0$; hosp.phys can be removed

Ha: $\beta_{hosp.phys} \neq 0$; hosp.phys cannot be removed

The results of the ANOVA analysis are provided in the following R output:

Analysis of Variance Table

Model 1: discharges ~ hosp.rn + beds + hosp.emp
Model 2: discharges ~ hosp.phys + hosp.rn + beds + hosp.emp
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    196 181032
2    195 179290  1    1742.5 1.8952 0.1702

The p-value of 0.1702 exceeds $\alpha = 0.05$; therefore, I fail to reject H0. The variable hosp.phys can be removed from the model, as its association with discharges, given the other predictors, is not significant.
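The F statistic and p-value in the ANOVA table can be reproduced by hand from the residual sums of squares. This is a minimal sketch using the RSS values printed in the R output (taken from the stepwise deviance table, which shows more digits); the variable names are illustrative.

```r
# Residual sums of squares and residual degrees of freedom from the R output.
rss_reduced <- 181032.4; df_reduced <- 196   # discharges ~ hosp.rn + beds + hosp.emp
rss_full    <- 179289.9; df_full    <- 195   # full model, adds hosp.phys

q <- df_reduced - df_full                    # number of coefficients being tested
f_stat <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_val  <- pf(f_stat, q, df_full, lower.tail = FALSE)

round(c(F = f_stat, p = p_val), 4)
```

Because only one coefficient is tested, this F statistic is the square of the t statistic for hosp.phys in the full-model summary, which is why the two p-values coincide.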

While the reduction to Model 2 improved the model, two further methods were used to look for the optimal model: first, the stepAIC() variable selection procedure and second, principal components (a data reduction method).

The stepAIC procedure indicated that the optimal model does not include the variable hosp.phys; results of the analysis are given below:

> library(MASS)
> optimal.hosp<-stepAIC(mod1)
Start:  AIC=1369.69
discharges ~ hosp.phys + hosp.rn + beds + hosp.emp

            Df Sum of Sq    RSS    AIC
- hosp.phys  1    1742.5 181032 1369.6
<none>                   179290 1369.7
- hosp.emp   1    2768.9 182059 1370.8
- hosp.rn    1   14750.2 194040 1383.5
- beds       1   16293.4 195583 1385.1

Step:  AIC=1369.62
discharges ~ hosp.rn + beds + hosp.emp

           Df Sum of Sq    RSS    AIC
<none>                  181032 1369.6
- hosp.emp  1    2821.4 183854 1370.7
- hosp.rn   1   14832.9 195865 1383.4
- beds      1   23318.8 204351 1391.9
> optimal.hosp$anova #display results
Stepwise Model Path
Analysis of Deviance Table

Initial Model:
discharges ~ hosp.phys + hosp.rn + beds + hosp.emp

Final Model:
discharges ~ hosp.rn + beds + hosp.emp

         Step Df Deviance Resid. Df Resid. Dev      AIC
1                               195   179289.9 1369.688
2 - hosp.phys  1 1742.486       196   181032.4 1369.623
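The AIC values in the stepAIC trace can be recovered from the residual sums of squares alone. The sketch below assumes the criterion that extractAIC() uses for linear models, n * log(RSS/n) + 2p, with n = 200 observations (195 residual degrees of freedom plus 5 coefficients); aic_lm is a hypothetical helper name.

```r
# AIC for a linear model as used by extractAIC()/stepAIC(), up to an additive constant:
#   AIC = n * log(RSS / n) + 2 * p,  where p is the number of estimated coefficients.
aic_lm <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 200                                  # 195 residual df + 5 coefficients in mod1
aic_mod1 <- aic_lm(179289.9, n, p = 5)    # intercept + 4 predictors
aic_mod2 <- aic_lm(181032.4, n, p = 4)    # intercept + 3 predictors

round(c(mod1 = aic_mod1, mod2 = aic_mod2), 3)
```

Dropping hosp.phys raises the RSS slightly but saves one parameter, and the penalty saved (2) just outweighs the loss of fit, which is why the two AIC values differ only in the second decimal place.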

Thus, the resultant second model (mod2) predicts discharges from hospital nurses, general

hospital employees, and beds. Summary statistics for this model are given below.

> summary(mod2)

Call:
lm(formula = discharges ~ hosp.rn + beds + hosp.emp)

Residuals:
    Min      1Q  Median      3Q     Max
-73.978 -20.790   1.214  18.521  79.607

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   92.784     12.483   7.433 3.22e-12 ***
hosp.rn       19.916      4.970   4.007 8.72e-05 ***
beds          27.263      5.426   5.025 1.13e-06 ***
hosp.emp      -1.862      1.065  -1.748   0.0821 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.39 on 196 degrees of freedom
Multiple R-squared:  0.3625,    Adjusted R-squared:  0.3527
F-statistic: 37.15 on 3 and 196 DF,  p-value: < 2.2e-16

Because of the correlation between hospital nurses and hospital employees (see Table 2), I conducted an additional principal components analysis for this optimal model without hosp.phys. If this third model retained predictive power, it might be preferable to the second model because its predictors are uncorrelated, which removes any multicollinearity. My interpretation of the scree plot in Figure 3, below, is that two principal components should be used to replace the three remaining variables.

Figure 3. Scree Plot used for Principal Components Analysis

The third model (mod3) was constructed predicting discharges from the first two principal

components; its summary statistics are reported below.

> mod3<-lm(discharges~PC1+PC2)
> summary(mod3)

Call:
lm(formula = discharges ~ PC1 + PC2)

Residuals:
    Min      1Q  Median      3Q     Max
-70.237 -22.231   1.038  17.896  88.048

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  206.799      2.179  94.911  < 2e-16 ***
PC1          -13.458      1.434  -9.384  < 2e-16 ***
PC2          -11.985      3.202  -3.742 0.000239 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.81 on 197 degrees of freedom
Multiple R-squared:  0.3413,    Adjusted R-squared:  0.3346
F-statistic: 51.03 on 2 and 197 DF,  p-value: < 2.2e-16
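The reason a principal components model has no multicollinearity can be illustrated with a short sketch. The predictors below are simulated and deliberately correlated (they are hypothetical stand-ins, not the Atlas variables); the steps mirror the appendix code: scale, run prcomp(), then regress on the first two components.

```r
# Simulated predictors only (hypothetical values, not the Atlas data).
set.seed(2)
n  <- 200
z  <- rnorm(n)
x1 <- z + rnorm(n, sd = 0.6)    # three deliberately correlated predictors
x2 <- z + rnorm(n, sd = 0.6)
x3 <- z + rnorm(n, sd = 0.9)
y  <- 200 + 10 * x1 + 8 * x3 + rnorm(n, sd = 20)

pca <- prcomp(scale(cbind(x1, x2, x3)))   # PCA on z-scores
PC1 <- pca$x[, 1]
PC2 <- pca$x[, 2]

cor_pc <- cor(PC1, PC2)                           # zero up to rounding error
var_explained <- pca$sdev^2 / sum(pca$sdev^2)     # basis of the scree plot
fit_pc <- lm(y ~ PC1 + PC2)

round(c(cor_PC1_PC2 = cor_pc, var_explained), 4)
```

The trade-off is interpretability: each component is a weighted mix of all the original predictors, so its slope no longer describes the effect of any single capacity metric.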

I conducted an investigation of both Model 2 and Model 3 to test for influential

observations and outliers; neither influential observations nor outliers were found in either model.

> #Check for influential observations
> ##cooks.distance(mod2)
> which(cooks.distance(mod2)>1)
named integer(0)
> which(cooks.distance(mod3)>1)
named integer(0)
> # Check for Outliers
> which(abs(rstandard(mod2))>3)
named integer(0)
> which(abs(rstandard(mod3))>3)
named integer(0)

In choosing a final, optimal model, I considered various factors, including adjusted R^2 [2], AIC, and the ANOVA and stepAIC procedures. My results are summarized in Table 3(a) below.

Table 3(a). Summary of Metrics of Model Optimality

MODEL NAME  PREDICTORS                          ADJUSTED R^2  AIC       NOTES
mod1        hosp.phys, hosp.rn, hosp.emp, beds  0.3557        1369.69   By F-test (ANOVA), hosp.phys was not deemed a significant
                                                                        predictor; this model should be eliminated from consideration.
mod2        hosp.rn, hosp.emp, beds             0.3527        1369.62   Optimal model designated by StepAIC
mod3        PC1, PC2                            0.3346        1374.166  Principal component analysis was conducted on the already
                                                                        reduced model (i.e., the model without hosp.phys). This model
                                                                        has no multicollinearity.

After analysis of these factors, I choose Model 2, the model with hosp.rn, hosp.emp and beds as predictors. Among the models still under consideration (Model 1 having been eliminated by the F test), it has the highest adjusted R-squared value; it also has the lowest AIC value and was designated by the stepAIC procedure as the optimal model. A secondary candidate would be Model 3, which, by nature of its principal components, has no multicollinearity, but it has a lower adjusted R-squared and a higher AIC value, which makes it less desirable than Model 2. The criterion values are quite close, however; an argument for Model 3 would be its lack of multicollinearity and its simplicity (it uses only two variables in comparison with three in Model 2). However, since there was no real need to perform the principal components analysis (the VIF analysis indicated that the multicollinearity was not severe) and because retaining variables in their original form makes for more straightforward interpretation of slope parameters, I will use Model 2 for all further analysis.

[2] Since the models under investigation here have differing numbers of variables, I use adjusted R^2 rather than simple R^2 to compare predictive power; as with R^2, a higher adjusted R^2 is preferable.

Model 2, my final choice, gives:

$discharges = 92.784 + 19.916 \cdot (hosp.rn) + 27.263 \cdot (beds) - 1.862 \cdot (hosp.emp)$

Finally, I perform model diagnostics on Model 2 to check whether it satisfies the model assumptions [3]. Figure 4 below shows a residuals vs. fitted values plot; its roughly circular cloud of points, with no discernible trend or funnel shape, suggests that the assumptions of mean-zero errors and constant variance are upheld.

Figure 4. Residuals vs. Fitted Values Plot

[3] The assumptions of the model are as follows: the error term (1) is normally distributed and (2) has mean 0 with (3) constant variance $\sigma^2$, and (4) all pairs of error terms are uncorrelated.

Figure 5, below, is a quantile-quantile plot used to check Assumption (1); namely, that the error term is normally distributed:

Figure 5. Quantile-Quantile Plot (Model 2)

Since the points in the quantile-quantile plot fall close to a straight line, Assumption (1) appears reasonable; the residuals are consistent with a normally distributed error term.

A bootstrap resampling technique was used to validate the model: 100 samples were drawn with replacement from the original data, the model was refit on each, and each refitted model's R^2 on its own bootstrap sample was compared with its R^2 on the original data to estimate how over-fit (optimistic) the model is. The Evaluation R^2 obtained from this validation was 0.3356, which is quite close to the original (Apparent) value of 0.3527 (the adjusted R^2 of Model 2, which the appendix code uses as the apparent value). If the Evaluation R^2 were substantially lower than the Apparent R^2, we might conclude that the model was over-fit and too optimistic; however, this is not the case for Model 2. Thus, Model 2 (predicting discharges from hospital employees, hospital nurses, and beds) shows little sign of over-fitting, and it is reasonable to expect it to generalize to new data about as well as it fits these data.
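Written out, the validation formula used in the appendix code takes the following form. The subscripts are notation introduced here for illustration: $R^2_{\text{boot},b}$ is the R^2 of the b-th bootstrap model on its own bootstrap sample, and $R^2_{\text{test},b}$ is the R^2 of that same model on the original data.

```latex
\[
R^2_{\text{evaluation}}
  = R^2_{\text{apparent}}
  - \frac{1}{B}\sum_{b=1}^{B}\left(R^2_{\text{boot},b} - R^2_{\text{test},b}\right),
\qquad B = 100
\]
\[
\text{estimated optimism}
  = R^2_{\text{apparent}} - R^2_{\text{evaluation}}
  = 0.3527 - 0.3356 = 0.0171
\]
```

An estimated optimism of under two percentage points is small relative to the apparent value, which is the basis for the conclusion about over-fitting.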

IV. RESULTS

Since the Bootstrapping technique confirmed that Model 2 is valid, I now use it to

conduct inferences.

First, I determine a 95% confidence interval for $\beta_{hosp.rn}$; this gives a range in which we are 95% confident the true value of the slope for hosp.rn falls.

> # CI for B.hosp.rn
> CI.hosp.rn.low<-19.916-(1.96)*4.970
> CI.hosp.rn.up<-19.916+(1.96)*4.970
> CI.hosp.rn.low; CI.hosp.rn.up
[1] 10.1748
[1] 29.6572

Thus, for the hospital nurses variable, we can be 95% confident that the true value of $\beta_{hosp.rn}$ is within the interval [10.1748, 29.6572]. Since this interval does not contain zero, we can also say that the association between hospital nurses and discharges is significant; that is, we are 95% confident the slope parameter is not zero. Practically, this indicates that one additional FTE hospital nurse per 1,000 residents is associated with between 10.17 and 29.66 additional discharges per 1,000 Medicare enrollees, holding beds and hospital employees fixed. Hospital strategists can consider discharge payoffs such as these when making nurse labor force hiring and scheduling decisions. The range is somewhat large, which reflects the sizeable standard error of this estimate and the low predictive power of this model.
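The interval above uses the normal critical value 1.96. As a hedged comparison, the sketch below recomputes it with the t critical value for the 196 residual degrees of freedom of Model 2, using only the estimate and standard error reported in the summary output; the object names are illustrative.

```r
# Estimate and standard error for hosp.rn as reported in summary(mod2).
est <- 19.916
se  <- 4.970
df  <- 196                       # residual degrees of freedom of Model 2

z_crit <- 1.96                   # normal approximation used in the paper
t_crit <- qt(0.975, df)          # t critical value, slightly larger than 1.96

ci_z <- est + c(-1, 1) * z_crit * se
ci_t <- est + c(-1, 1) * t_crit * se

rbind(normal = ci_z, t = ci_t)
# With the fitted object available, confint(mod2, "hosp.rn") returns the t-based interval.
```

With this many degrees of freedom the two intervals differ only slightly, so the conclusions drawn from the normal-based interval are unchanged.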

Likewise, I determine a 95% confidence interval for $\beta_{hosp.emp}$: we can be 95% confident that the true value of the slope for hosp.emp is within the range [-3.9494, 0.2254]. This is a somewhat troubling result, however, because the interval contains zero; that is, there might be no relationship between hospital employees and discharges once nurses and beds are accounted for. Practically, this leaves hospital planners with a great deal of uncertainty as they make staffing decisions about employee numbers. To check this result, I created two additional models and conducted a partial F test comparing each with the full model (see Appendix, part B, for summaries of these models):

> mod4<-lm(discharges~hosp.rn+beds) #model does not include hosp.emp or hosp.phys
> anova(mod4,mod1)
Analysis of Variance Table

Model 1: discharges ~ hosp.rn + beds
Model 2: discharges ~ hosp.phys + hosp.rn + beds + hosp.emp
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1    197 183854
2    195 179290  2    4563.9 2.4819 0.08622 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> mod5<-lm(discharges~hosp.phys+hosp.rn+beds) #model does not include hosp.emp
> anova(mod5,mod1)
Analysis of Variance Table

Model 1: discharges ~ hosp.phys + hosp.rn + beds
Model 2: discharges ~ hosp.phys + hosp.rn + beds + hosp.emp
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1    196 182059
2    195 179290  1    2768.9 3.0116 0.08425 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

For each of these partial F tests, the hypotheses are as follows:

H0: $\beta_j = 0$ for every variable in the removed subset; subset can be removed

Ha: $\beta_j \neq 0$ for at least one variable in the removed subset; subset cannot be removed

For Model 4, the p-value of 0.08622 exceeds $\alpha = 0.05$; therefore, I fail to reject H0. Both variables hosp.phys and hosp.emp can be removed from the model, as their joint association with discharges is not significant.

For Model 5, the p-value of 0.08425 exceeds $\alpha = 0.05$; therefore, I fail to reject H0. The single variable hosp.emp can be removed from the model, as its association with discharges is not significant.

This analysis indicates that hospital employees is not a significant variable at the 0.05 level and could be removed from the model. However, the bootstrap resampling validation showed little over-fitting in Model 2 (which includes hospital employees as a variable); moreover, Model 2 was selected by the stepAIC procedure as the optimal model, and its AIC is lower than that of Model 4. Therefore, I will retain hospital employees as a predictor and Model 2 as my model of choice; however, practically, the effect of additional hospital employees is either ambiguous (it likely depends on factors beyond the scope of this model, or is confounded by the generality of the variable) or quite low in magnitude. It should be noted that both the confidence interval and the hypothesis tests require the standard error in their computations. Standard errors are driven upward by high levels of multicollinearity. While there was no problem of severe multicollinearity (that is, no VIF > 10), as documented earlier (see Appendix, part C), hosp.rn and hosp.emp did have a high correlation (0.7682). This could play a role in inflating the standard error, which widens the confidence interval and inflates the p-value (thereby prompting a failure to reject H0 when it perhaps should be rejected). For these reasons as well as the AIC analysis, I will retain Model 2 as my optimal model. An updated summary table of all the tested models is given below.

Table 3(b). Summary of Metrics of Model Optimality

MODEL NAME  PREDICTORS                          ADJUSTED R^2  AIC       NOTES
mod1        hosp.phys, hosp.rn, hosp.emp, beds  0.3557        1369.69   By F-test (ANOVA), hosp.phys was not deemed a significant
                                                                        predictor; this model should be eliminated from consideration.
mod2        hosp.rn, hosp.emp, beds             0.3527        1369.62   Optimal model designated by StepAIC
mod3        PC1, PC2                            0.3346        1374.166  Principal component analysis was conducted on the already
                                                                        reduced model (i.e., the model without hosp.phys). This model
                                                                        has no multicollinearity.
mod4        hosp.rn, beds                       0.346         1370.716  Performed because CI hypothesis test indicated that hosp.emp
                                                                        was not a significant predictor (likewise does not include
                                                                        hosp.phys)
mod5        hosp.phys, hosp.rn, beds            0.3491        1370.75   Performed because CI hypothesis test indicated that hosp.emp
                                                                        was not a significant predictor (however, does include
                                                                        hosp.phys)

Finally, I consider a confidence interval and prediction interval for the predicted discharges of an observation under the following conditions (chosen because they are the conditions for the HRR of South Bend, IN):

hosp.phys.val<-27.2
hosp.rn.val<-3.8
beds.val<-1.9
hosp.emp.val<-14.2

The estimated value for discharges is $\hat{y}$ = 193.8255 (193.8241 when computed by hand from the rounded coefficients, as below); this compares with the observed value of y = 199.3 discharges (the discrepancy here is reflective of the relatively low predictive power of this model).

y.hat.SB<-92.784+19.916*hosp.rn.val+27.263*beds.val-1.862*hosp.emp.val
y.hat.SB
[1] 193.8241

A 95% confidence interval for the mean response at these values was calculated to be [188.87, 198.78]; that is, one can be 95% confident that the mean discharge rate for all HRRs under the designated conditions is contained in this interval.

CI.yhat.SB<-predict(mod2, data.frame(hosp.rn=3.8, beds=1.9, hosp.emp=14.2), interval = "confidence", level=0.95)
CI.yhat.SB
       fit      lwr      upr
1 193.8255 188.8666 198.7843

A 95% prediction interval for an individual response y was calculated to be [133.68, 253.97]; that is, one can be 95% confident that an individual HRR's discharge rate under the designated conditions will fall in this interval.

PI.yhat.SB<-predict(mod2, data.frame(hosp.rn=3.8, beds=1.9, hosp.emp=14.2), interval = "prediction", level=0.95)
PI.yhat.SB
       fit      lwr      upr
1 193.8255 133.6846 253.9663

The prediction interval is necessarily wider because of the greater inherent variability in an individual observation in comparison with a mean.
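The relationship between the two intervals can be made concrete with a short sketch that rebuilds the prediction interval from the confidence interval. It uses only numbers reported in the output above (fitted value, interval endpoints, and the residual standard error of 30.39 on 196 degrees of freedom); small differences from the reported endpoints are expected because those inputs are rounded.

```r
# Reported for Model 2 at the South Bend values.
fit    <- 193.8255
ci     <- c(188.8666, 198.7843)   # 95% CI for the mean response
pi_rep <- c(133.6846, 253.9663)   # 95% PI reported by predict()
s      <- 30.39                   # residual standard error (196 df)
t_crit <- qt(0.975, 196)

se_fit  <- (ci[2] - ci[1]) / 2 / t_crit   # standard error of the fitted mean
se_pred <- sqrt(s^2 + se_fit^2)           # adds the variance of a new error term
pi_calc <- fit + c(-1, 1) * t_crit * se_pred

rbind(reported = pi_rep, recomputed = pi_calc)
```

Almost all of the prediction interval's width comes from the residual standard error rather than from uncertainty in the fitted mean, so collecting more HRRs would narrow the confidence interval far more than the prediction interval.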

This inference may be the most useful for hospital strategists as they can now utilize the

model with their specific rates for hospital nurses, general hospital employees, and beds, and

predict how many discharges their hospital will amass per 1,000 Medicare enrollees in a given

year. Moreover, the model enables such strategists to simulate new configurations of resources

(at far lower cost than actually trying them) to determine the best outcomes possible from

their scarce resources.

V. CONCLUSIONS

While this model provides a good framework for beginning to understand the predictors

of hospital capacity as measured by discharge rates, its limited predictive power (low R^2 and

adjusted R^2) indicates that these variables are not adequate on their own to predict hospital

discharges. In particular, the R^2 for Model 2 is 0.3625 (adjusted R^2 = 0.3527), which means that

100 - 36.25 = 63.75% of the variation in discharges is not explained by the predictors in this model. This is

reasonable, as a huge limitation of the dataset is lack of data on quality of the doctors and nurses

in a given HRR. Differences in quality of care can impact healthcare consumers’ decisions about

which hospitals they choose for their care; for example, if a consumer is near the border of two

designated HRRs, they will likely choose their care based on the quality of physicians or

facilities in each, neither of which is accounted for in this model. In addition, the model uses the

simplifying assumption of discharges as a measure of hospital productivity; however, this can be

misleading since any one person (accounting for only one discharge) can amass a multitude of

tests and procedures during their stay in the hospital which are not reflected in the data. This

problem may be intensified by the fact that higher-quality hospitals typically take on sicker

patients because of their more skilled physicians and greater medical and technical resources. It

is these patients that will either amass many procedures and still count as simply one patient, or

alternatively, fail to survive treatment (that is, are not discharged) and would not appear in this

dataset at all. This model could be improved by further data assessing the total number of

procedures performed (perhaps even weighted to account for different difficulty or risk factors)

or patient hours in the hospital as a better measure of hospital productivity. Further, it should

include a measure of physician and other caregiver quality rather than simply a rate of how many

caregivers a particular HRR employs.

Analysis of this model prompts several future questions. This model found no statistically

significant association between the number of physicians in a given HRR and the number of

discharges; however, it seems unlikely that physicians have no effect. Perhaps this potential error

reflects the argument above, that physician quality, rather than quantity, should be considered, but

more likely the lack of a physician component in predicting hospital discharges is due to a general

omitted variable bias: this model simply does not include the right variables to obtain a reliable

prediction. Further research should be done,

then, to determine the degree to which physicians affect hospital productivity, and in particular if

this relationship is driven by simply quantity or by quality. Another avenue for research would

be comparing the relative need for physicians in comparison with needs for nurses and other

hospital staff. Here the quantitative model may be useful, albeit limited by the unknown quality

of the employees – perhaps one exceptional nurse could do the work of three average ones, and

there is no place for such a distinction in this model. Hospital administrators can utilize the

results from analyses like these to make strategic planning and employment decisions in order to

maximize production under conditions of limited resources.

VI. APPENDIX

Part A. R Code:

#PREDICTING HOSPITAL DISCHARGES FROM CAPACITY MEASURES:

#read in data

data2<-read.table("Capacity2.txt", header=T)

attach(data2)

#____________________________________________________________________

# EXPLORATORY DATA ANALYSIS

#make scatterplots of X variables vs. Y

plot(hosp.phys,discharges)

plot(hosp.rn, discharges)

plot(hosp.emp, discharges)

plot(beds, discharges)

#____________________

#correlation matrix

cor(cbind(hosp.phys, hosp.rn, hosp.emp, beds))

#____________________

# check VIF: regress each predictor on the other three; VIF = 1/(1 - R^2)

mod1a<-lm(hosp.phys~hosp.rn+hosp.emp+beds)
R2a<-summary(mod1a)$r.squared   # reported as 0.2054
VIFa<-1/(1-R2a)

mod1b<-lm(hosp.rn~hosp.phys+hosp.emp+beds)
R2b<-summary(mod1b)$r.squared   # reported as 0.6667
VIFb<-1/(1-R2b)

mod1c<-lm(hosp.emp~hosp.phys+hosp.rn+beds)
R2c<-summary(mod1c)$r.squared   # reported as 0.5944
VIFc<-1/(1-R2c)

mod1d<-lm(beds~hosp.phys+hosp.rn+hosp.emp)
R2d<-summary(mod1d)$r.squared   # reported as 0.5602
VIFd<-1/(1-R2d)

VIFa; VIFb; VIFc; VIFd

#____________________________________________________________________

# REGRESSION ANALYSIS

# full model

mod1<-lm(discharges~hosp.phys+hosp.rn+beds+hosp.emp)

summary(mod1)

#____________________

# conduct F test for removal of [hosp.phys] variable (this is the

#one of least significance)

mod2<-lm(discharges~hosp.rn+beds+hosp.emp) #reduced model

anova(mod2,mod1) # F test

#____________________

# variable selection/data reduction method:

# stepAIC:

library(MASS)

optimal.hosp<-stepAIC(mod1)

optimal.hosp$anova #display results

summary(mod2)

# -------------------

# principal components:

#find z-scores for original x's

x.scale<-scale(cbind(hosp.rn, beds, hosp.emp))

pca<-prcomp(x.scale)

#check the scree plot

screeplot(pca,type="lines")

#calculate principal components:

PC1<-pca$x[,1]

PC2<-pca$x[,2]

#use PC's in new model:

mod3<-lm(discharges~PC1+PC2)

summary(mod3)

#____________________

#Check for influential observations

##cooks.distance(mod2)

which(cooks.distance(mod2)>1)

which(cooks.distance(mod3)>1)

# Check for Outliers

which(abs(rstandard(mod2))>3)

which(abs(rstandard(mod3))>3)

#____________________

# indicate which is final model

extractAIC(mod1)[2]

extractAIC(mod2)[2]

extractAIC(mod3)[2]

#mod2!

#____________________

# perform model diagnostics

# residual vs. fitted plot

plot(mod2$fitted.values,mod2$residuals)

# normal plot of residuals - demonstrate model assumptions

plot(mod2$residuals)

qqnorm(mod2$residuals)

#--------------------------------------

## MODEL VALIDATION

####(3) Bootstrap validation

#Create new data set including the new variables

disch <- as.data.frame(cbind(discharges, hosp.rn,hosp.emp,beds))

n<-dim(disch)[1]

#Fit model to original data

mod.disch<-lm(discharges~hosp.rn+hosp.emp+beds)

summary(mod.disch)

###Draw a bootstrap sample, fit a linear regression model,
## (i) calculate the R^2 for this bootstrap model applied to the bootstrap data,
## (ii) calculate the R^2 for this bootstrap model applied to the original sample.
##Perform this process 100 times, recording quantities (i) and (ii) for each bootstrap sample
##Then use formula Evaluation=[Apparent-average(bootstrap-test)]

#Helper: R^2 of a set of predictions against the observed values
r2<-function(obs,pred) 1-sum((obs-pred)^2)/sum((obs-mean(obs))^2)

#Set seed to reproduce results
set.seed(5)

#Set up vectors to store (i) and (ii)
B<-100
R2.boot<-numeric(B)
R2.test<-numeric(B)

#Begin bootstrap sampling
for (b in 1:B){
  #Draw bootstrap sample, i.e., sample of n WITH REPLACEMENT from original data
  disch.boot<-disch[sample(1:n,n,replace=TRUE),]

  #fit linear model on bootstrap sample (same fit as glm() with its default gaussian family)
  mod.boot<-lm(discharges~hosp.rn+hosp.emp+beds,data=disch.boot)

  #(i) R^2 on bootstrap sample using model fit to bootstrap sample
  R2.boot[b]<-r2(disch.boot$discharges,predict(mod.boot,newdata=disch.boot))

  #(ii) R^2 on original sample using model fit to bootstrap sample; consider as our "test"
  R2.test[b]<-r2(disch$discharges,predict(mod.boot,newdata=disch))
}

#Apparent R^2 was taken as 0.3527 (the adjusted R^2 of the original model)
evaluation <- 0.3527-mean(R2.boot-R2.test)
evaluation

#----------------------------------------------------

#___________________________________________________________________

#___________________________________________________________________

# SECTION 4: RESULTS

# CI for B.hosp.rn

CI.hosp.rn.low<-19.916-(1.96)*4.970

CI.hosp.rn.up<-19.916+(1.96)*4.970

CI.hosp.rn.low; CI.hosp.rn.up

#____________________

# CI for B.hosp.emp

CI.hosp.emp.low<--1.862-(1.96)*1.065

CI.hosp.emp.up<--1.862+(1.96)*1.065

CI.hosp.emp.low; CI.hosp.emp.up

# ------- The below is not used in analysis -------------#

# CI for B.beds

CI.beds.low<-27.263-(1.96)*5.426

CI.beds.up<-27.263+(1.96)*5.426

CI.beds.low; CI.beds.up

#--------------------------------------------------------#
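The intervals above use the normal critical value 1.96. As a hedged comparison, the sketch below recomputes them with the t critical value for the 196 residual degrees of freedom reported for mod2 (about 1.972), which gives slightly wider intervals. The estimates and standard errors are copied from summary(mod2) in Part B; the helper ci_table is a hypothetical name. When the model object is available, confint(mod2) returns t-based intervals directly.

```r
# Estimates and standard errors copied from summary(mod2) in Part B
est <- c(hosp.rn = 19.916, hosp.emp = -1.862, beds = 27.263)
se  <- c(hosp.rn = 4.970,  hosp.emp = 1.065,  beds = 5.426)
df  <- 196   # residual degrees of freedom for mod2

# Hypothetical helper: two-sided interval for a given critical value
ci_table <- function(est, se, crit) {
  cbind(lower = est - crit * se, upper = est + crit * se)
}

ci.z <- ci_table(est, se, qnorm(0.975))    # normal approximation (about 1.96)
ci.t <- ci_table(est, se, qt(0.975, df))   # t-based, as confint() would use

ci.z
ci.t
```

Under either critical value, the interval for hosp.emp contains zero, which agrees with its p-value of 0.0821 in the mod2 summary.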

mod4<-lm(discharges~hosp.rn+beds) # model does not include hosp.phys or hosp.emp

summary(mod4)

anova(mod4,mod1)

extractAIC(mod4)[2]

mod5<-lm(discharges~hosp.phys+hosp.rn+beds) #model does not include hosp.emp

summary(mod5)

anova(mod5,mod1)

extractAIC(mod5)[2]

#____________________

# CI and PI for fitted value similar to South Bend, IN:

hosp.phys.val<-27.2

hosp.rn.val<-3.8

beds.val<-1.9

hosp.emp.val<-14.2

y.hat.SB<-92.784+19.916*hosp.rn.val+27.263*beds.val-1.862*hosp.emp.val

y.hat.SB

CI.yhat.SB<-predict(mod2, data.frame(hosp.rn=hosp.rn.val,

beds=beds.val, hosp.emp=hosp.emp.val), interval = "confidence",

level=0.95)

CI.yhat.SB

PI.yhat.SB<-predict(mod2, data.frame(hosp.rn=hosp.rn.val,

beds=beds.val, hosp.emp=hosp.emp.val), interval = "prediction",

level=0.95)

PI.yhat.SB

#____________________
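As a check on the hand-computed fitted value above, the same arithmetic can be written as a dot product of the mod2 coefficients (copied from Part B) and the South Bend-like predictor values. This is only an illustrative sketch; the vector names b and x are hypothetical. Note that hosp.phys.val is defined above but does not enter the calculation, since mod2 excludes hosp.phys.

```r
# mod2 coefficients copied from summary(mod2) in Part B
b <- c(intercept = 92.784, hosp.rn = 19.916, beds = 27.263, hosp.emp = -1.862)

# Predictor values used above for an HRR similar to South Bend, IN
# (the leading 1 multiplies the intercept)
x <- c(intercept = 1, hosp.rn = 3.8, beds = 1.9, hosp.emp = 14.2)

# 92.784 + 75.6808 + 51.7997 - 26.4404
y.hat <- sum(b * x)
y.hat   # approximately 193.8241
```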

Part B: Summaries of Models

> summary(mod1)

Call:
lm(formula = discharges ~ hosp.phys + hosp.rn + beds + hosp.emp)

Residuals:
    Min      1Q  Median      3Q     Max
-76.666 -20.942  -0.593  20.254  76.463

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  124.584     26.243   4.747 3.98e-06 ***
hosp.phys     -1.041      0.756  -1.377   0.1702
hosp.rn       19.861      4.959   4.005 8.80e-05 ***
beds          24.406      5.798   4.210 3.90e-05 ***
hosp.emp      -1.845      1.063  -1.735   0.0843 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.32 on 195 degrees of freedom
Multiple R-squared: 0.3686, Adjusted R-squared: 0.3557
F-statistic: 28.46 on 4 and 195 DF, p-value: < 2.2e-16

> summary(mod2)

Call:
lm(formula = discharges ~ hosp.rn + beds + hosp.emp)

Residuals:
    Min      1Q  Median      3Q     Max
-73.978 -20.790   1.214  18.521  79.607

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   92.784     12.483   7.433 3.22e-12 ***
hosp.rn       19.916      4.970   4.007 8.72e-05 ***
beds          27.263      5.426   5.025 1.13e-06 ***
hosp.emp      -1.862      1.065  -1.748   0.0821 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.39 on 196 degrees of freedom
Multiple R-squared: 0.3625, Adjusted R-squared: 0.3527
F-statistic: 37.15 on 3 and 196 DF, p-value: < 2.2e-16

> summary(mod3)

Call:
lm(formula = discharges ~ PC1 + PC2)

Residuals:
    Min      1Q  Median      3Q     Max
-70.237 -22.231   1.038  17.896  88.048

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  206.799      2.179  94.911  < 2e-16 ***
PC1          -13.458      1.434  -9.384  < 2e-16 ***
PC2          -11.985      3.202  -3.742 0.000239 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.81 on 197 degrees of freedom
Multiple R-squared: 0.3413, Adjusted R-squared: 0.3346
F-statistic: 51.03 on 2 and 197 DF, p-value: < 2.2e-16

> summary(mod4)

Call:
lm(formula = discharges ~ hosp.rn + beds)

Residuals:
    Min      1Q  Median      3Q     Max
-71.498 -21.105   1.814  17.660  82.158

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   90.046     12.449   7.233 1.02e-11 ***
hosp.rn       14.305      3.813   3.751 0.000231 ***
beds          26.310      5.427   4.848 2.52e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.55 on 197 degrees of freedom
Multiple R-squared: 0.3525, Adjusted R-squared: 0.346
F-statistic: 53.63 on 2 and 197 DF, p-value: < 2.2e-16

> summary(mod5)

Call:
lm(formula = discharges ~ hosp.phys + hosp.rn + beds)

Residuals:
   Min     1Q Median     3Q    Max
-72.64 -21.70   1.07  19.07  80.24

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.3442    26.3456   4.644 6.26e-06 ***
hosp.phys    -1.0562     0.7598  -1.390 0.166074
hosp.rn      14.3015     3.8042   3.759 0.000225 ***
beds         23.4198     5.7993   4.038 7.72e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.48 on 196 degrees of freedom
Multiple R-squared: 0.3589, Adjusted R-squared: 0.3491
F-statistic: 36.57 on 3 and 196 DF, p-value: < 2.2e-16

Part C: Evaluating VIFs:

# check VIF

mod1a<-lm(hosp.phys~hosp.rn+hosp.emp+beds)

summary(mod1a)

R2a<-0.2054

VIFa<-1/(1-R2a)

mod1b<-lm(hosp.rn~hosp.phys+hosp.emp+beds)

summary(mod1b)

R2b<-0.6667

VIFb<-1/(1-R2b)

mod1c<-lm(hosp.emp~hosp.phys+hosp.rn+beds)

summary(mod1c)

R2c<-0.5944

VIFc<-1/(1-R2c)

mod1d<-lm(beds~hosp.phys+hosp.rn+hosp.emp)

summary(mod1d)

R2d<-0.5602

VIFd<-1/(1-R2d)

> VIFa; VIFb; VIFc; VIFd
[1] 1.258495
[1] 3.0003
[1] 2.465483
[1] 2.273761
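The VIFs above are computed from R-squared values typed in by hand from each auxiliary regression. A possible alternative, sketched below, extracts R-squared from each fit programmatically so that no values need to be copied. The function name vif_manual is hypothetical, and the data frame sim.pred is simulated stand-in data with correlated predictors, not the HRR data.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all of the other predictors
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated stand-in predictors (assumed values, NOT the paper's data)
set.seed(2)
z <- rnorm(200)
sim.pred <- data.frame(hosp.phys = rnorm(200, 27, 5),
                       hosp.rn   = 4 + 0.8 * z + rnorm(200, sd = 0.5),
                       hosp.emp  = 14 + 2.5 * z + rnorm(200, sd = 2),
                       beds      = 2 + 0.4 * z + rnorm(200, sd = 0.3))

vifs <- vif_manual(sim.pred)
vifs

# The formula reproduces a value reported above: R2b = 0.6667 gives VIFb
1 / (1 - 0.6667)   # approximately 3.0003
```

With the real data, the call would presumably take the four predictor columns of the data frame used in this appendix, for example vif_manual(disch[, c("hosp.phys", "hosp.rn", "hosp.emp", "beds")]), assuming those column names.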

Part D: Checking for Influential Observations/Outliers

> #Check for influential observations
> ##cooks.distance(mod2)
> which(cooks.distance(mod2)>1)
named integer(0)
> which(cooks.distance(mod3)>1)
named integer(0)
> # Check for Outliers
> which(abs(rstandard(mod2))>3)
named integer(0)
> which(abs(rstandard(mod3))>3)
named integer(0)
