Transcript of Lecture 15 - Regression Methods
Source: faculty.nps.edu/rdfricke/Survey_Short_Course_Docs/Lecture... · 2013. 3. 27.

Regression Methods for Survey Data

Professor Ron Fricker
Naval Postgraduate School
Monterey, California

3/26/13

Reading: Lohr, chapter 11


Goals for this Lecture

•  Linear regression
   –  Review of linear regression, including assumptions
   –  Coding nominal independent variables
   –  Running multiple regression in JMP and R
   –  For complex sampling designs, must use R or other specialized software

•  Logistic regression
   –  Useful for binary and ordinal dependent variables
   –  Running logistic regression in JMP and R
   –  For complex sampling designs, must use R or other specialized software


Regression in Surveys

•  Useful for modeling responses to survey question(s) as a function of external data and/or other survey data
   –  Sometimes easier/more efficient than high-dimensional multi-way tables
   –  Useful for summarizing how changes in the independent variables (the Xs) affect the dependent variable (the Y)


Simple Linear Regression

•  General expression for a simple linear regression model:

   Yi = β0 + β1xi + εi

   –  β0 and β1 are model parameters
   –  εi is the error or noise term

•  Can think of it as modeling the expected value of Y for a given or particular value of x:

   E(Yi | xi) = β0 + β1xi

   –  So the result is a model of the mean of Y as a linear function of some independent variable


Linear Regression Assumptions

•  Error terms often assumed independent observations from a N(0, σ²) distribution
   –  This encapsulates all the assumptions inherent in linear regression:
      »  The error terms are independent and identically distributed
      »  The dependent variable is normally distributed: Yi ~ N(β0 + β1xi, σ²)
•  Hence, it is not appropriate to apply this methodology when the Y is discrete
   –  So, Y can't be based on closed-ended survey questions such as a 5-point Likert scale


Applying Linear Models to Survey Data

•  At a minimum, the Ys must be continuous, and it must be possible to transform them to be symmetric
   –  Combinations of discrete survey data may be approximately normally distributed
   –  Factor analysis and principal components may be useful in this regard
•  Given some data, we will estimate the parameters with coefficients:

   Ê(Y | xi) ≡ ŷ = β̂0 + β̂1xi

   where ŷ is the predicted value of y


Linear Regression with Categorical Independent Variables

•  Because survey data are often discrete, we often have discrete independent variables in the model
   –  For example, how do you put "male" and "female" categories in a regression equation?
   –  Code them as indicator (dummy) variables
•  Two ways of making dummy variables:
   –  Male = 1, female = 0
      »  Default in many programs
      »  Easy to code and interpret
   –  Male = 1, female = -1
      »  Default in JMP for nominal variables
      »  Some consider results easier to interpret
•  Consider a model with only gender as the independent variable…


Example: Calculus Grade as a Function of Gender

•  0/1 coding: compares calc_grade to a baseline group (here, females)
   Regression equation:
      females: y = 80.41 - 0.48 × 0
      males:   y = 80.41 - 0.48 × 1
•  -1/1 coding: compares each group to the overall average
   Regression equation:
      females: y = 80.18 + 0.24 × 1
      males:   y = 80.18 + 0.24 × (-1)
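The two codings can be checked in a minimal R sketch. The grades below are made up, chosen only so that each group's mean mimics the slide (the slide's intercept of 80.18 reflects rounding in the real data; these values give 80.17):

```r
# Hypothetical data: constant grades within each group, so the fits are exact
grades <- c(80.41, 80.41, 79.93, 79.93)
sex    <- factor(c("F", "F", "M", "M"))

# 0/1 (treatment) coding: R's default; baseline level is "F"
m01 <- lm(grades ~ sex)
coef(m01)   # intercept = female mean (80.41); sexM = male - female (-0.48)

# -1/1 (sum) coding, as JMP uses for nominal variables
m11 <- lm(grades ~ sex, contrasts = list(sex = "contr.sum"))
coef(m11)   # intercept = grand mean (80.17); sex1 = female - grand mean (0.24)
```

Both models reproduce the same group means; only the interpretation of the coefficients changes.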


How to Code k Levels

•  Two coding schemes: 0/1 and 1/0/-1
   –  Use k-1 indicator variables
•  E.g., a three-level variable with levels "a," "b," and "c"
•  0/1: use one of the levels as a baseline
   –  Var_a = 1 if level = a, 0 otherwise
   –  Var_b = 1 if level = b, 0 otherwise
   –  Var_c: exclude as redundant (baseline)
•  Example:
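In R, the k-1 indicator columns can be inspected directly with model.matrix(). A small sketch with a hypothetical three-level factor:

```r
# Hypothetical three-level factor with levels "a", "b", "c"
x <- factor(c("a", "b", "c", "a"))

# 0/1 (treatment) coding: k - 1 = 2 indicator columns; "a" is the baseline
M <- model.matrix(~ x)
M   # columns: (Intercept), xb, xc

# 1/0/-1 (sum) coding: the last level ("c") is coded -1 on every column
S <- model.matrix(~ x, contrasts.arg = list(x = "contr.sum"))
S   # columns: (Intercept), x1, x2
```

Either way, only two columns (plus the intercept) enter the model; the excluded level is recovered from the others.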


How to Code k Levels (cont'd)

•  1/0/-1: use the mean as a baseline
   –  Variable[a] = 1 if variable = a, 0 if variable = b, -1 if variable = c
   –  Variable[b] = 1 if variable = b, 0 if variable = a, -1 if variable = c
   –  Variable[c]: exclude as redundant
•  Example


Fitting Simple Linear Regression Models

•  In JMP, can use either Analyze > Fit Y by X or Analyze > Fit Model
   –  Fill in Y with the (continuous) dependent variable
   –  Put X in the model by highlighting it and then clicking X or "Add"
   –  Click "Run Model" when done
•  In R, use the lm() function in base R
   –  Syntax: lm(dep_var ~ indep_var, data = data.frame)
   –  Best to assign the results to an object and then look at them using the summary() function
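The lm() workflow can be sketched end to end on simulated data (the data frame and variable names here are invented for illustration):

```r
set.seed(1)

# Hypothetical data standing in for a continuous survey response and a predictor
dat <- data.frame(x = 1:50)
dat$y <- 10 + 2 * dat$x + rnorm(50, sd = 3)   # true intercept 10, slope 2

# Fit, assign the result to an object, then inspect it with summary()
fit <- lm(y ~ x, data = dat)
summary(fit)
coef(fit)   # estimates should land near the true intercept (10) and slope (2)
```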


Example: New Student Survey

•  Define a Y variable that represents the complete in-processing experience:
   –  "In-processing Total" = sum(Q2a-Q2i)


New Student Survey (continued)

•  The dependent variable looks roughly continuous and approximately normally distributed:

[Figure: Normal Q-Q plot of IP_total (sample quantiles vs. theoretical quantiles); summary statistics from JMP shown on the slide]

Summary statistics from R:

> summary(new_student$IP_total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  10.00   27.00   32.00   31.29   36.00   45.00      16


Example: Simple Linear Regression of Total Satisfaction on School in R

> model_results <- lm(IP_total ~ CurricNumber, data = new_student)
> summary(model_results)

Call:
lm(formula = IP_total ~ CurricNumber, data = new_student)

Residuals:
     Min       1Q   Median       3Q      Max
-22.3208  -4.1255   0.4348   4.7273  16.0217

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         33.565      1.465  22.907   <2e-16 ***
CurricNumberGSEAS   -4.587      1.795  -2.556   0.0116 *
CurricNumberGSOIS   -1.244      1.755  -0.709   0.4793
CurricNumberSIGS    -2.292      1.909  -1.201   0.2316
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.027 on 151 degrees of freedom
  (16 observations deleted due to missingness)
Multiple R-squared: 0.05347, Adjusted R-squared: 0.03466
F-statistic: 2.843 on 3 and 151 DF, p-value: 0.03976

Note the three 0/1 indicators for the 4 schools (GSBPP is the baseline).
Result: GSBPP students are most satisfied, with a mean score of 33.565; GSEAS students are least satisfied, with a mean score of 33.565 - 4.587 = 28.978.


From Simple to Multiple Regression

•  Simple linear regression: one Y variable and one x variable:

   Yi = β0 + β1xi + εi

•  Multiple regression: one Y variable and multiple x variables:

   Yi = β0 + β1x1i + β2x2i + … + βkxki + εi

   –  Like simple regression, we're trying to model how Y depends on x
   –  Only now we are building models where Y may depend on many xs


Multiple Regression in JMP (Assuming Simple Random Sampling)

•  In JMP, use Analyze > Fit Model to do multiple regression
   –  Fill in Y with the (continuous) dependent variable
   –  Put Xs in the model by highlighting them and then clicking "Add"
      »  Use "Remove" to take out Xs
   –  Click "Run Model" when done
•  In R, use the same function and syntax as simple linear regression, just with more terms:
   –  lm(dep_var ~ indep_var1 + indep_var2 + indep_var3, data = data.frame)


Example: Revisiting Satisfaction with In-processing (1)

•  Is GSEAS worst at in-processing? Or are CIVs and USAF least happy?
•  Note this output differs from R's because of the different coding
•  But the solution is the same: GSEAS = 31.534 - 2.556 = 28.978


Satisfaction with In-processing (2)

•  Or are Singaporeans unhappy? Making a new variable…


Satisfaction with In-processing (3)

•  Final model?

[Figure: Normal quantile plot of the residuals]


Equivalent Model in R (Assuming Simple Random Sampling)

> summary(model_results)

Call:
lm(formula = IP_total ~ Type_Student, data = new_student)

Residuals:
    Min      1Q  Median      3Q     Max
-22.038  -4.290   0.962   4.000  12.962

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                   16.500      4.810   3.430 0.000782 ***
Type_StudentOther FORNAT      16.500      4.999   3.301 0.001209 **
Type_StudentSingapore         12.042      5.007   2.405 0.017400 *
Type_StudentUS Air Force       7.833      6.210   1.261 0.209140
Type_StudentUS Army           17.167      5.121   3.352 0.001017 **
Type_StudentUS Marine Corps   11.786      5.454   2.161 0.032316 *
Type_StudentUS Navy           15.538      4.871   3.190 0.001736 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.803 on 148 degrees of freedom
  (16 observations deleted due to missingness)
Multiple R-squared: 0.1306, Adjusted R-squared: 0.09539
F-statistic: 3.707 on 6 and 148 DF, p-value: 0.001838


Now, Regression with Complex Sampling

•  Use the survey package and the svyglm() function
   –  Don't specify the family option for linear regression
   –  As with other "svy" functions, you must first specify the sampling design with the svydesign() function and then pass it to the svyglm() function
•  Open question: when fitting regression models, is it necessary to account for the sampling design?
   –  Not always clear: with some data it may be possible to estimate relationships in the population without the design and/or weights
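The two-step workflow above can be sketched as follows. Everything in this example is invented for illustration (the strata, weights, and variables are made up), and it assumes the survey package is installed:

```r
library(survey)   # Thomas Lumley's survey package

set.seed(2)
# Made-up stratified sample: two strata with different sampling weights
dat <- data.frame(
  stratum = rep(c("A", "B"), each = 50),
  w       = rep(c(10, 2), each = 50),   # hypothetical sampling weights
  x       = rnorm(100)
)
dat$y <- 5 + 3 * dat$x + rnorm(100)

# Step 1: describe the design; step 2: pass it to svyglm().
# Omitting the family option gives a linear regression.
des <- svydesign(ids = ~1, strata = ~stratum, weights = ~w, data = dat)
fit <- svyglm(y ~ x, design = des)
summary(fit)   # coefficients with design-based standard errors
```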


Logistic Regression

•  Logistic regression
   –  The response (Y) is binary, representing whether an event occurred or not
   –  The model estimates the probability of event occurrence:

   ln( pi / (1 - pi) ) = β0 + β1x1i + β2x2i + … + βkxki

•  In surveys, useful for modeling:
   –  The probability a respondent says "yes" (or "no")
      »  Can also dichotomize other questions
   –  The probability a respondent is in a (binary) class


Three Numbers for the Same Idea

•  Probability (p)
   –  Example: Pr(Congress passes FY14 budget) = 1/3 = 0.333
   –  Easily interpretable: a number between 0 and 1
•  Odds: p/(1-p)
   –  Example: the odds of an FY14 budget are 1/2 = 0.50
   –  Still interpretable, but not as easy as probability: any number > 0
•  Log odds: ln(p/(1-p))
   –  Very difficult to interpret: any number from −∞ to +∞
   –  The log odds is often called the "logit"
   –  In spite of the interpretation difficulty, the model is very useful (and the logit can be transformed to get the probabilities)


Where Logistic Regression Fits

                         Independent or Predictor Variable
Dependent or Response    Continuous            Categorical
---------------------    -------------------   --------------------------------
Continuous               Linear regression     Linear reg. w/ dummy variables
Categorical              Logistic regression   Logistic reg. w/ dummy variables


Why Logistic Regression?

•  Some reasons:
   –  Estimates of p are bounded between 0 and 1
   –  The resulting "S" curve fits many observed phenomena
   –  The model follows the same general principles as linear regression
•  In terms of surveys, discrete data are very common
   –  If a variable is not already binary, it is often relatively easy to convert it into a binary result (collapse appropriately across some categories)


Estimating the Parameters

•  The βs are estimated via maximum likelihood
•  Given the estimated βs, the probability is calculated as:

   pi = exp(β0 + β1x1i + β2x2i + … + βkxki) / [1 + exp(β0 + β1x1i + β2x2i + … + βkxki)]

•  In JMP, after Fit Model, red triangle > Save Probability Formula
   –  Creates a new column containing the estimated probabilities, one for each observation
•  In R, use the predict() function on an object that contains the output of svyglm()


Fitting Logistic Regression Models (Assuming Simple Random Sampling)

•  In JMP, fit much like multiple regression: Analyze > Fit Model
   –  Fill in Y with the nominal binary dependent variable
   –  Put Xs in the model by highlighting them and then clicking "Add"
      »  Use "Remove" to take out Xs
   –  Click "Run Model" when done
•  In R, use the svyglm() function with the option family=quasibinomial()


Example: Logistic Regression in JMP for the New Student Survey Q1

•  Dichotomize Q1 into "satisfied" (4 or 5) and "not satisfied" (1, 2, or 3)
•  Model satisfied on Gender and Type Student


Same Example in R

> new_student$satisfied_ind <- as.numeric(new_student$X1 > 3)
> model_results <- glm(satisfied_ind ~ Sex + Type_Student, data = new_student, family = binomial)
> summary(model_results)

Call:
glm(formula = satisfied_ind ~ Sex + Type_Student, family = binomial, data = new_student)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0435   0.3771   0.5144   0.5780   1.3350

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                  -1.3364     1.5411  -0.867    0.386
SexM                          1.3364     0.6122   2.183    0.029 *
Type_StudentOther FORNAT      1.7047     1.5151   1.125    0.261
Type_StudentSingapore        -0.3630     1.4726  -0.247    0.805
Type_StudentUS Air Force     -0.3226     1.9043  -0.169    0.865
Type_StudentUS Army           1.5056     1.5600   0.965    0.334
Type_StudentUS Marine Corps   2.6081     1.8284   1.426    0.154
Type_StudentUS Navy           1.9556     1.4565   1.343    0.179
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 178.83 on 161 degrees of freedom
Residual deviance: 151.10 on 154 degrees of freedom
  (9 observations deleted due to missingness)
AIC: 167.1

Number of Fisher Scoring iterations: 4


Interpreting the Output (1)

•  Exponentiating both sides of

   ln( pi / (1 - pi) ) = β0 + β1X1i + β2X2i + … + βkXki

   gives

   pi / (1 - pi) = exp(β0) × exp(β1X1i) × … × exp(βkXki)

•  Note that the left side is the odds for the ith observation
   –  A useful quantity, and one that is easy to associate the betas with when using a 0/1 coding scheme


Interpreting the Output (2)

•  For example, in the "baseline group" all the indicators are 0, so

   pi / (1 - pi) = exp(β0)

   –  Thus, exp(β0) is the "baseline group" odds; here the baseline is a civilian female: odds = exp(-1.3364) = 0.26
   –  So, the odds a female civilian is satisfied are roughly 1 to 4
   –  In terms of probability:

   pi = exp(β0) / (1 + exp(β0)) = 0.26 / 1.26 = 0.206


Interpreting the Output (3)

•  Other groups are excursions from the baseline, using the appropriate values of the independent variables
•  For example, consider Navy males:

   pi / (1 - pi) = exp(-1.3364) × exp(1.3364) × exp(1.9556) = 7.07

   –  The odds a male US Navy officer is satisfied are about 7 to 1
   –  In terms of probability:

   pi = exp(β0 + βMale + βNavy) / (1 + exp(β0 + βMale + βNavy)) = exp(1.9556) / (1 + exp(1.9556)) = 0.876
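The odds and probability arithmetic on the last two slides can be checked directly in R; the coefficients below are copied from the logistic regression output above:

```r
# Coefficients from the fitted model (copied from the summary output)
b0     <- -1.3364   # intercept: the civilian-female baseline
b_male <-  1.3364   # SexM
b_navy <-  1.9556   # Type_StudentUS Navy

# Baseline odds and probability
# (the slide's 0.206 comes from using the rounded odds 0.26)
odds_base <- exp(b0)                       # about 0.263
p_base    <- plogis(b0)                    # about 0.208

# Male US Navy: add the relevant coefficients on the log-odds scale
odds_navy <- exp(b0 + b_male + b_navy)     # about 7.07, i.e., roughly 7 to 1
p_navy    <- plogis(b0 + b_male + b_navy)  # about 0.876
```

plogis() is base R's logistic CDF, i.e., exactly the exp(η)/(1+exp(η)) transformation used on the slide.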


Compare Model Output to Raw Data

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                  -1.3364     1.5411  -0.867    0.386
SexM                          1.3364     0.6122   2.183    0.029 *
Type_StudentOther FORNAT      1.7047     1.5151   1.125    0.261
Type_StudentSingapore        -0.3630     1.4726  -0.247    0.805
Type_StudentUS Air Force     -0.3226     1.9043  -0.169    0.865
Type_StudentUS Army           1.5056     1.5600   0.965    0.334
Type_StudentUS Marine Corps   2.6081     1.8284   1.426    0.154
Type_StudentUS Navy           1.9556     1.4565   1.343    0.179


Ordinal Logistic Regression

•  An extension of binary logistic regression when there are k > 2 dependent variable categories
   –  For the ordered responses 1 ≤ j ≤ k - 1, the model is:

   ln( Pr(Yi ≤ j) / (1 - Pr(Yi ≤ j)) ) = αj − β1X1i − β2X2i − … − βkXki

   –  Note that the betas are constant across all the js; the only differences are the alpha intercept terms
•  This is a proportional odds model
   –  In R, use polr() in the MASS library with SRS and svyolr() with complex designs


An Example: New Student Survey Modeling All Q1 Likert Scale Levels

> model_results <- polr(as.factor(X1) ~ as.factor(Type_Student), data = new_student)
> summary(model_results)

Call:
polr(formula = as.factor(X1) ~ as.factor(Type_Student), data = new_student)

Coefficients:
                                         Value Std. Error t value
as.factor(Type_Student)Other FORNAT     1.4633      1.282  1.1416
as.factor(Type_Student)Singapore       -0.6413      1.263 -0.5076
as.factor(Type_Student)US Air Force    -0.3956      1.722 -0.2298
as.factor(Type_Student)US Army          1.3211      1.330  0.9935
as.factor(Type_Student)US Marine Corps  1.7159      1.426  1.2032
as.factor(Type_Student)US Navy          1.4078      1.238  1.1374

Intercepts:
    Value   Std. Error t value
1|2 -3.7123  1.3969    -2.6575
2|3 -1.6041  1.2281    -1.3062
3|4 -0.2294  1.2113    -0.1894
4|5  2.8807  1.2378     2.3272

Residual Deviance: 343.1792
AIC: 363.1792
(9 observations deleted due to missingness)


Interpreting the Output (1)

•  It is more complicated to understand this type of model
   –  Most relevant are the coefficients
   –  Each is an estimate of how much the log-odds increases with a one-unit change
•  Can calculate the probabilities as:

   Pr(Yi ≤ j) = exp(αj − β1x1i − … − βkxki) / [1 + exp(αj − β1x1i − … − βkxki)]

   –  The minus signs are correct: this is the way the polr() function parameterizes the model


Interpreting the Output (2)

•  So, for Navy students:

   Pr(Yi ≤ 1) = exp(-3.7123 - 1.4078) / (1 + exp(-3.7123 - 1.4078)) = 0.006

•  Similarly, Pr(Yi ≤ 2) = 0.047, Pr(Yi ≤ 3) = 0.163, and Pr(Yi ≤ 4) = 0.813
•  Thus, the model estimates that

   Pr(Yi = j | Naval Officer) = 0.006 (j=1), 0.041 (j=2), 0.116 (j=3), 0.650 (j=4), 0.187 (j=5)
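These cumulative and cell probabilities can be reproduced in a few lines of R, using the intercepts and the US Navy coefficient copied from the polr output above:

```r
# Intercepts (alpha_j) and the US Navy coefficient from the polr output
alphas <- c(-3.7123, -1.6041, -0.2294, 2.8807)   # cut-points 1|2, 2|3, 3|4, 4|5
b_navy <- 1.4078

# polr's parameterization: Pr(Y <= j) = logistic(alpha_j - x'beta)
cum_navy  <- plogis(alphas - b_navy)
round(cum_navy, 3)    # matches the slide: 0.006 0.047 0.163 0.813

# Cell probabilities are successive differences of the cumulative ones
cell_navy <- diff(c(0, cum_navy, 1))
round(cell_navy, 3)   # close to the slide's 0.006, 0.041, 0.116, 0.650, 0.187
                      # (tiny differences come from rounding on the slide)
```

The same two lines with b_navy replaced by the Singapore coefficient (-0.6413) reproduce the next slide's numbers.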


Interpreting the Output (3)

•  So, for Singaporean students:

   Pr(Yi ≤ 1) = exp(-3.7123 + 0.6413) / (1 + exp(-3.7123 + 0.6413)) = 0.044

•  Similarly, Pr(Yi ≤ 2) = 0.276, Pr(Yi ≤ 3) = 0.602, and Pr(Yi ≤ 4) = 0.971
•  Thus, the model estimates that

   Pr(Yi = j | Singapore Officer) = 0.044 (j=1), 0.232 (j=2), 0.326 (j=3), 0.369 (j=4), 0.029 (j=5)


Using R with Complex Sampling

•  As we've seen, the key is the survey package by Thomas Lumley
   –  See http://faculty.washington.edu/tlumley/survey/
   –  Other useful "svy" functions include svytable(), svyboxplot(), svyby(), svycdf(), svycoplot(), and svycoxph()
   –  Good reference text: Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley Series in Survey Methodology, John Wiley and Sons.


Other Software for Analyzing Surveys with Complex Sampling Designs

•  Stata
•  SAS
•  SPSS
•  SUDAAN

✓ Not Excel or JMP


What We Have Just Learned

•  Linear regression
   –  Review of linear regression, including assumptions
   –  Coding nominal independent variables
   –  Running multiple regression in JMP and R
   –  For complex sampling designs, must use R or other specialized software

•  Logistic regression
   –  Useful for binary and ordinal dependent variables
   –  Running logistic regression in JMP and R
   –  For complex sampling designs, must use R or other specialized software