Structural equation models: opportunities, risks and discussion of some applications in the travel behavior research domain

Marco Diana, Politecnico di Torino (I)

University of Maryland, College Park, 29th November 2014

Structure of the seminar

1. Structural equation models are grounded on two multivariate statistical analysis techniques:
   • Multiple regression
   • Principal component and factor analysis

2. Basic notions on structural equation models (SEM)

3. Use of SEM: needed input, range of output, most commonplace issues in travel behavior research

4. Available software packages

5. Discussion on some applications in the study of mobility behaviours

Marco Diana, Structural equations models – University of Maryland, College Park, 29/11/2014

Measurement scales (Stevens, 1946)

Metric (quantitative) variables:
• Ratio scales (e.g., body weight, road length)
• Interval scales (e.g., temperature)

Nonmetric (qualitative) variables:
• Ordinal scales (e.g., degree of satisfaction)
• Categorical scales (e.g., sex)

Univariate and bivariate analyses

One random variable:
• Univariate distributions and related moments (mean, variance…)

Two random variables:
• Bivariate, joint and conditional distributions and related moments
• Interdependence analyses => correlations (Pearson, Spearman…), contingency tables
• Dependence analyses => linear regression, ANOVA

Multivariate statistical analysis techniques

[Figure from Hair et al. (1998)]

Multiple linear regression (1/2)

Operating instructions:

1. Dependence technique => need to identify x and y
2. A unique linear relationship
3. Only one metric dependent variable (y)
4. Two or more linear independent variables (x1, x2, …), either metric or binary

Objective:

Find the values of the parameters a0, a1, a2, … in

y = a0 + a1x1 + a2x2 + …

such that the sum of squared errors (differences between the two sides of the equation) is minimised (OLS).

[Path diagram: x1 and x2 point to y, with coefficients a1 and a2]
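As a minimal sketch of the OLS objective stated on this slide, the following NumPy snippet estimates a0, a1 and a2 on simulated data. The "true" parameter values, the sample size and the noise level are assumptions made only for the illustration; they do not come from the seminar.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Hypothetical "true" parameters a0, a1, a2 (illustration only)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

# Design matrix with a column of ones for the intercept a0
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution: minimises the sum of squared errors ||y - X a||^2
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = float(np.sum((y - X @ a_hat) ** 2))

print("estimates (a0, a1, a2):", np.round(a_hat, 2))
print("sum of squared errors:", round(sse, 2))
```

Any other choice of parameters gives a larger sum of squared errors on the same data, which is what "minimised" means here.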

Multiple linear regression (2/2)

Assumptions:

1. Linear relationship
2. Independence of errors
3. Normal distribution of errors
4. Constant variance of errors (homoskedasticity)

NB1: multicollinearity of x variables is «slightly less problematic» than in some discrete choice models

NB2: measurement errors are not distinguishable

SEM can be helpful in both cases!

Factor & Principal Components Analysis

Operating instructions:

1. Interdependence analysis => «We only have x»
2. Metric variables (possible extensions)

Objective:
• Analyse the correlation matrix of the variables, looking for clusters of variables that are more correlated with each other and less correlated with the others
• Find latent variables (factors, constructs, components, dimensions) from such groups that can therefore «synthesise» or «represent» the observed x variables

Common, specific and total variance
• Both methods are based on the study of the variance in the data
• The common variance is the variance that is shared among all x variables
• The specific variance is associated only with a specific variable xi (including the part due to measurement errors)
• The total variance is the sum of the two
• PCA: the input is the correlation matrix => this method considers the total variance
• FA: the main diagonal of the correlation matrix contains an estimate of the common variance => the method considers only the common variance

Principal component analysis (Pearson, 1901)

Transformation of p observed variables x into p latent variables t, linear combinations of x

i.e., find the values of the coefficients a11, a12, … in

t1 = a11x1 + a12x2 + … + a1pxp
t2 = a21x1 + a22x2 + … + a2pxp
…
tp = ap1x1 + ap2x2 + … + appxp

… such that:
• The components t1 … tp are sorted by decreasing variance
• The components ti are mutually uncorrelated
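The two properties of the components (decreasing variance, no correlation between them) can be checked numerically. The sketch below is one common way to obtain the coefficients, through an eigen-decomposition of the correlation matrix; the simulated data and the mixing matrix are assumptions for the illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 4
# Simulated correlated data (illustrative only): 4 variables driven by 2 sources
sources = rng.normal(size=(n, 2))
mixing = np.array([[1.0, 0.0], [0.8, 0.3], [0.2, 1.0], [0.0, 0.9]])
X = sources @ mixing.T + 0.3 * rng.normal(size=(n, p))

# Standardise, so that the analysis is based on the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

# Eigen-decomposition of R; eigh returns eigenvalues in ascending order
eigval, eigvec = np.linalg.eigh(R)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Row i of A holds the coefficients a_i1 ... a_ip of component t_i
A = eigvec.T
T = Z @ A.T  # component scores t_1 ... t_p

comp_cov = np.cov(T, rowvar=False)
print("component variances:", np.round(np.diag(comp_cov), 3))
print("share of total variance:", np.round(eigval / p, 3))
```

The variances of the components equal the eigenvalues of the correlation matrix and sum to p, the total variance of the standardised variables.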

Factor analysis (Spearman, 1904)

Regression of p observed variables x on k<p latent variables $\xi$

i.e., find the values of the loadings $\lambda_{11}, \lambda_{21}, \dots$ in

$x_1 = \lambda_{11}\xi_1 + \lambda_{12}\xi_2 + \dots + \lambda_{1k}\xi_k + \delta_1$
$x_2 = \lambda_{21}\xi_1 + \lambda_{22}\xi_2 + \dots + \lambda_{2k}\xi_k + \delta_2$
…
$x_p = \lambda_{p1}\xi_1 + \lambda_{p2}\xi_2 + \dots + \lambda_{pk}\xi_k + \delta_p$

… such that the factors can explain the common variance among the x variables

Unlike PCA, here we assume that factors actually exist (more formally, the covariance matrix of x variables must have some properties)
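To make the factor model concrete, the sketch below simulates data from it and compares the sample covariance matrix with the one implied by the loadings. It does not estimate the loadings; it only illustrates the structure that a factor analysis assumes. The loading values, the number of factors and the assumption of uncorrelated, unit-variance factors are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 200_000, 6, 2

# Hypothetical loadings (p x k): x1-x3 load on factor 1, x4-x6 on factor 2
Lam = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                [0.0, 0.9], [0.0, 0.7], [0.0, 0.5]])
# Specific variances chosen so that every x has unit total variance
theta = 1.0 - np.sum(Lam**2, axis=1)

xi = rng.normal(size=(n, k))                      # uncorrelated latent factors
delta = rng.normal(size=(n, p)) * np.sqrt(theta)  # specific + error part
X = xi @ Lam.T + delta

# Covariance implied by the factor model: Lam Lam' + diag(theta)
Sigma_model = Lam @ Lam.T + np.diag(theta)
Sigma_sample = np.cov(X, rowvar=False)

communalities = np.sum(Lam**2, axis=1)  # common variance of each x
print("communalities:", np.round(communalities, 2))
print("max abs difference:", np.abs(Sigma_sample - Sigma_model).max())
```

The off-diagonal entries of the implied matrix depend only on the loadings, which is the sense in which the factors explain the common variance.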

Common requirements and results
• Both PCA and FA give meaningful results iff the x variables are at least partly correlated => multicollinearity is desirable!
• Sample size: at least 5 observations per observed variable x, and in any case at least 100
• We consider the first k<p components of a PCA or we look for k<p factors through a FA => methods to choose k are needed
• If the common variance is a substantial part of the total variance, the two methods give similar results

PCA: scope of use
• Aim: to represent data variability with the minimum number of latent variables
• Theoretical assumptions: none, we simply want to summarise the variables while trying to preserve the patterns within the dataset
• Data characteristics: the specific variance and the variance due to measurement errors are a negligible proportion of the total variance

[Diagram: components t1 and t2 linked to the observed variables x1 … x7 through the coefficients a11 … a27]

Factor analysis: scope of use
• Aim: identifying the dimensions, or latent factors, implied by the set of x variables being considered
• Theoretical assumptions: latent factors do exist, on the basis of a theory that allows the interpretation of the observed correlations
• Data characteristics: specific and measurement error variances are not negligible, therefore only the common variance is considered

[Diagram: Factor 1 and Factor 2 linked to the observed variables x1 … x7 through factor loadings]

Exploratory vs confirmatory analysis
• The factor analysis we introduced is exploratory (EFA): the number of latent factors and their relationships with the observed variables are found a posteriori, through the analysis itself.
• If we have a well-founded theory that is empirically supported by previous EFAs, it is better to define the factors and their relations with the observed variables a priori, computing the loadings $\lambda_{ij}$ and checking the model «goodness of fit» => confirmatory technique (CFA)

SEM can be used to implement a CFA!

Combining regression and factor analysis

Examples of combinations of the two methods:
• Regression where some variables are latent
• Higher-order factor analyses
• Chained regressions: path analysis (Wright, 1934)

[Path diagrams; node labels in order of appearance: Education, Age, Children <14, Trip rates, Income, Safety, Reliability, Cognitive, Car attitudes, Mobility, Systematic trips, Holidays, VFR, Transfers, Income, Rootedness, Education, Nationality, Freedom, Well-being, Affective]

SEM – Structural equation models
• It would be possible to estimate the previous models by decomposing them and implementing n distinct regressions and/or factor analyses
• However, this would be an inefficient use of data

Structural equation models (Jöreskog et al., 1973)
• Regression and FA are generalised and combined, through simultaneous estimation of all parameters:
  • Further results and «diagnostic tools»
  • Further applications compared to the previous examples

LISREL notation of a SEM model

Measurement model:

$x = \Lambda_x \xi + \delta$
$y = \Lambda_y \eta + \varepsilon$

where x and y are the exogenous and endogenous observed variables, $\xi$ and $\eta$ the latent ones, $\Lambda_x$ and $\Lambda_y$ the loading matrices, and $\delta$ and $\varepsilon$ the error terms

Structural model:

$\eta = B\eta + \Gamma\xi + \zeta$

where $B$ and $\Gamma$ are the structural coefficient matrices and $\zeta$ the error terms

The two models are jointly estimated.
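Joint estimation works by comparing the observed covariance matrix with the one implied by the model parameters. The sketch below computes that implied matrix from the measurement and structural equations in LISREL notation, assuming that error terms are uncorrelated with the latent variables and with each other across equations. The function name, the argument names and all numerical values are hypothetical; `Phi`, `Psi`, `Td` and `Te` stand for the covariance matrices of the exogenous latent variables, the structural errors and the two sets of measurement errors.

```python
import numpy as np

def implied_covariance(Lx, Ly, B, G, Phi, Psi, Td, Te):
    """Model-implied covariance matrix of the stacked vector (y, x)."""
    inv = np.linalg.inv(np.eye(B.shape[0]) - B)      # (I - B)^-1
    cov_eta = inv @ (G @ Phi @ G.T + Psi) @ inv.T    # covariance of eta
    Syy = Ly @ cov_eta @ Ly.T + Te
    Syx = Ly @ inv @ G @ Phi @ Lx.T
    Sxx = Lx @ Phi @ Lx.T + Td
    return np.block([[Syy, Syx], [Syx.T, Sxx]])

# Hypothetical model: one exogenous construct (2 indicators) and
# two endogenous constructs (2 indicators each), with eta1 -> eta2
Lx = np.array([[1.0], [0.9]])
Ly = np.array([[1.0, 0.0], [0.8, 0.0], [0.0, 1.0], [0.0, 0.7]])
B = np.array([[0.0, 0.0], [0.5, 0.0]])
G = np.array([[0.6], [0.3]])
Phi = np.array([[1.0]])
Psi = np.diag([0.5, 0.4])
Td = np.diag([0.2, 0.3])
Te = np.diag([0.3, 0.4, 0.3, 0.5])

Sigma = implied_covariance(Lx, Ly, B, G, Phi, Psi, Td, Te)
print(np.round(Sigma, 3))
```

An estimator such as maximum likelihood then searches for the free parameter values that bring this matrix as close as possible to the sample covariance matrix.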

Model path diagram

[Figure: example (Hair, 1998)]

Complete model

[Figure: example, cont. (Hair, 1998)]

Parameters that can be estimated
• Structural coefficients (regression coefficients)
• Factor loadings, both of exogenous and endogenous variables
• Correlations between endogenous constructs (to be avoided!) or between exogenous constructs (obviously not between endogenous and exogenous ones)
• Variance of the measurement error of the observed variables (endogenous and exogenous)
• Covariance of the measurement errors of the observed variables (endogenous and exogenous)

Input and assumptions
• Confirmatory technique => the analyst chooses which parameters should be estimated
• Input: covariance or correlation matrix of the observed variables, as in factor analysis:
  • Covariances: total effects are found, comparison between different models/populations/samples (transferability)
  • Correlations: understanding patterns among variables and their relative importance
• Assumptions:
  • From regression: linear relationship, multivariate normal distributions
  • From sampling theory: random sample, independent observations

Data requirement and estimation
• Sample size:
  • At least 100-150 observations
  • 10 observations per parameter, 15 when non-normality is detected
  • Overfitting when we use more than 400 observations (too sensitive model)
• Estimation methods:
  • Parametric: maximum likelihood (ML)
  • Non parametric: ADF-WLS => 1000 observations are needed
  • Resampling: bootstrap, jackknife
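As an illustration of the resampling idea, the sketch below applies a nonparametric bootstrap to the slope of a simple regression with non-normal errors. It is only an analogy: in a SEM the same loop would re-estimate the whole model on each resample. The data, the number of replications and the helper name `slope` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
# Heavy-tailed (non-normal) errors, simulated for the illustration
y = 0.5 * x + rng.standard_t(df=3, size=n)

def slope(xs, ys):
    """OLS slope of ys on xs (hypothetical helper)."""
    return np.cov(xs, ys, ddof=1)[0, 1] / np.var(xs, ddof=1)

n_boot = 2000
boot = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)  # resample observations with replacement
    boot[b] = slope(x[idx], y[idx])

estimate = slope(x, y)
se_boot = boot.std(ddof=1)                # bootstrap standard error
ci = np.percentile(boot, [2.5, 97.5])     # percentile confidence interval
print("estimate:", round(estimate, 3), "bootstrap s.e.:", round(se_boot, 3))
print("95% percentile interval:", np.round(ci, 3))
```

The standard error and the interval come from the empirical distribution of the re-estimated parameter, so no normality assumption is used.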

Common problems in SEM
• A single symptom could be due to different problems: estimation process not converging, variances<0, loadings>1, «mysterious» error messages…
  • Unsound theoretical basis, specification errors
  • Model identification: degrees of freedom, scales and # of indicators per construct, rank and order conditions…
  • Non-normality when using a parametric estimation method
  • Algebraic properties of the input matrix (positive definite…)
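The degrees-of-freedom check mentioned under model identification is simple arithmetic: the number of distinct variances and covariances among the observed variables minus the number of free parameters. The sketch below uses a hypothetical function name and a hypothetical two-factor CFA as the worked case. A non-negative result is a necessary condition for identification, not a sufficient one.

```python
def sem_degrees_of_freedom(n_observed: int, n_free_params: int) -> int:
    """Distinct elements of the covariance matrix minus free parameters."""
    distinct_moments = n_observed * (n_observed + 1) // 2
    return distinct_moments - n_free_params

# Hypothetical CFA: 2 factors with 3 indicators each, factor variances fixed to 1
n_loadings = 6
n_error_variances = 6
n_factor_correlations = 1
free = n_loadings + n_error_variances + n_factor_correlations

df = sem_degrees_of_freedom(6, free)
print("distinct moments:", 6 * 7 // 2, "free parameters:", free, "df:", df)
```

With 6 observed variables there are 21 distinct moments, so this model with 13 free parameters has 8 degrees of freedom.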

Goodness of fit measures in SEM
• Problems and symptoms are not univocally linked; the same goes for fit measures:
  • Absolute fit
  • Parsimonious fit
  • Incremental fit
  • Structural model fit (sign and significance of coefficients, rho-squared)
  • Measurement model fit (unidimensionality of constructs, Cronbach’s alpha)
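Cronbach's alpha, listed above as a measurement model check, can be computed directly from the item scores with the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). The sketch below uses simulated items; the number of items, the noise level and the function name are assumptions for the illustration.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (observations x items) score matrix."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return float(k / (k - 1) * (1.0 - item_var.sum() / total_var))

rng = np.random.default_rng(5)
n, k = 500, 4
# Four hypothetical indicators of the same construct: latent score plus noise
latent = rng.normal(size=(n, 1))
items = latent + 0.6 * rng.normal(size=(n, k))

alpha = cronbach_alpha(items)
print("Cronbach's alpha:", round(alpha, 3))
```

Items that all measure the same construct with little noise give values close to 1, while unrelated items give values close to 0.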

Advanced SEM applications
• Path analysis:
  • Reciprocal implications (non-recursive models)
  • Direct, indirect and total effects
  • Mean structures (different means of latent variables)
• Regression with an estimation of correlations among variables (endogenous or exogenous, observed or latent)
  • Models with repeated observations
  • Models with longitudinal data (latent growth)
• Including categorical variables
• Multiple sample models, mixture models

You simply can’t do all this by combining regression and FA!
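Direct, indirect and total effects follow from the structural coefficient matrices. For a path model written as y = By + Γx + ζ, the total effects of x on y are (I − B)⁻¹Γ, the direct effects are Γ, and the indirect effects are their difference. The sketch below uses a hypothetical three-variable model (x → y1 → y2, plus a direct path x → y2) with made-up coefficients.

```python
import numpy as np

# Hypothetical recursive path model:
#   y1 = 0.5 x + zeta1
#   y2 = 0.4 y1 + 0.2 x + zeta2
B = np.array([[0.0, 0.0],
              [0.4, 0.0]])   # effects among endogenous variables
G = np.array([[0.5],
              [0.2]])        # direct effects of x on y1, y2

inv = np.linalg.inv(np.eye(2) - B)

total_x = inv @ G            # total effects of x on y
indirect_x = total_x - G     # effects transmitted through other y variables
total_y = inv - np.eye(2)    # total effects among endogenous variables

print("direct effects of x:  ", G.ravel())
print("indirect effects of x:", indirect_x.ravel())
print("total effects of x:   ", total_x.ravel())
```

Here the total effect of x on y2 is 0.2 + 0.5 × 0.4 = 0.4: the direct path plus the path through y1.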

Software for SEM estimation
• LISREL 9.1 (Jöreskog et al.)
• EQS 6.1 (Bentler et al.)
• Mplus 7 (Muthén et al.)
• SAS => PROC CALIS (SAS Institute)
• Statistica => SEPATH (StatSoft)
• SPSS => Amos (IBM)
• R => sem, lavaan, …

(Packages that I used to be familiar with are in bold, they are not necessarily the best ones…)

SEM applications in travel research
• Golob (2003) reviewed more than 50 papers on a wealth of topics:
  • Mode choice behaviors
  • Determinants of car ownership and use
  • Longitudinal and panel data analyses
  • Activity-based models
  • Travel attitudes-behaviors relationships
  • Driving behaviors and safety issues
• Obviously many more SEM papers have appeared since then, although I would have expected an even sharper increase

Example: primary utility (Diana, 2008)
• Travel demand derived only from the need to perform activities in different places…
  • Activity-based models
  • Utility-maximising models by minimising travel times
• …but is it always true?
  • «Teleportation test»: 3% of the sample indicates an ideal commute time <2 min, 50% >20 min (Mokhtarian, 2001)
  • Random utility models where travel-time coefficients >= 0: always garbage or…

Example: primary utility (Diana, 2008)
• Goal: capturing and measuring the «primary utility» latent construct
• Theoretical model => EFA => primary utility is due to different factors:
  • Importance of on-trip activities
  • Importance of activities at different locations
  • Ideal trip length
  • Travel-related cognitive and affective attitudes
  • Performances and use of the travel means
• Item analysis => 6 constructs are related to primary utility => second-order CFA

Model specification (Diana, 2008)

[Figure: model specification]

Primary utility measurement scale

[Figure; panel labels: Drivers versus transit riders; Commuting versus other trips]

Modal diversion (Diana, 2010)
• Modal diversion versus mode choice
• Demand for unknown services: «cognitive asymmetry» <=> SP surveys
• Attitudes and rational evaluations have a different relative importance according to the alternative
• Behavioral modal diversion model: the endogenous variable measures the propensity to change on a Likert scale
• Data limitations => submodel implementation and considering standardized estimations

Modal diversion (Diana, 2010)

Standardized estimation => comparing different structural coefficients
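The reason standardized estimates allow different coefficients to be compared can be shown with a plain regression: a standardized coefficient is the unstandardized one multiplied by sd(x)/sd(y), so it no longer depends on the measurement units. The sketch below uses simulated data; the variable names `x_time` and `x_cost`, their units and all numerical values are hypothetical and are not taken from the modal diversion model.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x_time = rng.normal(30, 10, size=n)     # hypothetical travel time, minutes
x_cost = rng.normal(300, 100, size=n)   # hypothetical cost, cents
y = -0.05 * x_time - 0.005 * x_cost + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x_time, x_cost])
b = np.linalg.lstsq(X, y, rcond=None)[0]   # unstandardized estimates

sd_x = X[:, 1:].std(axis=0, ddof=1)
sd_y = y.std(ddof=1)
b_std = b[1:] * sd_x / sd_y                # standardized estimates

print("unstandardized:", np.round(b[1:], 4))
print("standardized:  ", np.round(b_std, 3))
```

The unstandardized coefficients differ by a factor of about ten only because of the units, while the standardized ones show that the two variables have a similar relative importance. Unstandardized estimates remain the ones to use when the same coefficient is compared across samples.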

SEM with subsamples

Is there a difference in the diversion to buses and to shared taxis? => Comparing unstandardized estimations of the single structural equations in the two subsamples

Model with MULTIM
            All      Buses      DRT
REL_COST   -0.20    -0.11 *    -0.07 *
REL_TIME   -0.25    -0.39      -0.21
REL_WAIT   -0.15    -0.29      -0.14
REL_WALK   -0.14    -0.05 **   -0.15
MULTIM      0.17     0.29 *     0.15 *

Model with COGNIT
            All      Buses      DRT
REL_COST   -0.19    -0.08 *    -0.07 *
REL_TIME   -0.26    -0.38      -0.21
REL_WAIT   -0.13    -0.27      -0.11 *
REL_WALK   -0.09     0.01      -0.10 *
COGNIT     -0.20    -0.08 **   -0.29

* = not signif. at the 5% level
** = not signif. at the 20% level

Thank you for your attention!

Questions, remarks, …

Marco Diana

[email protected]

List of acronyms
• ADF-WLS = Asymptotically distribution-free weighted least squares
• CFA = Confirmatory factor analysis
• EFA = Exploratory factor analysis
• FA = Factor analysis
• ML = Maximum likelihood
• OLS = Ordinary least squares
• PCA = Principal components analysis
• SEM = Structural equation model
• VFR = Visiting friends and relatives

Mentioned references

• Diana, M. (2008) Making the “primary utility of travel” concept operational: a measurement model for the assessment of the intrinsic utility of reported trips, Transportation Research A, 42(3), 455-474.

• Diana, M. (2010) From mode choice to modal diversion: a new behavioural paradigm and an application to the study of the demand for innovative transport services, Technological Forecasting & Social Change, 77(3), 429-441.

• Golob, T.F. (2003) Structural equation modeling for travel behavior research, Transportation Research B, 37(1), 1-25.

• Hair, J.F., Anderson, R.E., Tatham, R.L., Black, W.C. (1998) Multivariate Data Analysis, 5th ed., Prentice Hall (more recent editions are now available).

• Mokhtarian, P.L., Salomon, I. (2001) How derived is the demand for travel? Some conceptual and measurement considerations, Transportation Research A, 35(8), 695-719.