Class 5

Additional Psychometric Characteristics: Validity and Bias, Responsiveness, Sensitivity to Change

October 22, 2009

Anita L. Stewart Institute for Health & Aging

University of California, San Francisco

Overview

Types of validity in health assessment
– Focus on construct validity

How bias affects validity
– Socially desirable responding and culture as sources of bias

Sensitivity to change

Validity

Does a measure (or instrument) measure what it is supposed to measure?

And…Does a measure NOT measure what it is NOT supposed to measure?

Valid Scale? No!

There is no such thing as a “valid” scale

We accumulate “evidence” of validity in a variety of populations in which it has been tested

Similar to reliability

Validation of Measures is an Iterative, Lengthy Process

Accumulation of evidence
– Different samples
– Longitudinal designs

Types of Measurement Validity

Content

Criterion

Construct
– Convergent
– Discriminant
– Convergent/discriminant

All can be:
– Concurrent
– Predictive

Content Validity:

Relevant when writing items

Extent to which a set of items represents the defined concept

Relevance of Content Validity to Selecting Measures

“Conceptual adequacy”

Does the “candidate” measure adequately represent the concept YOU are intending to measure?

Content Validity Appropriate at Two Levels

Battery or instrument: Are all relevant domains represented in an instrument?

Measure: Are all aspects of a defined concept represented in the items of a scale?

Example of Content Validity of Instrument

You are studying health-related quality of life (HRQL) in clinical depression
– Your HRQL concept includes sleep problems, ability to work, and social functioning

SF-36: a candidate
– Missing sleep problems

Criterion Validity

How well a measure correlates with another measure considered to be an accepted standard (criterion)

Can be
– Concurrent
– Predictive

Criterion Validity of Self-reported Health Care Utilization

Compare self-report with “objective” data (computer records of utilization)
– # MD visits past 6 months (self-report) correlated .64 with computer records
– # hospitalizations past 6 months (self-report) correlated .74 with computer records

Ritter PL et al, J Clin Epid, 2001;54:136-141

Criterion Validity of Screening Measure

Develop depression screening tool to identify persons likely to have disorder
– Do clinical assessment only on those who screen “likely”

Criterion validity
– Extent to which the screening tool detects (predicts) those with disorder
  » sensitivity and specificity, ROC curves
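A minimal sketch of how sensitivity and specificity could be computed when a screening tool is compared with a clinical assessment as the criterion. The function name and the ten-person data set are hypothetical, invented only for illustration.

```python
def sensitivity_specificity(screen_positive, has_disorder):
    """Return (sensitivity, specificity) of a screen against a criterion.

    Both arguments are equal-length sequences of 0/1 flags, one per person.
    """
    pairs = list(zip(screen_positive, has_disorder))
    tp = sum(1 for s, d in pairs if s and d)          # screened likely, has disorder
    fn = sum(1 for s, d in pairs if not s and d)      # missed cases
    tn = sum(1 for s, d in pairs if not s and not d)  # correctly screened out
    fp = sum(1 for s, d in pairs if s and not d)      # false alarms
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical data: screening result vs. clinical assessment for 10 people.
screen = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
disorder = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
sens, spec = sensitivity_specificity(screen, disorder)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

An ROC curve repeats this calculation at every possible cut-point on the screening score and plots sensitivity against 1 minus specificity.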

Criterion Validity of Measure to Predict Outcome

If goal is to predict health or other outcome
– Extent to which the measure predicts the outcome

Example: Develop self-reported war-related stress measure to identify vets at risk of PTSD
– How well does it predict subsequent PTSD (Vogt et al., 2004, readings)

Construct Validity Basics

Does measure relate to other measures in hypothesized ways?
– Do measures “behave as expected”?

3-step process
– State hypothesis: direction and magnitude
– Calculate correlations
– Do results confirm hypothesis?
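The 3-step process can be sketched as follows. This is an illustration only: the scores are invented, and the bounds of .30 to .60 used here to mean “moderate” are an assumption, not a rule from the lecture.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def supports_hypothesis(r, direction, low, high):
    """True if r has the hypothesized sign and |r| falls within [low, high]."""
    return (r > 0) == (direction > 0) and low <= abs(r) <= high


# Step 1: state the hypothesis (positive direction, moderate magnitude).
direction, low, high = +1, 0.30, 0.60  # bounds for "moderate" are assumed

# Step 2: calculate the correlation (hypothetical scores on two measures).
measure_a = [1, 2, 3, 4, 5, 6]
measure_b = [3, 1, 4, 2, 3, 5]
r = pearson_r(measure_a, measure_b)

# Step 3: do the results confirm the hypothesis?
confirmed = supports_hypothesis(r, direction, low, high)
print(f"r = {r:.2f}, hypothesis supported: {confirmed}")
```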

Source of Hypotheses in Construct Validity

Prior literature in which associations between constructs have been observed
– e.g., other samples, with other measures of constructs you are testing

Theory that specifies how constructs should be related

Clinical experience

Who Tests for Validity?

When measure is being developed, investigators should test construct validity

As measure is applied, results of other studies provide information that can be used as evidence of construct validity

Convergent Validity

Hypotheses stated as expected direction and magnitude of correlations

“We expect X measure of depression to be positively and moderately correlated with two measures of psychosocial problems”
– The higher the depression, the higher the level of problems on both measures

Testing Validity of Expectations Regarding Aging Measure

Hypothesis 1: ERA-38 total score would correlate moderately with ADLs, PCS, MCS, depression, comorbidity, and age

Hypothesis 2: Functional independence scale would show strongest associations with ADLs, PCS, and comorbidity

Sarkisian CA et al. Gerontologist. 2002;42:534

Hypothesis 1: ERA-38 total score would correlate moderately with ADLs, PCS, MCS, depression, comorbidity, and age (convergent)

Hypothesis 2: Functional independence scale would show strongest associations with ADLs, PCS, and comorbidity

ERA-38 Convergent Validity Results: Hypothesis 1

                      ERA-38     ERA Functional Independence
ADL                    .19**      .20***
PCS-12                 .27**      .32***
MCS-12                 .35**      .30**
Comorbidity           -.09*       ns
Depressive symptoms   -.33**     -.28**
Age                   -.24**     -.14**

ERA-38: Non-Supporting Convergent Validity Results

ADL .19** .20***

PCS-12 .27** .32***

MCS-12 .35** .30**

Age - .24** - .14**

Discriminant Validity: Known Groups

Does the measure distinguish between groups known to differ in concept being measured?

Tests for mean differences between groups
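One common test for mean differences across known groups is a one-way ANOVA. The sketch below computes the F statistic by hand; the three groups and their scores are hypothetical, and the resulting F would still need to be compared against an F distribution (from a table or a statistics package) to obtain a p-value.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA across a list of score lists."""
    all_scores = [s for g in groups for s in g]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual scores around their own group mean.
    ss_within = sum(
        (s - m) ** 2 for g, m in zip(groups, group_means) for s in g
    )
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical functioning scores for three groups expected to differ.
general_population = [90, 85, 95]
provider_patients = [80, 75, 85]
clinic_patients = [50, 45, 55]
f_stat = one_way_anova_f(
    [general_population, provider_patients, clinic_patients]
)
print(f"F = {f_stat:.1f}")
```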

Example of a Known Groups Validity Hypothesis

Among three groups:
– General population
– Patients visiting providers
– Patients in a public health clinic

Hypothesis: scores on functioning and well-being measures will be the best in a general population and the worst in patients in a public health clinic

Mean Scores on MOS 20-item Short Form in Three Groups

                      General       MOS         Public health
                      population    patients    patients
Physical function     91            78          50
Role function         88            78          39
Mental health         78            73          59
Health perceptions    74            63          41

Bindman AB et al., Med Care 1990;28:1142

PedsQL Known Groups Validity

Hypothesis: PedsQL scores would be lower in children with a chronic health condition than without

Child report:        Total score   Emotional functioning
Chronically ill*     77 (16)       76 (22)
Acutely ill*         79 (14)       77 (20)
Healthy              83 (15)       81 (20)

ANOVA, p = .001

* Different from healthy children, p < .05

JW Varni et al. PedsQL™ 4.0: Reliability and Validity of the Pediatric Quality of Life Inventory™ …, Med Care, 2001;39:800-812.

Convergent/Discriminant Validity

Does the measure correlate less strongly with measures it is not expected to be related to than with measures it is expected to be related to?

The extent to which the pattern of correlations conforms to hypothesis is confirmation of construct validity

Basis for Convergent/Discriminant Hypotheses

All measures of health will correlate to some extent

Hypothesis is of relative magnitude

Example of Convergent/Discriminant Validity Hypothesis

Expected pattern of relationships:
– A measure of physical functioning is “hypothesized” to be more highly related to a measure of mobility than to a measure of depression

Example of Convergent/Discriminant Validity Evidence

Pearson correlation:

                       Mobility   Depression
Physical functioning   .57        .25
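Because the hypothesis concerns relative magnitude, the check is a comparison between correlations rather than a test of any single one. A minimal sketch using the two correlations above; the helper function is hypothetical.

```python
def pattern_confirmed(convergent, discriminant):
    """True if every convergent correlation is larger in absolute value
    than every discriminant correlation."""
    return min(abs(r) for r in convergent) > max(abs(r) for r in discriminant)


# Correlations of physical functioning from the example above.
r_mobility = 0.57    # hypothesized to be the more highly related measure
r_depression = 0.25  # hypothesized to be the less highly related measure
confirmed = pattern_confirmed([r_mobility], [r_depression])
print(f"pattern conforms to hypothesis: {confirmed}")
```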

Hypothesis 1: ERA-38 total score would correlate moderately with ADLs, PCS, MCS, depression, comorbidity, and age (convergent)

Hypothesis 2: Functional independence scale would show strongest associations with ADLs, PCS, and comorbidity (convergent/discriminant)

ERA-38 Convergent/Discriminant Validity Results: Hypothesis 2

ADL .19** .20***

PCS-12 .27** .32***

MCS-12 .35** .30**

Age - .24** - .14**

ERA-38: Non-Supporting Validity Results

ADL .19** .20***

PCS-12 .27** .32***

MCS-12 .35** .30**

Age - .24** - .14**

Construct Validity Thoughts: Lee Sechrest

There is no point at which construct validity is established

It can only be established incrementally
– Our attempts to measure constructs help us better understand and revise these constructs

Sechrest L, Health Serv Res, 2005;40(5 part II), 1596

Construct Validity Thoughts: Lee Sechrest (cont)

“An impression of construct validity emerges from examining a variety of empirical results that together make a compelling case for the assertion of construct validity”

Construct Validity Thoughts: Lee Sechrest (cont)

Because of the wide range of constructs in the social sciences, many of which cannot be exactly defined...
– ...once measures are developed and in use, we must continue efforts to understand them and their relationships to other measured variables.

Interpreting Validity Coefficients

Magnitude and conformity to hypothesis are important, not statistical significance
– Nunnally: rarely exceed .30 to .40, which may be adequate (1994, p. 99)
– McDowell and Newell: typically between 0.40 and 0.60 (1996, p. 36)

Max correlation between 2 measures = square root of product of reliabilities
– 2 scales with .70 reliabilities, max correlation .70
– Correlation of .60 would be “high”
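The ceiling on an observed correlation can be worked through directly. The sketch below reproduces the .70 example; expressing the observed .60 as a fraction of the ceiling is an added illustration, not part of the original slide.

```python
from math import sqrt


def max_correlation(reliability_x, reliability_y):
    """Upper bound on the observed correlation between two measures,
    given the reliability of each."""
    return sqrt(reliability_x * reliability_y)


ceiling = max_correlation(0.70, 0.70)  # two scales with .70 reliabilities
observed = 0.60
fraction_of_ceiling = observed / ceiling  # how close .60 is to the maximum
print(f"ceiling = {ceiling:.2f}, observed / ceiling = {fraction_of_ceiling:.2f}")
```

Seen this way, a correlation of .60 between two scales with .70 reliabilities is most of the way to the largest value the data could show.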

Overview

Components of an Individual’s Observed Item Score (from Class 3)

Observed item score = true score + error (random and systematic)

Random versus Systematic Error

Observed item score = true score + error (random and systematic)

Random error: relevant to reliability

Systematic error: relevant to validity
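A small simulation can illustrate the difference between the two error components. All numbers here are hypothetical (true scores around 50, random error with SD 5, a constant systematic error of -3): random error scatters scores around the truth, while systematic error shifts the observed mean by the full amount of the bias.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable
n = 1000

true_scores = [random.gauss(50, 10) for _ in range(n)]  # hypothetical true scores
random_error = [random.gauss(0, 5) for _ in range(n)]   # varies person to person
systematic_error = -3.0                                 # same bias for everyone

# Observed scores with random error only, then with the bias added as well.
observed_random_only = [t + e for t, e in zip(true_scores, random_error)]
observed_with_bias = [o + systematic_error for o in observed_random_only]

# The bias moves the observed mean by its full size.
shift = mean(observed_with_bias) - mean(observed_random_only)
print(f"shift in observed mean due to systematic error: {shift:.2f}")
```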

Bias is Systematic Error

Affects validity of scores
– If scores contain systematic error, cannot know the “true” mean score
– Will obtain an observed score that is either systematically higher or lower than the “true” score

“Bias” or “Systematic Error”?

Bias implies that the direction of error is known

Systematic error: direction neutral
– Same error applies to entire sample

Sources of “Systematic Error” in Observed Scores of Individuals

Respondent
– Socially desirable responding
– Acquiescent response bias
– Cultural beliefs (e.g., not reporting distress)
– Halo effects

Observer
– Belief that respondent is ill

Instrument

Socially Desirable Responding

Tendency to respond in socially desirable ways to present oneself favorably

Observed score is consistently lower or higher than true score in the direction of a more socially acceptable score

Socially Desirable Response Set – Looking “good”

After coming up with an answer to a question, respondent “screens” the answer
– “Will this answer make the person like me less?”

May “edit” their answer

Systematic underreporting of “risk” behavior example
– A woman has 2 drinks of alcohol a day, but responds that she drinks a few times a week

Ways to Minimize Socially Desirable Responding

Write items and instructions to increase “acceptability” of an “undesirable” response

Instead of:
– “Have you followed your doctor’s recommendations?”

Use:
– “Have you had any of the following problems following your doctor’s recommendations?”

Acquiescent Response Set

Tendency to
– agree with statements regardless of content
– give “positive” response such as yes, true, satisfied

Extent and nature of bias depend on direction of wording of the questions

Minimizing acquiescence:
– Include positively- and negatively-worded items in the same scale
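Mixing item directions only helps if the negatively-worded items are reverse-scored before summing. A minimal sketch, assuming a hypothetical 4-item scale with 1-5 response options in which items q2 and q4 are negatively worded:

```python
def score_scale(responses, reverse_items, min_score=1, max_score=5):
    """Sum item responses after reverse-scoring negatively-worded items."""
    total = 0
    for item, value in responses.items():
        if item in reverse_items:
            value = min_score + max_score - value  # 5 becomes 1, 4 becomes 2, ...
        total += value
    return total


# A respondent who "strongly agrees" (5) with every item regardless of content.
always_agree = {"q1": 5, "q2": 5, "q3": 5, "q4": 5}
negatively_worded = {"q2", "q4"}  # hypothetical item labels
total = score_scale(always_agree, negatively_worded)
print(total)
```

With two items in each direction, a respondent who agrees with everything lands at the midpoint of the possible range (4 to 20) rather than at the top, so acquiescence no longer inflates the scale score.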

Example of Systematic Error Due to Cultural Norms or Beliefs

A person feels sad “most of the time”

Unwilling to admit this to the interviewer, so answers “a little of the time”
– Not culturally appropriate to admit to negative feelings
– Always present a positive personality

Observed response reflects less sadness than “true” sadness of respondent

Discrepancies in Information Sources: Systematic Error or Different Perspectives?

In reporting on a patient’s well-being
– Patients report highest levels
– Clinicians report levels in the middle
– Family members report the lowest levels

No way to know which is the “true” score
– To say one score is “biased” implies another one is the “true score”

Overview

Sensitivity to Change: Two Issues

Measure able to detect true changes

One knows how much change is meaningful on the measure

Measure Able to Detect True Change

Sensitive to true differences or changes in the attribute being measured

Sensitive enough to measure differences in outcomes that might be expected given the relative effectiveness of treatments
– Ability of a measure to detect change statistically

Importance of Sensitivity

Need to know measure can detect true change if planning to use it as outcome of intervention

Approaches for testing sensitivity are often simultaneous tests of
– effectiveness of an intervention
– sensitivity of measures

Measuring Sensitivity

Score is stable in those who are not changing

Score changes in those who are actually changing (true change)

One method
– Identify groups “known” to change
– Compare changes in measure across these groups

Sensitivity to Change Evidence for PHQ-9 (Short Screener for Depression)

Classified patients with major depression (DSM-IV criteria) over time as:
– Persistent depression
– Partial remission
– Full remission

Examined PHQ-9 change scores in these “known groups”

Löwe B et al. Med Care, 2004;42:1194-1201

Changes in PHQ-9 Scores by Change in Depression at 6 Months

                        Mean change   Effect size
Persistent depression   -4.4          -0.9
Partial remission       -8.8          -1.8
Full remission          -13.0         -2.6

Löwe et al, 2004, p. 1200

Considerations in Developing CHAMPS Physical Activity (PA) Questionnaire

Needed outcome measure to detect PA changes due to CHAMPS lifestyle intervention
– increase PA levels in everyday life (e.g., walking, stretching) in activities of their choice

Existing measures designed to capture younger persons’ PA

Stewart AL et al. Med Sci Sports Exerc, 2001;33:1126-1141.

Changes in Measure Resulting from Intervention: Validity Evidence for Others

After CHAMPS intervention detected PA change, others used our results as evidence of “sensitivity to change”
– Used in Project ACTIVE because of its sensitivity to change in CHAMPS (S Wilcox et al, Am J Pub Health, 2006;96:1201-1209)

Sensitivity to Change: Two Issues

Measure able to detect true changes

One knows how much change is meaningful on the measure

Relevant or Meaningful Change

Is the observed change important?

To clinician:
– change might influence patient management

To patient:
– patient notices change
– amount of change matters

“Minimal Important Difference” (MID)

The minimal difference that would result in a change in treatment

The smallest change perceived by patients as beneficial

Two Basic Approaches to Estimate MID

Anchor-based methods
– Require external criterion of change

Distribution-based methods
– Statistical indicators of change

Anchor-Based Approaches to Estimating MID

Requires longitudinal studies

Criteria:
– Clinical endpoints
– Patient-rated global improvement
– Some combination

Example of Anchor-Based Approach

Identify a subgroup in a study that has changed by a “minimal” amount
– Clinical change
– Patient-reported change

Change score in a relevant health measure for this subgroup = MID

Locating Groups that Have Changed “Minimally”

Administer a global rating of change (perceived change) by patients – the anchor

Select subset that reported “somewhat better” or “somewhat worse”
– change in a relevant health measure for this subset = MID

Two Categories Can Define “Minimal Change” Groups

Since your surgery, how would you rate the amount of change in your physical functioning?
– Much worse
– Somewhat worse
– About the same
– Somewhat better
– Much better

Meaning of Change Depends on Direction of Change

A change for the better may result in a different MID than a change for the worse

May need to evaluate these as separate estimates
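The anchor-based approach, with separate estimates for each direction, might be sketched as below. The records pair each patient's global rating of change (the anchor) with that patient's change score on the health measure; all values are hypothetical.

```python
from statistics import mean


def mid_for_anchor(records, category):
    """Mean change score among patients who chose one anchor category."""
    changes = [change for rating, change in records if rating == category]
    return mean(changes)


# Hypothetical (global rating of change, change score on health measure) pairs.
records = [
    ("somewhat better", 1.5),
    ("somewhat better", 1.0),
    ("about the same", 0.5),
    ("much better", 3.0),
    ("somewhat worse", -1.0),
    ("somewhat better", 2.0),
    ("somewhat worse", -0.5),
]

# Estimate the MID separately for improvement and for deterioration.
mid_improvement = mid_for_anchor(records, "somewhat better")
mid_deterioration = mid_for_anchor(records, "somewhat worse")
print(mid_improvement, mid_deterioration)
```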

Example: Mean 2-week Change Score in Symptom Measure by Perceived Change

                   Mean change
Much better        2.25
A little better    1.41     minimal positive change?
About the same     0.42
A little worse     -0.29    minimal negative change?
Much worse         -0.10

C Paterson. BMJ, 1996;312:1016-20.

Distribution-Based Methods

Ways of expressing the observed change in a standardized metric

Three commonly used:
– Effect size (ES)
  » Mean change divided by SD at baseline
– Standardized response mean (SRM)
  » Mean change divided by SD of changes
– Responsiveness statistic (RS)
  » Mean change divided by SD of change for people who have not changed
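The three indices share a numerator (mean change) and differ only in the standard deviation used as the denominator. A minimal sketch with hypothetical scores; it assumes sample SDs (n - 1 denominator), which the definitions above do not specify.

```python
from statistics import mean, stdev


def change_scores(baseline, followup):
    """Follow-up minus baseline for each person."""
    return [f - b for b, f in zip(baseline, followup)]


def effect_size(baseline, followup):
    """ES: mean change divided by SD at baseline."""
    return mean(change_scores(baseline, followup)) / stdev(baseline)


def standardized_response_mean(baseline, followup):
    """SRM: mean change divided by SD of the change scores."""
    changes = change_scores(baseline, followup)
    return mean(changes) / stdev(changes)


def responsiveness_statistic(baseline, followup, stable_changes):
    """RS: mean change divided by SD of change among people who have not changed."""
    return mean(change_scores(baseline, followup)) / stdev(stable_changes)


# Hypothetical data: five treated patients, plus change scores from five
# stable patients (those judged not to have changed).
baseline = [10, 12, 14, 16, 18]
followup = [13, 14, 18, 19, 21]
stable_changes = [0, 1, -1, 0, 0]

es = effect_size(baseline, followup)
srm = standardized_response_mean(baseline, followup)
rs = responsiveness_statistic(baseline, followup, stable_changes)
print(f"ES = {es:.2f}, SRM = {srm:.2f}, RS = {rs:.2f}")
```

The same mean change of 3 points yields different values depending on the denominator, which is why the index used should always be reported alongside the number.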

Mean 4-week Change Score in Four Measures and Responsiveness Statistic

Measure       Patients who are      Patients who are      Responsiveness
              “about the same”      “a little better”
Symptom 1     0.58                  1.64                  1.14
Activity      0.46                  1.64                  1.33
Well-being    0.39                  0.68                  0.39

C Paterson. BMJ, 1996;312:1016-20.

Note: scores range from 1-7; Higher change scores indicate improvement

Multi-item Measures: More Likely to Detect Change

Instrument needs to have sufficient variability to detect change
– Multi-item scales: many scale levels

Look for evidence of good variability in sample like yours (at baseline)
– Room to improve

Effect Size of Changes in Health Due to Treatment for Menstrual Bleeding

                                     Drugs    Surgery
Self-rated health item               -.18     -.10
Health perceptions scale (5 items)   -.03     -.64
Energy/vitality                      -.23     -.89
Mental health                        -.14     -.65
Pain                                 -.12     -.73

C Jenkinson et al. Qual Life Res, 1994;3:317-321.

Summary: MID of Measures

MID is based on evidence from multiple studies
– Over time, learn whether evidence is strong for a particular MID

MID of a measure in one context may not generalize to another one
– e.g., MID for treatment of pain in cancer may differ from MID for treatment of back pain

Readings as a Resource

Farivar et al.
– Issues in measuring MID

Stewart et al.
– Methods for assessing validity (as developed for the Medical Outcomes Study)

Sechrest
– Classic commentary on validation issues

Next Class (Class 6)

Factor analysis with Steve Gregorich

Homework

Complete rows 20-26 in matrix for your two measures
– Validity, responsiveness and sensitivity to change, scoring, and costs
