Statistical and Qualitative Approaches for Facilitating ...

70
Statistical and Qualitative Approaches for Facilitating Comparability in Cross- Lingual Assessments Stephen G. Sireci Center for Educational Assessment University of Massachusetts, USA Presentation delivered at the OECD Seminar on Translating and Adapting Instruments in Large- Scale Assessments June 8, 2018, Paris © Copyright, Stephen G. Sireci, 2018. All rights reserved

Transcript of Statistical and Qualitative Approaches for Facilitating ...

Page 1: Statistical and Qualitative Approaches for Facilitating ...

Statistical and Qualitative Approaches for Facilitating

Comparability in Cross-Lingual AssessmentsStephen G. Sireci

Center for Educational AssessmentUniversity of Massachusetts, USAPresentation delivered at the OECD Seminar on Translating and Adapting Instruments in Large-

Scale AssessmentsJune 8, 2018, Paris

© Copyright, Stephen G. Sireci, 2018. All rights reserved

Page 2: Statistical and Qualitative Approaches for Facilitating ...

What is cross-lingual assessment?

lAssessing people who communicate using different languages, by using multiple language versions of an assessment.

Page 3: Statistical and Qualitative Approaches for Facilitating ...

Why have multiple language versions of a test?

lNeed to compare individuals who function in different languages– International comparisons– Cross-cultural research– Certification Testing– Employee surveys– Employee selection– Admissions testing– District testing

Page 4: Statistical and Qualitative Approaches for Facilitating ...

Why have multiple language versions of a test (cont.)?

lEnhance fairness:–Remove the language barrier.–Adjust tests for culture- or

language-related differences that are not relevant to “construct” measured.

–Provide options for people to select best language.

Page 5: Statistical and Qualitative Approaches for Facilitating ...

Multiple-language versions of a test

lSeen as one way to promote FAIRNESS by allowing examinees to access and interact with the test in their native language.

Page 6: Statistical and Qualitative Approaches for Facilitating ...

The Standards for Educational & Psychological Testing

A test that is fair within the meaning of the Standardsa) reflects the same construct(s) for all

test takersb) scores from it have the same meaning

for all individuals in the intended population

c) a fair test does not advantage or disadvantage some individuals because of characteristics irrelevant to the intended construct.

Page 7: Statistical and Qualitative Approaches for Facilitating ...

The (1999) Standards pointed out…any test that employs language is,

in part, a measure of [students’] language skills… This is of particular concern for test takers whose first language is not the language of the test… In such instances, test results may not reflect accurately the qualities and competencies intended to be measured. (AERA, et al., 1999, p. 91)

Page 8: Statistical and Qualitative Approaches for Facilitating ...

Federal law in the United States

(NCLB) required states to assess English learners using “assessments in the language and form most likely to yield accurate data on what such students know and can do in academic content areas, until such students have achieved English language proficiency” (NCLB, 2002, cited in Stansfield, 2003, p. 203).

Page 9: Statistical and Qualitative Approaches for Facilitating ...

Interest in “multilingualism” in the USA is very different from 100 years ago:“If English was good enough for

Jesus, it’s good enough for the school children of Texas.”

Texas Governor James “Pa” Ferguson (1917) after vetoing a bill to finance the teaching of foreign languages in classrooms.

Page 10: Statistical and Qualitative Approaches for Facilitating ...

How do we assess people who communicate using different languages?

lThe most common procedure is to translate (adapt) an existing test into other languages.

Page 11: Statistical and Qualitative Approaches for Facilitating ...

In this talk, I will describe

1. validity issues2. “quality control” procedures3. research designs4. statistical methodsfor developing and evaluating tests designed for cross-lingual assessment

Page 12: Statistical and Qualitative Approaches for Facilitating ...

I will also discuss

5. Standards and Guidelines relevant to cross-lingual assessment– AERA et al. (2014) Standards– ITC Guidelines on Translating and

Adapting Tests (2017)– Briefly (Hambleton, 2018—today!)

Page 13: Statistical and Qualitative Approaches for Facilitating ...

ITC Guidelines on Translating and Adapting TestslMost famous of all the ITC guidelineslFirst published by Hambleton (1994)lMany versions & revisions

– 2005, 2010, 2005, 2017

Page 14: Statistical and Qualitative Approaches for Facilitating ...

Standards for test score interpretation (1)lAERA et al. Standards (2014):Simply translating a test from one

language to another does not ensure that the translation produces a version of the test that is comparable in content and difficulty level to the original version of the test, or that the translated test produces scores that are equally reliable/precise and valid... (p. 60)

Page 15: Statistical and Qualitative Approaches for Facilitating ...

ITC “Test Adaptation” Guidelines(2017)

l Instrument developers/publishers should implement systematic judgmental evidence, both linguistic and psychological, to improve the accuracy of the adaptation process and compile evidence on the equivalence of all language versions.

Page 16: Statistical and Qualitative Approaches for Facilitating ...

ITC “Test Adaptation” Guidelines(2017)

lTest developers/publishers should apply appropriate statistical techniques to (1) establish the equivalence of the language versions of the test, and (2) identify problematic components or aspects of the test that may be inadequate in one or more of the tested populations.

Page 17: Statistical and Qualitative Approaches for Facilitating ...

SUMMARY: What do the professional standards say?

lAdapted tests cannot be assumed to be equivalent.

lResearch is necessary to demonstrate test and score comparability.

lStatistical methods can help evaluate item/test comparability.

Page 18: Statistical and Qualitative Approaches for Facilitating ...

Test Adaptation and Research Methodology

Good research designs are needed for – The translation/adaptation process

(qualitative)– Evaluating the similarity

(comparability) of the scores from different language tests (quantitative)

– Establishing formal relationships among the different tests (quantitative)

Page 19: Statistical and Qualitative Approaches for Facilitating ...

Adaptation and Validity

lAll research designs are designed to provide evidence of validity– of score interpretations– for the use of the test scores

But what is validity?

Page 20: Statistical and Qualitative Approaches for Facilitating ...

Standards’ Definition

l“Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests”

AERA, APA, & NCME (2014, p. 11)

Page 21: Statistical and Qualitative Approaches for Facilitating ...

Therefore, validity issues in cross-lingual assessment must consider the purpose of the assessment

lValidation involves gathering data to support (defend) the use of a test for a particular purpose.

lDifferent research designs are needed to support different uses of test scores (i.e., different interpretations of scores)

Page 22: Statistical and Qualitative Approaches for Facilitating ...

Adapting a test to use in another language/culturelIf test score interpretations are to

be made WITHIN a language group, no evidence of comparability across languages is needed.– Evidence that scores are appropriate

for whatever they are used for is (obviously) still needed.

– E.g., Scores on the Prueba de Aptitud Academica used only for selection into graduate school in Puerto Rico

Page 23: Statistical and Qualitative Approaches for Facilitating ...

However, if scores are to be compared ACROSS language groups:lNeed to validate comparative

inferences– E.g, TIMSS, PISA

lHow?– Demonstrate construct

equivalence– Rule out method bias– Rule out item bias(van der Vijver)

Page 24: Statistical and Qualitative Approaches for Facilitating ...

Are scores from different language versions of an exam supposed to be “comparable?”lIf comparable means “equated,”

or on the same scale, this is probably impossible.– In its strictest sense, equating

implies examinees would get the same score no matter which “form” of the exam they took.

However, comparability does not need to mean “equivalent” or equated.

Page 25: Statistical and Qualitative Approaches for Facilitating ...

Can test scores from different language versions of an exam be considered “comparable?”lThrough (a) careful adaptation

procedures, and (b) statistical evaluation of score comparability, an argument can be made that cross-linguistic inferences are appropriate.

Page 26: Statistical and Qualitative Approaches for Facilitating ...

Adapting “comparable” tests across languages

Involves1. Quality adaptation involving multiple

steps, multiple translators, and multiple quality control checks

2. Statistical analysis of structural equivalence and differential item functioning

3. Qualitative analysis of item bias and method bias

4. Developing a sound validity argument for comparative inferences

Page 27: Statistical and Qualitative Approaches for Facilitating ...

What can we do to promote fairness in cross-lingual assessment?lQualitative: Develop and translate

tests carefullylQuantitative: Evaluate the

statistical characteristics of test and item scores– factor structure– differential item functioning– consistency of criterion-related validity

evidence

Page 28: Statistical and Qualitative Approaches for Facilitating ...

Questions that statistical techniques can help us answerlHow “similar” are tests?lHow “similar” are items?lHow “comparable” are test scores?

Page 29: Statistical and Qualitative Approaches for Facilitating ...

Methods for evaluating construct equivalence:lDifferential predictive validity

studieslExploratory factor analysislConfirmatory factor analysislMultidimensional scalingl IRT residual analysislDifferential item functioning

Page 30: Statistical and Qualitative Approaches for Facilitating ...

Problem

lA valid external criterion for differential prediction is hard to find.

Page 31: Statistical and Qualitative Approaches for Facilitating ...

Exploratory Factor Analysis

lPrincipal components analysislCommon factor analysis

Common practice is to conduct separate analysis in each language group and compare solutions.– Same number of factors?– Items load (cluster) in the same way?– Subjective comparison of solutions.

Page 32: Statistical and Qualitative Approaches for Facilitating ...

Multigroup (simultaneous) AnalyseslConfirmatory factor analysis (CFA)

(or Structural Equation Modeling—SEM)

lMultidimensional scaling– Weighted MDS

lUnlike exploratory factor analysis, both SEM and MDS allow for simultaneous analysis across multiple groups

Page 33: Statistical and Qualitative Approaches for Facilitating ...

CFA Model

y = Λη + εwhere y is a (p x 1) column vector of scores for person i on the p items,

Λ is a (p x q) matrix of loadings of the p items on the q latent variables, η is a q x 1 vector of latent variables, and ε is (p x p) column vector of measurement residuals.

Page 34: Statistical and Qualitative Approaches for Facilitating ...

CFA Model (continued)

y = Λη + ε

Sx=observed variance/covariance matrix,F=variance/covariance matrix of latent variables, andQd=diagonal matrix of error variances.

dQ+ÙFÙ=S 'xxx

Multiple-group CFA involves evaluating the invariance of Sx, F, and/or Qd across groups

Page 35: Statistical and Qualitative Approaches for Facilitating ...

å=

-=r

ajaiakaijk xxwd

1

2)(

Weighted MDS model

i,j=items, k=group, a=dimension, x=coordinate, w=group weight on dimension

Page 36: Statistical and Qualitative Approaches for Facilitating ...

Example of using CFA & MDS to evaluate different language versions of a testlSireci, Bastari, & Allalouf (1998)

Psychometrics Entrance Test (PET)– Used in Israel for postsecondary

admissions decisions– We looked at Hebrew and Russian

versions of items from the Verbal Reasoning section of the exam.

Page 37: Statistical and Qualitative Approaches for Facilitating ...

PET: Analysis of Construct EquivalencelVerbal reasoning test:

• item types: (a) analogies, (b) logic, (c) reading comprehension, (d) sentence completion

l2 test forms, 2 language versions• Hebrew• Russian

lMethods• PCA• Confirmatory Factor Analysis (CFA)• Weighted MDS

Page 38: Statistical and Qualitative Approaches for Facilitating ...

PET: CFA Results(4-Factor Model)

Model GFI RMR

Common Factor Structure .97 .057

= Factor loadings .96 .060

= Uniquenesses .96 .066

= Correlations among factors

.96 .076

Page 39: Statistical and Qualitative Approaches for Facilitating ...

Next, lets look at the MDS resultslFirst, we will look at the “item

space,”lthen, we will look at the “weight

space,” which contains the information regarding structural equivalence

Page 40: Statistical and Qualitative Approaches for Facilitating ...

Figure 1

MDS Configuration of PET Verbal Items

AL=Analogy, LO=Logic,RC=Reading Compr., SC=Sentence Compl.

Dimension 1 (Reading Comprehension)

3.02.01.00.0-1.0-2.0-3.0

Dim

ensio

n 2

(Log

ical R

easo

ning

) 3

2

1

-1

-2

-3

RCRC

RC

RCRC

LO

LOLO

LOLO

LOSC

SCSC

SCSC

AL AL

AL

AL

RCRCRC

RCRCLO

LO

LOLOLO

LOSC

SCSC

SCSC

ALALALALAL

Page 41: Statistical and Qualitative Approaches for Facilitating ...

Figure 2

Group Weights for PET Data

Note: H=Hebrew, R=Russian

Dimension 1

.40.35.30.25.20.15.10.050.00

Dim

ensio

n 2

.40

.35

.30

.25

.20

.15

.10

.050.00

R R

HH

å=

-=r

ajaiakaijk xxwd

1

2)(

Page 42: Statistical and Qualitative Approaches for Facilitating ...

Another MDS Example: An International Credentialing Exam (Robin et al., 2000)lMore typical:

– Large English volume– Small international volume

l4 Languages– English– Romance language– Two different Altaic languages

l International sample sizes <200.

Page 43: Statistical and Qualitative Approaches for Facilitating ...

The next slide shows the MDS “weight space”lComparing 4 language versions of

the test:1. Altaic language 12. Altaic (different) language 23. Romance language4. English language (3 random samples,

varying sample sizes to match other groups)

Page 44: Statistical and Qualitative Approaches for Facilitating ...

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

Dimension 1(English)

Dimension2

ES2ES1

E

AL1RLO

AL2

Conclusion: Different dimensions were needed to account for the structure of each language version.

Page 45: Statistical and Qualitative Approaches for Facilitating ...

Another example: Employee opinion survey (Sireci, Harter , Yang, & Bhola, 2003)

ØA very large international communications company

ØAvailable in 8 languagesØ47 different culturesØAvailable in P&P & web formatsØ50, 5-point Likert-type items

Page 46: Statistical and Qualitative Approaches for Facilitating ...

Figure 3

Language Group MDS Weights

CE=Can. English, CF=Can. French, FR=French, IR=Ireland (English)

UK=United Kingdom, US=United States, SP=Spanish

Dimension 5 (Interpersonal Relations)

.70.60.50.40.30.20.100.00Dim

ensi

on 3

(Glo

bal S

atis

fact

ion)

.7

.6

.5

.4

.3

.2

.1

0.0

IR

UK

UK

CF

CF

CE

CE

FR

SP

US

US

Page 47: Statistical and Qualitative Approaches for Facilitating ...

Figure 2

Web/Paper Group MDS Weights

P=Paper Suvey, W=Web Survey

Dimension 3 (Global Satisfaction)

.70.60.50.40.30.20.100.00

Dim

ensi

on 2

(Bus

. Obj

ectiv

es) .7

.6

.5

.4

.3

.2

.1

0.0

W

W

P

W

P

W

P

W

P

W

P

Page 48: Statistical and Qualitative Approaches for Facilitating ...

IRT Residual Analysis

lFit IRT model(s) to the data for each language group.– Is fit adequate in both groups?– Are residuals (errors) small in both

groups?(Reise, Widaman, & Pugh, 1993)

Page 49: Statistical and Qualitative Approaches for Facilitating ...

Evaluating DIF/item biaslCareful translation assesses

potential differences across language versions of an item.

lJust as item analysis catches problems item writers missed, cross-lingual DIF analyses catch translation problems.

lCross-lingual DIF analyses also catch differences in cultural familiarity.

Page 50: Statistical and Qualitative Approaches for Facilitating ...

Important NotelMethods for detecting DIF were not

designed for studying translated/adapted items.

lProblem:(a) Cannot assume translated/adapted

items are equivalent(b) Cannot assume different language

groups are equivalent

Page 51: Statistical and Qualitative Approaches for Facilitating ...

How is translation DIF different from “normal” DIF?l Items are NOT the “same” for

studied groups.lGroups cannot be considered

randomly equivalent.lDifferent groups, different items….

Page 52: Statistical and Qualitative Approaches for Facilitating ...

How can this problem be solved?l It cannot be solved.

lHowever, there are (at least) 4 things that can help:

1) Careful adaptation procedures2) Advanced research designs3) Aforementioned statistical analyses4) Making certain assumptionsSystematic bias must be ruled out to

justify conditioning variable in DIF analyses.

Page 53: Statistical and Qualitative Approaches for Facilitating ...

1) Careful adaptation proceduresExample: Angoff and Cook (1988)l Items in Spanish translated to

English.l Items in English translated to

Spanish.l Independent translators evaluated

translations.lAlso, iterative DIF screening

procedures.

Page 54: Statistical and Qualitative Approaches for Facilitating ...

2) Advanced research designsExample: Sireci & Berberoglu (2000)lBilingual group designlTwo (randomly equivalent) groups

of Turkish-English bilinguals took counterbalanced English/Turkish surveys.

lRandom equivalence was tested.lPolytomous (Likert) data:

– Samejima’s graded response model was used (IRT-LR)

Page 55: Statistical and Qualitative Approaches for Facilitating ...

2) Advanced research designs

Sireci & Berberoglu (2000)– Items identified for DIF were

removed from conditioning variable (theta) before making comparisons across groups• i.e., purified criterion

– Non-DIF items could be used to anchor scale across languages

Page 56: Statistical and Qualitative Approaches for Facilitating ...

Other advanced research design idea:lLink score scales through an

external criterion.– (Wainer, 1993)– Separate predictive validity

studies– Logical, but has it ever been

used?lProblem: How to validate

criterion.

Page 57: Statistical and Qualitative Approaches for Facilitating ...

Other approaches

Use DIF screening to identify items to form link.– E.g., PET exam Allalouf, A., Rapp, J., & Stoller, R. (2009).

What item types are best suited to the linking of verbal adapted tests? International Journal of Testing, 9, 92-107.

lAlso used by PIRLS, PISA, TIMSS

Page 58: Statistical and Qualitative Approaches for Facilitating ...

3) Statistical analyseslAnalysis of structural

equivalence– Evaluate factorial invariance– Justify matching variable for DIF

analysisl“Double Linking” equating

evaluation– Rapp & Allalouf (2003). Evaluating

cross-lingual equating. International Journal of Testing, 3, 101-117.

Page 59: Statistical and Qualitative Approaches for Facilitating ...

4) Making certain assumptionslGroups are randomly equivalent

– Canada

lAnchor items are equivalent across languages (Allalouf et al., 2009)– Screened items– Non-verbal items

lNo systematic biasThese assumptions must be defended!

Page 60: Statistical and Qualitative Approaches for Facilitating ...

SUMMARY: Evaluating Cross-lingual Assessments:

What have we learned?1. Judgmental methods for evaluating

adapted tests and items are not enough.

2. Statistical procedures are also necessary.

3. Relatively simple DIF detection procedures are effective for identifying “big” problem items, even when sample sizes are small.

Page 61: Statistical and Qualitative Approaches for Facilitating ...

Lessons Learned (continued)4. CFA and MDS are useful for

evaluating construct equivalence across different language versions of a test.

Need to look at structure simultaneously across multiple groups.CFA: confirmatoryMDS: exploratory

Page 62: Statistical and Qualitative Approaches for Facilitating ...

Lessons Learned (continued)

5. DIF is often explainable.ØDifferences in cultural relevanceØTranslation problemsØAdministration problems

Page 63: Statistical and Qualitative Approaches for Facilitating ...

Lessons Learned (continued)

6. Validate testing purpose(s) associated with cross-lingual assessment

Page 64: Statistical and Qualitative Approaches for Facilitating ...

If scores/people are comparedacross languageslSome type of linking is needed.lLinking based on subset of items

most likely to be psychologically and statistically equivalent may be defensible.– Confirmed by translation teams– Pass DIF screening procedures– Representative of total test– Common stratifying variable(s)

defended via structural analysis

Page 65: Statistical and Qualitative Approaches for Facilitating ...

Lessons Learned (continued)7. Given that linking is not equating,

cross-lingual inferences should be cautiously interpreted.– What cautions should international

assessments provide for cross-lingual comparisons?

– What cautions should they provide for same-language, cross-cultural comparisons?

– What cautions should be provided for same language/culture comparisons over time?

Page 66: Statistical and Qualitative Approaches for Facilitating ...

Next Steps/Future ResearchlWhat are the best ways to interpret

and communicate cross-lingual inferences?

lUsing the AERA et al Standards 5 sources of validity evidence to evaluate cross-lingual inferences.

lUsing lessons learned from DIF analyses to avoid future adaptation problems.

Page 67: Statistical and Qualitative Approaches for Facilitating ...

Concluding remarksWe continue to make progress on

methods for evaluating adapted tests,

but of course, more research is needed.

We continue to make progress on the process of test adaptation.

ITC Guidelines (2017) are a good foundation upon which we should build.

Page 68: Statistical and Qualitative Approaches for Facilitating ...

Resources for Learning MoreInternational Test Commission

http://www.intestcom.org/Recommended reading:1) Adapting Educational &

Psychological Tests for Cross-Cultural Assessment (Hambleton, Merendea, & Spielberger, 2005)

2) International Journal of Testing3) Cross-cultural research methods

in psychology (Matsumoto & van de Vijver, 2011)

Page 69: Statistical and Qualitative Approaches for Facilitating ...

Concluding remarks:

Lots of tough problems, but lots of progress.

Let’s move the field forward!

Page 70: Statistical and Qualitative Approaches for Facilitating ...

Thank you to OECD for the invitation!

And to you, for your attention.

Keep in touch

[email protected]

And see you in Montreal!

https://www.itc-conference.com