Maryam Bolouri

VALIDITY IN EDUCATIONAL ASSESSMENT, CHAPTER ONE (NEWTON & SHAW, 2014)

Topics to be discussed:

Validity definition

Types of Validity

The history of validity

Validity: 100 years of debate

Agreement (bland consensus) among scholars:

The hallmark of quality in testing

The most important criterion for evaluating a test

Discrepancies:

Different perspectives and Meaning of validity:

1) Measurement concept (original scientific definition in 1920s)

2) Measurement and decision making concept (between these two extremes, scientific and ethical)

3) Measurement, decision making and impacts concept (social and ethical definition, concerns test use)

Why validity is so confusing:

1) Very large and disparate literature on V within edu and psych measurement

2) Very large and disparate literature on V within other disciplines

3) The official meaning of V has evolved over time

4) Gained stature and expansion over time

5) Related accounts are hard to read

6) V is employed in different ways and contexts, so the intended meaning is often unclear. If a technical sense is intended, which discipline's sense is it?

Validity (definition)

Validity is the property or quality of being valid, true, cogent or legally acceptable (everyday definition)

Validity has been associated with different meanings across disciplines.

Validity theory: conceptual framework

Validation practice: investigation into V or a process of investigation guided by V theory

It derives from the Latin word “validus”, meaning strong, healthy, worthy

Validity across disciplines:

Philosophy and logic: a deductive argument is valid if and only if it is not possible for all its premises to be true while its conclusion is false.

This sense is not about validation; it concerns the strength of the argument itself

Law and economics

Genetic testing

Management

Edu and psych measurement
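The definition from philosophy and logic above can be checked mechanically for small propositional arguments. The following is a minimal sketch, not part of the chapter: the helper name `is_valid` and the two example arguments are illustrative assumptions. It searches every truth assignment for a counterexample, i.e. a case where all premises are true and the conclusion is false.

```python
from itertools import product

def is_valid(premises, conclusion, n_vars):
    """Hypothetical helper: an argument is deductively valid if no truth
    assignment makes every premise true while the conclusion is false."""
    for values in product([True, False], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False  # counterexample found
    return True

# Modus ponens: "if p then q", "p", therefore "q".
modus_ponens = is_valid(
    premises=[lambda p, q: (not p) or q, lambda p, q: p],
    conclusion=lambda p, q: q,
    n_vars=2,
)

# Affirming the consequent: "if p then q", "q", therefore "p".
affirming_consequent = is_valid(
    premises=[lambda p, q: (not p) or q, lambda p, q: q],
    conclusion=lambda p, q: p,
    n_vars=2,
)

print(modus_ponens, affirming_consequent)
```

The first argument has no counterexample and is valid; the second fails when p is false and q is true.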

Validity for research or measurement?

V for research (Campbell, 1957): relevant to conclusions based on evidence, of 2 kinds:

Internal V: degree of confidence that the conclusion or observed effect is genuine for the experimental group and that the design rules out other, irrelevant explanations

External V: the degree of confidence that conclusions can be generalized from the experimental group to the intended population. It is decreased if the sampling is biased or the process itself causes the effect.

Validity for research

External V is subdivided into:

1. Population V: confidence in the generalization of conclusions across populations

2. Ecological V: confidence in the generalization of conclusions across conditions

a) Outcome V: across dependent variables

b) Temporal V: across time

c) Treatment V: across treatment variations

Cook and Campbell's four-way classification, or four major decisions (1979)

Internal V split into: statistical conclusion V + internal V

External V split into: construct V + external V

Validity is confidence in the credibility of description and interpretation or the legitimacy of the produced knowledge

It is the social consequence of qualitative research

Lather's 3 main conceptions of V (1986):

1. Face V: member checking

2. Construct V: systematized reflexivity of researcher’s theory in response to the data

3. Catalytic V: facilitating the transformation of reality by reorienting, focusing, and energizing the participants

These two Vs are confusing:

1. Same key contributors in both theories

2. V for research ideas are borrowed from V for measurement

3. A few similar terms appear in both literatures with different meanings

4. V for research is involved in all V for measurement, yet the reverse is not true.

Validity for measurement: measuring an attribute by a test

Individuals are measured to make decisions

The more accurate the measurement, the better the decision

Professions are characterized in terms of the different attributes they need to measure

e.g.: manic depression: decide treatment

math achievement: place a student in a class

Ultimate purpose: improve the ratio of correct/incorrect decisions

Peculiarity of V for measurement

Results of a single test are used for multiple purposes or interpreted in terms of a variety of attributes. Is that possible?

Can each attribute be measured with sufficient accuracy?

How can it defend the use of results or support a claim to validity?

Kinds of V for measurement (Cronbach, 1949):

1. Logical V: based on logical analysis and content

2. Empirical V: based on empirical evidence such as the correlation between test scores and a criterion

Approaches to investigating V from 1950s to 1970s

1. Content V: derivative of logical V

2. Criterion V: derivative of empirical V (subdivided into concurrent and predictive V)

3. Construct V: scientific and last resort V (when neither could be relied on)

New kinds of V continue to be proposed even to the present day. This is bizarre, because since the mid-1980s it has been officially recognized that V is a unitary concept (Messick, 1980): there is only one kind of V, and that is construct V

Measurement in edu and psych

Assessment

Performance evaluation

Diagnosis

The term carries both technical and emotional baggage. It is used in its loosest sense and embraces anyone with a professional remit for measurement, assessment, evaluation and diagnosis

Test or score means: Any structured assessment of behavior

Any measurement procedure

A set of procedures to elicit, evaluate, or interpret a behavior

The outcome of a test will be summarized as a score, report, or profile to characterize an individual in terms of the attribute being measured

V is a quality of this procedure, so when it is valid it is tantamount to: thumbs-up, green light, stamp of approval.

Review:

To call a test or measurement procedure valid is to claim, based on evidence, that it can be used to measure a certain kind of attribute in order to make a certain type of decision in the future

1) What to measure?

A number of contenders are:

Characteristic to be measured (human)

Trait(human)

Disposition (human)

Construct: most frequent one

Attribute

Achievement, Attainment, Aptitude, attitude, proficiency, competence, etc.

Construct V: a terminological conundrum. If all of V is now construct V, then “construct” is redundant.

It is also misleading because it implies a tradition that is no longer credible in this century

A large amount of the literature on constructs in edu and psych is specific to measurement and construct V

There is no straightforward solution to this problem.

Attribute: Why:

1. Minimize the confusion

2. Reserve the “construct” for talk about construct V

Certain attributes are of significant importance in our field (with no universal agreement on their connotations):

Achievement (evaluative overtone)

Aptitude (innateness connotation)

Intelligence

Problem with Attribute names

They fall in and out of fashion

Their everyday names can change over time.

The scientific understanding of them may change

New names may be proposed for particular implications

For instance: SAT

1926: to assess academic readiness for university (Scholastic Aptitude Test)

Connotation of innate and fixed ability akin to intelligence

1990: no longer presented as measuring something innate (Scholastic Assessment Test)

Later: SAT retained as a name in its own right

For instance: end-of-course test

Achievement: evaluative overtone; accomplishment after following a course of instruction; what counts as mastery may be specific to one learner

Attainment: assumed neutral; an attempt to master particular learning outcomes; again specific to one individual

Competence or proficiency: capacity to do X, Y, or Z regardless of following a particular course of learning or instruction

V and R

Early definition: the degree to which a test measures what it is supposed to measure

Definition of R: consistency of outcome

V vs. R: accuracy vs. consistency

In the absence of consistency any claim to be able to measure accurately would be indefensible.

We might be consistently wrong, so consistency is not sufficient, but it is a necessary condition for high measurement quality. R is just one facet of V.
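The contrast between accuracy and consistency can be illustrated with a small numerical sketch. All numbers below are invented for illustration, and the names `spread` and `bias` are hypothetical helpers, not terms from the chapter. Spread across repeated measurements stands in for (lack of) consistency, and the gap between the mean measurement and the true value stands in for (lack of) accuracy.

```python
import statistics

# Assumed "true" level of the attribute for one person (illustrative only).
true_score = 50.0

# Hypothetical repeated measurements of that person with two instruments.
consistent_but_biased = [60.1, 59.8, 60.0, 60.2, 59.9]  # tight, but off target
inconsistent = [35.0, 62.0, 48.0, 71.0, 34.0]           # scattered widely

def spread(scores):
    """Population standard deviation: small spread means high consistency."""
    return statistics.pstdev(scores)

def bias(scores, true_value):
    """Mean error relative to the true value: small bias means high accuracy."""
    return statistics.mean(scores) - true_value

for name, scores in [("consistent but biased", consistent_but_biased),
                     ("inconsistent", inconsistent)]:
    print(name, "spread:", round(spread(scores), 2),
          "bias:", round(bias(scores, true_score), 2))
```

The first instrument is highly consistent yet wrong by about ten points every time, which is the "consistently wrong" case. The second averages out near the true value, but any single result is so scattered that a claim to accurate measurement would be hard to defend.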

The history of V in edu and psych field: Answers to two fundamental questions:

Different answers over the years to improve validation prax

1) What does it mean to claim V? (V theory)

2) How can a validity claim be substantiated? (validation prax)

V theory: No comprehensive coherent clear account of it

Mid-1950s: documents prepared by committees of measurement professionals from North America (a.k.a. technical recommendations or Standards):

encapsulated version of all official statements

Succinct guidance on validation, meaning of V

Not only well-developed but ambiguous

Product of NA committees became the lingua franca for the world

Product of compromise rather than universal satisfaction with poor validation prax

Main challenge for V theorists: move beyond the disparate heuristic principles of the Standards towards a comprehensive account of V

The field got closer to it from the 1970s to the 1990s through the scholarship of Messick: comprehensive yet unclear, long, dense, philosophically challenging and viscous

Validity accounts: Cureton 1951, Messick 1989

Test validation: Cronbach 1971

Validation: Kane 2006

Three phases of history (first classification):

1. pre- Trinitarian:

2. Trinitarian ( content V, criterion V, construct V) 1950s---1970s holy trinity

3. Unitarian

Newton and Shaw classification ( 5 key phases):

1. Mid 1800s—1920 Gestational period

2. 1921---1951 period of crystallization

3. 1952---1974 period of fragmentation

4. 1975---1999 period of reunification

5. 2000---2012 period of deconstruction

No sharp line between them, just a crude attempt to structure the course

Many of the transitions correspond to the publication of a new version of the Standards

They captured the zeitgeist between eras

Newton and Shaw classification ( 5 key phases):

They focus on

1. Conception of V

2. How to employ logical analyses and empirical evidence to substantiate a claim to V?

3. When to employ logical analyses and empirical evidence to substantiate a claim to V?

1) Gestational period (mid-1800s to 1920)

1. Structured assessment: better decision making, facilitating outcomes, fairer for individuals and useful for society: introduction of written and local examinations in the USA and England

2. More structure and more objectivity by the end of the 19th century: introduction of T/F, MC, completion, and standardized tests

3. Advances in statistical procedures, invention of the correlation coefficient, tests of mental capacities

4. Early years of the 20th century: the measurement movement; all sorts of tests of all sorts of attributes in all formats; success of tests for placement and selection

2) A period of crystallization (1921-1951)

Development of tests for many uses:

1. Test of edu achievement: judge students and schools

2. Test of intelligence: diagnose backwardness and excellence

3. Test of specific aptitude: vocational guidance

Importance of quality and control, seek consensus on the meaning of terms and procedures such as R and V

2) A period of crystallization (1921-1951)

How to establish a claim to V by 2 approaches:

1. Logical analysis of test content: a group of expert practitioners scrutinizes the content of the test and judges whether it matches the content of the curriculum.

2. Empirical evidence of correlation between the test and what was supposed to be measured

Key questions: What should the test results be correlated against? What criterion should be used to judge the results? Are expert judgments themselves valid?

2) A period of crystallization (1921-1951)

Validate a short standardized test of achievement against a long, comprehensive assessment of achievement that covers the full range of learning outcomes. A high correlation with the long assessment validates the short test as a measure of the full domain
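The empirical approach described above comes down to computing a correlation coefficient between two sets of scores. The sketch below is illustrative only: the score lists are invented, and `pearson_r` is a hand-written helper assumed for this example rather than anything taken from the chapter.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same eight students (illustrative data only).
short_test = [12, 15, 9, 18, 14, 7, 16, 11]        # short standardized test
long_assessment = [61, 74, 50, 88, 70, 41, 79, 58]  # long comprehensive assessment

r = pearson_r(short_test, long_assessment)
print("correlation between short test and long assessment:", round(r, 3))
```

On this reasoning, a coefficient close to 1 would be read as support for using the short test in place of the long assessment. The key question raised in this period still applies: the result is only as trustworthy as the criterion it is correlated against.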

Different communities molded the definition according to their interests:

1. Psychologists with an interest in aptitude prioritized empirical evidence of correlation

2. Educators with interest in achievement prioritized logical analysis of content

3) The fragmentation of V (1952-1974)

Publication of the first Standards in 1952, by an APA committee chaired by Cronbach, to govern the information provided by test producers

Previous classifications of V into types such as

1) Logical V and empirical V (1949)

2) Curricular V, statistical V, psychological and logical V (neither of the previous 2; arm-chair dissection of the total process) (1943)

3) 4 types of V: content, predictive, status, congruent V (1952)

Final publication 1954: content, predictive, concurrent, construct

3) The fragmentation of V (1952-1974)

Intention of construct V: for cases where neither logical analysis nor empirical evidence was regarded as sufficient.

Certain types of tests are evaluated in relation to a universe of content (content V)

Aptitude tests are evaluated in relation to a criterion measure

For other tests, such as personality tests, there was no yardstick, and a different procedure was needed

3) The fragmentation of V (1952-1974)

Construct V determines what psychological construct accounts for test performance

Construct: a postulated attribute that is manifest in test performance

It subsumed logical analysis, empirical evidence, and any other form of evidence that could be brought to bear on the psychological meaning of scores, so it is quintessentially scientific and relies on a theory

3) The fragmentation of V (1952-1974)

Second revision of the Standards in 1966: 4 types of V collapsed into 3

1. Content V

2. Criterion related V

3. Construct V

Third revision of standards in 1974

They are not mutually exclusive, but V theory and validation fragmented along these lines. “Validity types” were treated as alternative routes to validation. None is preferable to the others.

Problem: the definition of criterion V couldn’t be reconciled with the classic definition. Predictor tests were black boxes, and were irrelevant if they did not predict the criterion with accuracy

4) The reunification of V: the Messick years (1975-1999)

4th edition of standards in 1985 and 5th in 1999

All V ought to be understood as construct V.

Demolished the distinction between V for measurement and V for prediction.

3 fundamental imperatives for validation:

1. Establish that the criterion measure measured what it was supposed to measure

2. Establish that the aptitude test measured what it was supposed to measure

3. Establish a theoretical rationale and present evidence for the aptitude test

4) The reunification of V: the Messick years (1975-1999)

Before the 1970s: blind empiricism, and claims to V based on logical analysis

Messick upped the ante:

1. Test performance must be representative of learning outcomes.

2. Variance of test scores is attributable to construct relevant factors

3. Twin threats of construct underrepresentation and construct-irrelevant variance

Messick's triumph: validation as the integration of logical analysis and empirical evidence to substantiate the claim

Validation: Scientific laborious inquiry

Encouraged evaluators to accumulate as much evidence and analysis as they can, and not to stake a claim to V on a single study in isolation

Located ethics at the heart of V theory

(overemphasized the scientific evaluation of values, and downplayed ethical evaluation)

Scientific investigation of consequences from testing

Failed to provide a persuasive synthesis of science and ethics and left a rift between measurement professionals

5) The deconstruction of V (2000-2012)

1. Validation prax by argumentation: construct and defend V claims

Where to begin? (interpretation and use of score)

How to proceed? (make explicit claims and their assumptions)

When to stop? (coherent, complete argument with plausible inferences and assumptions)

Messick: emphasized sources of evidence and claim to V as overall evaluative judgment

Kane: emphasized the integration of sources within overall V argument, how to construct and defend claims, a methodology for subdividing the V into chunks

5) The deconstruction of V (2000-2012)

2. Development of new V theory

Strong rejection of Cronbach and Messick’s view of validation, which made it a truly epic, laborious and never-ending quest. On that view, validation was the development of theory relating one theoretical construct to others within a large network of theoretical constructs

5) The deconstruction of V (2000-2012)

To Kane, validation is dependent on the particular interpretation and use of results that the test user has in mind.

If the interpretation is simple, only a small amount of evidence is needed

3. Drew a distinction between observable and theoretical attributes: interpretations of theoretical constructs require scientific inquiry (traditional construct validation), while interpretations of observable attributes such as proficiency, vocabulary knowledge, … are far easier.

So deconstruction means downplaying the significance of theoretical constructs.

5) The deconstruction of V (2000-2012)

Cizek: no integration of scientific and ethical analysis is possible, since they are mutually incompatible arguments. That’s why there is a disjunction between the theory of V and the prax of validation. It is simply not feasible.

Denny Borsboom: V is not a property of interpretation of test score, but a property of test

Mitchel, Moss, Embretson

New framework to evaluate test policy by Newton and Shaw
