VALIDITY IN EDUCATIONAL ASSESSMENT: CHAPTER ONE (NEWTON & SHAW, 2014)
Maryam Bolouri
Topics to be discussed:
Validity definition
Types of Validity
The history of validity
Validity: 100 years of debate
Agreement (bland consensus) among scholars:
The hallmark of quality in testing
The most important criterion for evaluating a test
Discrepancies:
Different perspectives on the meaning of validity:
1) Measurement concept (original scientific definition in the 1920s)
2) Measurement and decision-making concept (between these 2 extremes, scientific and ethical)
3) Measurement, decision-making and impacts concept (social and ethical definition, concerns test use)
Why validity is so confusing:
1) Very large and disparate literature on V within edu and psych measurement
2) Very large and disparate literature on V within other disciplines
3) The official meaning of V has evolved over time
4) V has gained stature and expanded over time
5) Related accounts are hard to read
6) V is employed in different ways and contexts, so the intended meaning is often unclear. If a technical sense is intended, which discipline's sense is it?
Validity (definition)
Validity is the property or quality of being valid, true, cogent or legally acceptable (everyday definition)
Validity has been associated with different meanings across disciplines.
Validity theory: conceptual framework
Validation practice: investigation into V, or a process of investigation guided by V theory
It derives from the Latin word “validus”, meaning strong, healthy, worthy
Validity across disciplines:
Philosophy and logic: an argument is valid if and only if it is not possible for all its premises to be true while its conclusion is false. Validity here is a property of deductive arguments.
In this sense it relates not to validation itself but to the strength of a validity argument
Law and economics
Genetic testing
Management
Edu and psych measurement
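The logical definition listed above under philosophy and logic can be checked mechanically for small propositional arguments. The following is a minimal sketch, not part of the original slides: the function name is_valid and the two example arguments are illustrative assumptions. It searches every truth assignment for a counterexample, that is, a case where all premises are true and the conclusion is false.

```python
from itertools import product

def is_valid(premises, conclusion, n_vars):
    """Return True if no truth assignment makes every premise true
    while the conclusion is false (hypothetical helper)."""
    for values in product([True, False], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False  # counterexample found: the argument is invalid
    return True

# Modus ponens: from "p implies q" and "p", infer "q".
modus_ponens = is_valid(
    premises=[lambda p, q: (not p) or q, lambda p, q: p],
    conclusion=lambda p, q: q,
    n_vars=2,
)

# Affirming the consequent: from "p implies q" and "q", infer "p".
affirming_consequent = is_valid(
    premises=[lambda p, q: (not p) or q, lambda p, q: q],
    conclusion=lambda p, q: p,
    n_vars=2,
)

print(modus_ponens, affirming_consequent)
```

The first argument passes because no assignment defeats it; the second fails for p false and q true. Note that this checks only the form of an argument, not whether its premises are actually true.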
Validity for research or measurement?
V for research (Campbell, 1957): relevant to conclusions based on evidence, of 2 kinds:
Internal V: degree of confidence that the conclusion or observed effect is genuine for the experimental group and that the design rules out other, irrelevant explanations
External V: the degree of confidence that conclusions can be generalized from the experimental group to the intended population. It is decreased if the sampling is biased or the process itself causes the effect.
Validity for research
External V is subdivided into:
1. Population V: confidence in the generalization of conclusions across populations
2. Ecological V: confidence in the generalization of conclusions across conditions
a) Outcome V: across dependent variables
b) Temporal V: across time
c) Treatment V: across treatment variations
Cook and Campbell's 4-way classification, or 4 major decisions (1979):
Internal V is split into: statistical conclusion V + internal V
External V is split into: construct V + external V
Validity is confidence in the credibility of description and interpretation or the legitimacy of the produced knowledge
It is the social consequence of qualitative research
Lather's 3 main conceptions of V (1986):
1. Face V: member checking
2. Construct V: systematized reflexivity about the researcher's theory in response to the data
3. Catalytic V: facilitating the transformation of reality by reorienting, focusing, and energizing the participants
These two Vs are confusing:
1. Same key contributors in both theories
2. V for research ideas are borrowed from V for measurement
3. A few similar terms appear in both literatures with different meanings
4. V for research is involved in all V for measurement, yet the reverse is not true.
Validity for measurement: measuring an attribute by means of a test
Individuals are measured in order to make decisions
The more accurate the measurement, the better the decision
Professions are characterized in terms of the different attributes they need to measure
e.g.: measure manic depression to decide on treatment
measure math achievement to place a student in a class
Ultimate purpose: improve the ratio of correct/incorrect decisions
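The link between measurement accuracy and decision quality can be illustrated with a small, made-up placement example. This is only a sketch under stated assumptions: the scores, the cutoff of 50, and the function name decision_counts are all hypothetical, and a "correct" decision is taken to mean that the observed score falls on the same side of the cutoff as the true score.

```python
def decision_counts(true_scores, observed_scores, cutoff):
    """Count placement decisions that agree or disagree with the decision
    the (normally unknown) true score would have produced."""
    correct = sum(
        (t >= cutoff) == (o >= cutoff)
        for t, o in zip(true_scores, observed_scores)
    )
    incorrect = len(true_scores) - correct
    return correct, incorrect

# Hypothetical data: true attribute levels and scores from two tests.
true_scores   = [42, 48, 51, 55, 58, 63, 67, 70]
accurate_test = [43, 47, 52, 54, 59, 62, 68, 71]  # errors of 1 point
noisy_test    = [50, 41, 45, 62, 52, 70, 60, 75]  # errors of up to 8 points
cutoff = 50

accurate_result = decision_counts(true_scores, accurate_test, cutoff)
noisy_result = decision_counts(true_scores, noisy_test, cutoff)
print("accurate test (correct, incorrect):", accurate_result)
print("noisy test (correct, incorrect):", noisy_result)
```

With these invented numbers the more accurate test misplaces no one, while the noisier test misplaces two of the eight students. In practice true scores are not observable, which is why a claim to validity has to be argued from evidence.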
Peculiarity of V for measurement
Results of a single test are used for multiple purposes or interpreted in terms of a variety of attributes. Is this possible?
Can each attribute be measured with sufficient accuracy?
How can one defend the use of results or support a claim to validity?
Kinds of V for measurement (Cronbach, 1949):
1. Logical V: based on logical analysis and content
2. Empirical V: based on empirical evidence, such as the correlation between test scores and a criterion
Approaches to investigating V from the 1950s to the 1970s:
1. Content V: derivative of logical V
2. Criterion V: derivative of empirical V (subdivided into concurrent and predictive V)
3. Construct V: scientific and last-resort V (when neither of the others could be relied on)
New kinds of V continue to be proposed even to the present day, which is bizarre because since the mid-1980s it has been officially recognized that V is a unitary concept (Messick, 1980): there is only one kind of V, and that is construct V
Measurement in edu and psych
Assessment
Performance evaluation
Diagnosis
The term carries both technical and emotional baggage. It is used here in its loosest sense and embraces anyone with a professional remit for measurement, assessment, evaluation and diagnosis
Test or score means:
Any structured assessment of behavior
Any measurement procedure
A set of procedures to elicit, evaluate, or interpret a behavior
The outcome of a test is summarized as a score, report, or profile to characterize an individual in terms of the attribute being measured
V is a quality of this procedure, so calling it valid is tantamount to a thumbs-up, green light, or stamp of approval.
Review:
To say that a test or measurement procedure is valid is to claim, based on evidence, that the test can be used to measure a certain kind of attribute in order to make a certain type of decision in the future
1) What to measure?
A number of contenders are:
Characteristic to be measured (human)
Trait(human)
Disposition (human)
Construct: most frequent one
Attribute
Achievement, Attainment, Aptitude, attitude, proficiency, competence, etc.
Construct V: a terminological conundrum. If all of V is now construct V, then “construct” is redundant.
It is also misleading because it implies a tradition that is no longer credible in this century
A large amount of the literature on constructs in edu and psych is specific to measurement and construct V
There is no straightforward solution to this problem.
Attribute: Why:
1. Minimize the confusion
2. Reserve the “construct” for talk about construct V
Certain attributes are of significant importance in our field (with no universal agreement on their connotations):
Achievement (evaluative overtone)
Aptitude (innateness connotation)
Intelligence
Problem with Attribute names
They fall in and out of fashion
Their everyday names can change over time.
The scientific understanding of them may change
New names may be proposed for particular implications
For instance: SAT
1926: to assess academic readiness for university (Scholastic Aptitude Test)
Connotation of an innate and fixed ability akin to intelligence
1990: renamed to signal that it does not measure something innate (Scholastic Assessment Test)
Later: SAT retained as a name in its own right
For instance: end-of-course test
Achievement: evaluative overtone; accomplishment after following a course of instruction; may amount to mastery for a particular learner
Attainment: assumed neutral; an attempt to master particular learning outcomes; again specific to the individual
Competence or proficiency: capacity to do X, Y, or Z regardless of following a particular course of learning or instruction
V and R
Early definition of V: the degree to which a test measures what it is supposed to measure
Definition of R: consistency of outcome
V vs. R: accuracy vs. consistency
In the absence of consistency any claim to be able to measure accurately would be indefensible.
We might be consistently wrong, so consistency alone is not sufficient, but it is a necessary condition for high measurement quality. R is just one facet of V.
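The difference between accuracy and consistency can be made concrete with repeated measurements of a quantity whose true value is known. The sketch below uses invented readings, and the helper name summarize is an assumption for illustration. Bias (mean reading minus true value) stands in for accuracy, and the standard deviation of the readings stands in for consistency.

```python
from statistics import mean, pstdev

def summarize(readings, true_value):
    """Return (bias, spread): how far off the readings are on average,
    and how much they vary from one occasion to the next."""
    bias = mean(readings) - true_value
    spread = pstdev(readings)
    return bias, spread

true_value = 100  # hypothetical known value of the attribute

consistent_but_wrong = [110, 111, 109, 110, 110]  # reliable, not accurate
consistent_and_right = [100, 101, 99, 100, 100]   # reliable and accurate
inconsistent         = [80, 120, 95, 110, 95]     # unreliable

for name, readings in [
    ("consistent but wrong", consistent_but_wrong),
    ("consistent and right", consistent_and_right),
    ("inconsistent", inconsistent),
]:
    bias, spread = summarize(readings, true_value)
    print(f"{name}: bias={bias:.1f}, spread={spread:.2f}")
```

The first instrument is as consistent as the second yet is off by 10 points every time, which is the "consistently wrong" case. The third happens to average out to the true value, but any single reading may be far off, so a claim to measure accurately on one occasion would be indefensible.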
The history of V in the edu and psych field: answers to two fundamental questions
Different answers have been given over the years in order to improve validation practice
1) What does it mean to claim V? (V theory)
2) How can a validity claim be substantiated? (validation practice)
V theory: no comprehensive, coherent, clear account of it
Mid-1950s: documents prepared by committees of measurement professionals from North America (a.k.a. the Technical Recommendations or Standards):
Encapsulated version of all official statements
Succinct guidance on validation and the meaning of V
Neither well developed nor unambiguous
The product of the North American committees became the lingua franca for the world
A product of compromise rather than universal satisfaction, with poor validation practice
Main challenge for V theorists: move beyond the disparate heuristic principles of the Standards towards a comprehensive account of V
The field got closer to this from the 1970s to the 1990s through the scholarship of Messick: comprehensive yet unclear, long, dense, philosophically challenging, and viscous
Validity accounts: Cureton 1951, Messick 1989
Test validation: Cronbach 1971
Validation: Kane 2006
Three phases of history (first classification):
1. Pre-Trinitarian
2. Trinitarian (content V, criterion V, construct V), 1950s to 1970s: the holy trinity
3. Unitarian
Newton and Shaw classification (5 key phases):
1. Mid-1800s to 1920: gestational period
2. 1921 to 1951: period of crystallization
3. 1952 to 1974: period of fragmentation
4. 1975 to 1999: period of reunification
5. 2000 to 2012: period of deconstruction
There is no sharp line between them; this is just a crude attempt to structure the course of the history
Many of the transitions correspond to the publication of a new version of the Standards, which captured the zeitgeist between eras
Newton and Shaw focus on:
1. Conception of V
2. How to employ logical analyses and empirical evidence to substantiate a claim to V?
3. When to employ logical analyses and empirical evidence to substantiate a claim to V?
1) Gestational period (mid-1800s to 1920)
1. Structured assessment: better decision making, facilitating outcomes, fairer for individuals and useful for society; introduction of written and local examinations in the USA and England
2. More structure and more objectivity by the end of the 19th century: introduction of T/F, MC, completion, and standardized tests
3. Advances in statistical procedures, invention of the correlation coefficient, tests of mental capacities
4. Early years of the 20th century: the measurement movement, with all sorts of tests of all sorts of attributes in all formats; success of tests for placement and selection
2) A period of crystallization (1921-1951)
Development of tests for many uses:
1. Tests of edu achievement: judge students and schools
2. Tests of intelligence: diagnose backwardness and excellence
3. Tests of specific aptitude: vocational guidance
Importance of quality control; consensus sought on the meaning of terms and procedures such as R and V
2 approaches to establishing a claim to V:
1. Logical analysis of test content: a group of expert practitioners scrutinizes the content of the test and judges whether it matches the content of the curriculum.
2. Empirical evidence of correlation between the test and what was supposed to be measured
Key questions: What should the test results be correlated against? What criterion should be used to judge the results? Are expert judgments valid?
Example: validate a short standardized test of achievement against a long, comprehensive assessment of achievement that covers the full range of learning outcomes. A high correlation with the long assessment validates the short test as a measure of the full domain
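The correlational check described above can be sketched with a Pearson correlation coefficient between scores on the short test and scores on the long assessment. The score lists are invented for illustration and pearson_r is a hand-written helper, so this is a minimal sketch of the idea rather than a prescribed validation procedure.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same six students on both instruments.
short_test      = [12, 15, 9, 18, 14, 7]     # short standardized test
long_assessment = [61, 72, 50, 88, 70, 41]   # long comprehensive assessment

r = pearson_r(short_test, long_assessment)
print(f"correlation between short test and long assessment: r = {r:.3f}")
```

A coefficient close to 1 would be read, on this early view, as support for treating the short test as a measure of the full domain. The key question raised above still applies: the result is only as trustworthy as the criterion the test is correlated against.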
Different communities molded the definition according to their interests:
1. Psychologists with an interest in aptitude prioritized empirical evidence of correlation
2. Educators with interest in achievement prioritized logical analysis of content
3) The fragmentation of V (1952-1974)
Publication of the first Standards in 1952, by an APA committee chaired by Cronbach, to govern the information provided by test producers
Previous classifications of V into types such as:
1) Logical V and empirical V (1949)
2) Curricular V, statistical V, psychological and logical V (neither of the previous 2; arm-chair dissection of the total process) (1943)
3) 4 types of V: content, predictive, status, congruent V (1952)
Final publication, 1954: content, predictive, concurrent, construct
Intended role of construct V: for cases where neither logical analysis nor empirical evidence was regarded as sufficient.
Certain types of tests are evaluated in relation to a universe of content (content V)
Aptitude tests are evaluated in relation to a criterion measure
For other tests, such as personality tests, there was no yardstick and a different procedure was needed.
Construct V determines what psychological construct accounts for test performance
Construct: a postulated attribute that is manifest in test performance
It subsumed logical analysis, empirical evidence, and any other form of evidence that could be brought to bear on the psychological meaning of scores, so it is quintessentially scientific and relies on theory
Second edition of the Standards in 1966: 4 types of V collapsed into 3
1. Content V
2. Criterion related V
3. Construct V
Third edition of the Standards in 1974
The types are not mutually exclusive, but V theory and validation practice fragmented along these lines: “validity types” came to be treated as alternative approaches to validation. None is preferable to the others.
Problem: the definition of criterion V could not be reconciled with the classic definition. Predictor tests were treated as black boxes, irrelevant unless they predicted the criterion with accuracy
4) The reunification of V: the Messick years (1975-1999)
4th edition of the Standards in 1985 and 5th in 1999
All V ought to be understood as construct V.
Demolished the distinction between V for measurement and V for prediction.
3 fundamental imperatives for validation:
1. Establish that the criterion measure measured what it was supposed to measure
2. Establish that the aptitude test measured what it was supposed to measure
3. Establish a theoretical rationale, and present evidence, for the aptitude test
Before the 1970s: blind empiricism, and claims to V based on logical analysis
Messick upped the ante:
1. Test performance must be representative of learning outcomes.
2. Variance in test scores must be attributable to construct-relevant factors
3. Twin threats of construct underrepresentation and construct-irrelevant variance
Messick's triumph:
Validation: integration of logical analysis and empirical evidence to substantiate the claim
Validation: scientific, laborious inquiry
Encouraged evaluators to accumulate as much evidence and analysis as they can, and not to stake a claim to V on a single study in isolation
Located ethics at the heart of V theory
(but overemphasized the scientific evaluation of values and downplayed ethical evaluation)
Scientific investigation of the consequences of testing
Failed to provide a persuasive synthesis of science and ethics, and left a rift between measurement professionals
5) The deconstruction of V (2000-2012)
1. Validation practice by argumentation: construct and defend V claims
Where to begin? (interpretation and use of scores)
How to proceed? (make the claims and their assumptions explicit)
When to stop? (coherent, complete argument with plausible inferences and assumptions)
Messick: emphasized sources of evidence and the claim to V as an overall evaluative judgment
Kane: emphasized the integration of sources within an overall V argument, how to construct and defend claims, and a methodology for subdividing V into chunks
2. Development of new V theory
Strong rejection of Cronbach's and Messick's view of validation as a truly epic, never-ending, laborious quest. On that view, validation was the development of theory relating one theoretical construct to others within a large network of theoretical constructs
To Kane, validation depends on the particular interpretation and use of results that the test user has in mind.
If the interpretation is simple, only a small amount of evidence is needed
3. Drew a distinction between observable and theoretical attributes: interpretations in terms of theoretical constructs require scientific inquiry (traditional construct validation), while interpretations in terms of observable attributes such as proficiency, vocab knowledge, … are far easier to validate.
So deconstruction means downplaying the significance of theoretical constructs.
Cizek: no integration of scientific and ethical analysis is possible since they are mutually incompatible arguments. That is why there is a disjunction between the theory of V and the practice of validation. It is simply not feasible.
Denny Borsboom: V is not a property of the interpretation of test scores, but a property of the test
Mitchel, Moss, Embretson
New framework for evaluating testing policy, by Newton and Shaw