Presentacion Daniel Koretz
Transcript of Presentacion Daniel Koretz
-
8/10/2019 Presentacion Daniel Koretz
1/67
Using tests for monitoring and accountability
Prof. Daniel Koretz
Harvard Graduate School of Education
Agencia de Calidad de la Educación, Santiago, Chile
November 3, 2014
-
Two basic questions posed by the Agencia
de Calidad de la Educación
Value and limitations of standardized tests as
indicators of the quality of education
The use of tests to improve learning
What can go wrong when tests are used for
accountability?
-
Two basic questions posed by the Agencia
de Calidad de la Educación, revised
Value and limitations of standardized tests as
indicators of the performance of students (rather than
"the quality of education")
Test scores describe what students can do; they
do not explain why they can do it
The use of tests to improve learning
What can go wrong when tests are used for
accountability?
-
Topics
Session 1:
What is a test? The sampling principle of testing
Examples of design choices for different purposes
Session 2:
What are the risks of test-based accountability?
Undesirable changes in instruction
Score inflation
Why do these occur?
-
Part I: The value and limitations of
standardized tests
-
What is a test?
A test is a very small sample of behavior from the student
It is valuable only to the extent that it lets us estimate
mastery of the domain (the knowledge and skills it represents)
For example, 40 or 50 test items are often used to
estimate something like cumulative mastery of
mathematics through grade 8
-
Sampling to obtain a test
1. Goals of education
2. Student achievement | Other
3. Domains selected for testing | Untested domains
4. Tested parts of selected domains | Untested portions of domains
5. Tested sample | Untested sample
-
The sampling principle of testing:
analogy of a political poll
In September, a Connecta poll of 601 people predicted
second round results: 57,6% for Bachelet, 23,1% for
Matthei, 19,3% other or don't know
Actual second-round vote: 62,2% for Bachelet, 37,8%
for Matthei
Would you have cared how those particular 601 people
actually voted?
Why is information from those 601 people valuable?
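The value of those 601 responses comes from sampling: under simple random sampling (an assumption here; the slide does not describe the poll's design), the uncertainty of a reported share depends on the sample size, not on who the particular respondents were. A minimal Python sketch of the usual 95% margin of error for a proportion, using the 57,6% figure from the slide; the function name is illustrative only:

```python
import math

def margin_of_error(share, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion,
    assuming simple random sampling (an assumption, not from the slide)."""
    return z * math.sqrt(share * (1.0 - share) / n)

# Figures from the slide: 57.6% for Bachelet among 601 respondents.
moe = margin_of_error(0.576, 601)
print(f"57.6% +/- {100 * moe:.1f} points")
```

Sampling error is only one reason a poll and a vote can differ; the poll's "other, don't know" category is another, which parallels the design decisions that remain in testing after content has been sampled.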
-
How is a test not like a poll?
Similar: in polling, we sample people; in testing, we sample content
Not similar: in testing, we have other decisions after
sampling content, for example:
How is content represented on the test?
Graphically? Verbally? What item format?
What are the task demands (for example, scoring
rubrics)?
-
How similar are tested representations?
Calculate the area of basic polygons drawn on a coordinate plane
-
What are the consequences of incomplete sampling?
All cases:
Systematically incomplete evaluation of education
Low pressure testing: modest effects on scores
Measurement error (uncertainty): fluctuations in scores
Differences in results among tests: sometimes small,
but occasionally large, for example, TIMSS vs. PISA
High pressure (accountability): very large effects
Incentives to focus preparation on the tested sample,
not the domain
Narrowed instruction, bad test preparation
Score inflation
-
Why use standardized tests
Standardized tests were originally designed to provide
supplementary, specialized information that teachers do not have already
Tests provide information that is consistent across
classrooms, schools, and years (unlike teachers'
grades)
Tests are very efficient: they provide substantial
information in a short amount of time
Tests can be designed to support a number of different
uses
-
Topics
Session 1:
What is a test? The sampling principle of testing
Examples of design choices for different purposes
Session 2:
What are the risks of test-based accountability?
Undesirable changes in instruction
Score inflation
Why do these occur?
-
Some common uses of standardized tests
Monitor performance of a national or regional system
Evaluate performance relative to standards or
normatively (relative to the performance of others)
Provide pedagogically useful information to educators,
for example, formative assessment results
Hold educators accountable for student performance
-
Design trade-offs
Different purposes for tests suggest different designs
Designing a test to be better for one function may
make it worse for other functions
For example, a test designed for summative evaluation is often poorly designed to provide instructional
feedback
Sometimes, using a test for one purpose will make it less valuable for others
Session 2: score inflation from accountability
-
Difficulty
Items that are too hard or too easy for a student will
produce an unreliable score
Tests that are too hard or too easy may have floor or
ceiling effects
Cannot accurately show differences or changes in performance
Will distort trends
So tests best suited for low-performing students may be poorly suited for high-performing students
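A small sketch of how a ceiling effect can distort a trend. All numbers are hypothetical, and the ceiling is modeled simply by capping each score at the test maximum; real ceiling effects are less clean than this.

```python
def observed_scores(true_scores, max_score):
    """Cap each score at the test's maximum (a simple model of a ceiling effect)."""
    return [min(s, max_score) for s in true_scores]

def mean(values):
    return sum(values) / len(values)

MAX_SCORE = 45                             # hypothetical test maximum
year1_true = [30, 35, 40, 45]              # hypothetical achievement, year 1
year2_true = [s + 5 for s in year1_true]   # every student improves by 5 points

true_gain = mean(year2_true) - mean(year1_true)
observed_gain = (mean(observed_scores(year2_true, MAX_SCORE))
                 - mean(observed_scores(year1_true, MAX_SCORE)))
print(true_gain, observed_gain)
```

In this toy case the observed gain is smaller than the true gain because the top students have no room to show improvement.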
-
Raw scores from a test that has become too easy
-
Sampling of students and schools
To hold schools accountable, tests must be frequent, and many or all students must be tested
To hold teachers accountable, all students must be
tested with the same items
To monitor the system, you can test less often and use
sparse matrix sampling:
Only a sample of students tested in each school
Students are given different items, so the content
can be broader
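A minimal sketch of the matrix-sampling idea: only some students are tested, and item blocks are rotated among them so that the school as a whole covers more content than any one student sees. The student names, item labels, block sizes, and the function `matrix_sample` are all hypothetical; operational designs are considerably more elaborate.

```python
import random

def matrix_sample(students, item_blocks, sample_size, seed=0):
    """Draw a random sample of students and rotate item blocks among them.

    Returns a dict mapping each sampled student to one block of items.
    """
    rng = random.Random(seed)
    sampled = rng.sample(students, sample_size)
    return {s: item_blocks[i % len(item_blocks)] for i, s in enumerate(sampled)}

# Hypothetical school: 30 students, three blocks of four items each.
students = [f"student_{i:02d}" for i in range(30)]
item_blocks = [
    ["A1", "A2", "A3", "A4"],
    ["B1", "B2", "B3", "B4"],
    ["C1", "C2", "C3", "C4"],
]

assignment = matrix_sample(students, item_blocks, sample_size=9)
items_covered = {item for block in assignment.values() for item in block}
print(len(assignment), "students tested;", len(items_covered), "items covered")
```

Each sampled student answers four items, yet twelve items are covered across the school. The price is that students cannot be compared on a common set of items.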
-
The design of test items for different purposes
For summative purposes, it is enough to know whether students can do something
For formative purposes, you want to know why some
students are unable to perform the task
Knowing why students fail allows teachers to
modify instruction
May show which incorrect ideas cause students to
make mistakes
May break a complex skill into its constituent parts
-
Ideal items for a formative test
Should provide information for improving instruction that teachers may not have without the test
Should reveal the sources of errors
Should not look like the items on the summative test
If the items are similar to the summative test, this
will encourage narrowing of instruction and score
inflation
-
A diagnostic item for elementary fractions
In which of the following diagrams is one-quarter of the area shaded?
Source: National Council of Teachers of Mathematics, http://www.nctm.org/news/content.aspx?id=11474
Tells you why a student answers incorrectly, not just whether she answers incorrectly
-
Reporting for different purposes
For summative purposes, a single score may tell you which groups are performing better than others
For pedagogical purposes, educators need more detail
about different aspects of performance, to know where improvements are needed
-
An old-fashioned option for pedagogically
useful information: norm-referenced reporting
How do you know that:
2,5 minutes/km is a fast time for a runner?
4,8 l/100 km is good gasoline economy?
We compare to the distribution of speed or economy
Norm-referenced reporting compares each score to a
relevant distribution of scores
Norm-referenced reporting offers teachers:
A basis for evaluating their own expectations
A way to compare performance across different areas
(are my students better at computation than at
problem-solving?)
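A minimal sketch of norm-referenced comparison across areas, using one common definition of percentile rank (the percentage of the norm group scoring at or below a given score). The norm-group scores and class averages are hypothetical, not real data.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring at or below `score`."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)

# Hypothetical norm-group scores on two subtests (not real data).
computation_norms = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
problem_solving_norms = [8, 10, 12, 14, 15, 17, 19, 21, 24, 28]

# Hypothetical class averages on the same two subtests.
computation_rank = percentile_rank(27, computation_norms)
problem_solving_rank = percentile_rank(14, problem_solving_norms)
print(computation_rank, problem_solving_rank)
```

The raw scores 27 and 14 are not comparable across subtests, but their positions in the two norm distributions are. In this toy case the class stands higher in computation than in problem-solving.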
-
An example of a report from a norm-referenced test
-
Compare level of detail to the previous slides
-
Summary: what tests can do
Standardized tests can provide important information to policymakers and educators with small demands on
time
No one test can serve every goal: design must be matched to purpose
To some degree, the various purposes compete and
create conflicting demands for design
To resolve this, we need either a clear choice among
uses or multiple tests
-
What tests cannot do
Scores cannot provide a complete evaluation of a program or school
Some important goals are omitted
Scores taken alone do not isolate the contributions of teachers or schools
Test scores describe; they do not explain
Many factors other than schooling influence scores
Efforts to separate the effects of schooling are
complex and controversial
-
Topics
Session 1:
What is a test? The sampling principle of testing
Examples of design choices for different purposes
Session 2:
What are the risks of test-based accountability?
Undesirable changes in instruction
Score inflation
Why do these occur?
-
What we learned from the US experience
Effects on educational practice are mixed
Some improvements
Many undesirable effects: bad test preparation,
other gaming
Scores can become severely inflated (increase much more than actual learning)
Overall improvement is exaggerated, often severely
Relative effectiveness is estimated incorrectly
Teachers, schools, and systems ranked
incorrectly
Can create an illusion of greater equity
-
What we don't know
What is the net effect on student achievement?
Weak research designs, weaker data
Some evidence of inconsistent, modest effects in
elementary math, none in reading
Effects are likely to vary across contexts
Which types of test-based accountability systems are
best?
Which programs maximize real improvements?
Which programs minimize gaming, bad test preparation, and score inflation?
Reason: grossly inadequate research and evaluation
-
Campbell's Law (1975)
The more any quantitative social indicator is
used for social decision making, the more
subject it will be to corruption pressures and
the more apt it will be to distort and corrupt the
social processes it is intended to monitor.
Donald T. Campbell (1975). Assessing the impact of planned social change. In G. M. Lyons (Ed.), Social Research and Public Policies: The Dartmouth/OECD Conference.
-
Campbell's Law in testing
Raising scores becomes the primary goal
Educators find ways to raise scores on the specific test
used for accountability
Scores are inflated: they increase more than learning
Overall improvement is exaggerated
Relative improvement is estimated incorrectly, and
schools are ranked incorrectly
-
Logic of studies of score inflation
Scores are meaningful only if they generalize to the
domain
A poll is useful only if its results generalize to the
entire electorate
If gains generalize to the domain, they must generalize to other tests of the same domain
Gains on a high-stakes test should generalize to a
lower-stakes audit test
If a poll is accurate, other good polls will show similar results
-
Good versus bad preparation for a test
Good: gives students knowledge and skills that they can apply elsewhere
In later education
In later employment
Therefore, on other tests
Bad: generates score inflation: test-specific gains that
do not generalize beyond that test
-
Performance on coached and uncoached tests
[Figure: grade equivalents (axis from 3.0 to 4.4) by year, 1985 to 1991. Chart labels: Test C, Test B. Series: District tests; Koretz, et al., test.]
SOURCE: Adapted from Koretz, Linn, Dunbar, and Shepard (1991)
-
Reading change, grade 4 KIRIS and NAEP,
1992-1994
                        KIRIS    NAEP
Gain in scale scores     18.8      -1
Standardized gain        0.76   -0.03
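A standardized gain expresses a change in scale scores in standard-deviation units, so that tests reported on different scales can be compared. The slide does not state which standard deviation was used, so the sketch below only backs out the value implied by the two reported KIRIS figures; because those figures are rounded, the result is approximate. The function names are illustrative.

```python
def standardized_gain(gain, sd):
    """Gain in scale-score points divided by the scale's standard deviation."""
    return gain / sd

def implied_sd(gain, std_gain):
    """Standard deviation implied by a reported gain and standardized gain."""
    return gain / std_gain

# Reported on the slide: KIRIS gain of 18.8 points, standardized gain of 0.76.
kiris_sd = implied_sd(18.8, 0.76)
print(round(kiris_sd, 1))
print(round(standardized_gain(18.8, kiris_sd), 2))
```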
-
Trends by Race on New York State vs. NAEP
Standardized Mean Scale Scores by Race on 8th Grade Math
-
Inconsistency between school ratings calculated from
high-stakes and lower-stakes tests:
Best and worst cases from 48 correlations across grade, year, subject, and model
Worst: correlation = .27, reading, 2000
Best: correlation = .63, math, 2000
-
Topics
Session 1:
What is a test? The sampling principle of testing
Examples of design choices for different purposes
Session 2:
What are the risks of test-based accountability?
Undesirable changes in instruction
Score inflation
Why do these occur?
-
Why inflation occurs
Tests show predictable emphases, omissions, and
forms of presentation over time.
Some are intentional, for technical reasons
Some are accidental, or to save time and money
Test preparation can focus on these patterns:
Focusing instruction on emphasized content, at the
cost of other content relevant to the inference
Focusing on incidental characteristics of the test,
such as item format
-
Ways to raise scores
Teaching more
Working harder
Working more effectively
Reallocation
Coaching
Cheating
Changing who is tested
-
More detail on sampling in constructing a test
1. Domain selected for testing (math, ELA, etc.)
2. Elements from domain included in standards | Elements from domain omitted from standards
3. Tested subset of standards | Untested subset of standards
4. Tested material from within tested standards | Untested material from within tested standards
5. Tested representations | Untested representations
-
Reallocation
Shifting instructional resources to fit the testing program
Within a subject
Between subjects
Within a subject, can lead to either meaningful change
or inflation
Inflates if material getting decreased emphasis is
also important for the inference
Narrowed instruction is a type of reallocation
-
Coaching
Focusing preparation on substantively unimportant details of the test
Minor, unimportant details of content
Details of the presentation of material
Includes test-taking tricks (e.g., process of elimination,
plug-in)
Can inflate scores or simply waste time
-
How similar are tested representations?
2008 item, New York grade 7 math test
Which tool is most appropriate for measuring the mass of a serving of cheese?
a. ruler
b. thermometer
c. measuring cup
d. weighing scale
-
2009 item, New York grade 7 math test
Which tool would be the most appropriate for Natasha to use when finding the mass of a watermelon?
a. scale
b. inch ruler
c. meter stick
d. measuring cup
-
How similar are tested representations?
NY 7N7: Compare numbers written in scientific notation.
-
An example of coaching (cheating?)
The question on the review sheet for [the] exam reads in part:
The average amount that each band member must
raise is a function of the number of band members, b,
with the rule f(b)=12000/b.
The question on the actual test reads in part:
The average amount each cheerleader must pay is a
function of the number of cheerleaders, n, with the rule f(n)=420/n.
Strauss, V., The Washington Post, July 10, 2001, p. A09
-
Recommendation 1: make the evaluation
and accountability system broad
Do not rely only or excessively on standardized tests
Evaluate other outcomes
Evaluate practices as well as outcomes
May need to use subjective as well as objective
measures
-
Recommendation 2: couple evaluation and
accountability with training and support
Many teachers need help, not just incentives, to
improve instruction
Provide support, for example, training for better
teaching
-
Recommendation 3: Use summative
tests appropriately
Report in detail
Show teachers what needs improving, not just how
high or low they score
Add formative tests for difficult topics, if possible
Set realistic performance targets that teachers can
reach by appropriate methods
Creates less incentive to use bad test preparation
-
Supplementary slides
-
Math trends, KIRIS and ACT
[Figure: gains in standard deviation units (axis from -0.1 to 0.7) by year, 1992 to 1995. Series: KIRIS; ACT.]
-
Standardized mathematics gains
in Kentucky, 1992-1996
           KIRIS   NAEP
Grade 4     0.61   0.17
Grade 8     0.52   0.13
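A minimal sketch of the audit-test comparison applied to the Kentucky figures above. It computes the part of the KIRIS gain not matched on NAEP and the share of the gain that did appear on NAEP. Reading the entire gap as inflation is a simplification made for illustration, since the two tests also differ in content and design; the function name is hypothetical.

```python
# Standardized mathematics gains in Kentucky, 1992-1996, as reported on the slide.
gains = {
    "Grade 4": {"KIRIS": 0.61, "NAEP": 0.17},
    "Grade 8": {"KIRIS": 0.52, "NAEP": 0.13},
}

def audit_comparison(high_stakes_gain, audit_gain):
    """Return (gap, share): gap is the high-stakes gain not matched on the
    audit test; share is the fraction of the gain that did generalize."""
    gap = round(high_stakes_gain - audit_gain, 2)
    share = round(audit_gain / high_stakes_gain, 2)
    return gap, share

results = {g: audit_comparison(v["KIRIS"], v["NAEP"]) for g, v in gains.items()}
for grade, (gap, share) in results.items():
    print(f"{grade}: gap = {gap} SD, share generalizing = {share}")
```

On these figures, roughly a quarter of the gain on the high-stakes test shows up on the lower-stakes audit test in each grade.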
-
One look at the TX miracle (Klein, et al. 2000)
-
Item from G8 MCAS
Eva has four sets of straws. The
measurements of the straws are given
below. Which set of straws could not be
used to form a triangle?
A. Set 1: 4 cm, 4 cm, 7 cm
B. Set 2: 2 cm, 3 cm, 8 cm
C. Set 3: 3 cm, 4 cm, 5 cm
D. Set 4: 5 cm, 12 cm, 13 cm
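The item turns on the triangle inequality: three lengths form a triangle only if the two shorter ones sum to more than the longest. A short Python check of the four sets of straws, with an illustrative function name; the slide itself does not state the answer key:

```python
def can_form_triangle(a, b, c):
    """Triangle inequality: the two shorter sides must sum to more than the longest."""
    x, y, z = sorted((a, b, c))
    return x + y > z

# Straw sets from the item (lengths in cm).
straw_sets = {
    "A": (4, 4, 7),
    "B": (2, 3, 8),
    "C": (3, 4, 5),
    "D": (5, 12, 13),
}

cannot = [label for label, sides in straw_sets.items()
          if not can_form_triangle(*sides)]
print(cannot)
```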
-
School Classification
DIMENSIONS / INDICATORS                          WEIGHT
Learning standards outcomes
  Learning standards                             67,0 %
  Simce scores                                    3,3 %
  Simce trend                                     3,3 %
Other quality indicators
  School motivation and academic self-esteem      3,3 %
  School climate                                  3,3 %
  Civic participation and citizenship             3,3 %
  Healthy habits                                  3,3 %
  Attendance                                      3,3 %
  Gender equity                                   3,3 %
  School retention                                3,3 %
T h i l f i l 3 3 %