Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame
![Page 1: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/1.jpg)
Andy Hegedus, Ed.D.
June 2012
Assessment Literacy in a
Teacher Evaluation Frame
![Page 2: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/2.jpg)
• How many of you think your literacy with assessments in general is “Good” or better?
• How many of you are currently figuring out how to use assessment data thoughtfully in a Teacher Evaluation process?
Trying to gauge my audience and adjust my speed . . .
![Page 3: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/3.jpg)
• What we’ve known to be true is now being shown to be true
– Using data thoughtfully improves student achievement
• There are dangers present, however
– Unintended consequences
Go forth thoughtfully with care
![Page 4: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/4.jpg)
“What gets measured (and attended to), gets done”
Remember the old adage?
![Page 5: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/5.jpg)
• NCLB
– Cast light on inequities
– Improved performance of “Bubble Kids”
– Narrowed taught curriculum
An infamous example
![Page 6: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/6.jpg)
It’s what we do that counts
A patient’s health doesn’t change because we know their blood pressure
It’s our response that makes all the difference
![Page 7: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/7.jpg)
Our nation has moved from a model of education reform that focused on fixing schools to a model that is focused on fixing the teaching profession
Data Use in Teacher Evaluation is our construct for today
![Page 8: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/8.jpg)
Be considerate of the continuum of stakes involved
Support
Compensate
Terminate
Increasing levels of required rigor
Increasing risk
![Page 9: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/9.jpg)
• Growth
– Depiction of progress over time along a cross-grade scale
• Value-Added
– A determination of whether growth is greater for a particular student or group of students than would be expected
Let’s get clear on terms
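The two terms can be made concrete with a small sketch. It assumes growth is the simple difference between two scores on the same cross-grade scale and that "expected" growth comes from some external norm. The function names and all numbers are illustrative and do not come from the presentation.

```python
from statistics import mean

def growth(score_time1, score_time2):
    """Progress over time: change in score on a common cross-grade scale."""
    return score_time2 - score_time1

def value_added(observed_growth, expected_growth):
    """Average amount by which observed growth exceeds expected growth.

    Both arguments are equal-length lists, one entry per student.
    A positive result means the group grew more than expected.
    """
    differences = [obs - exp for obs, exp in zip(observed_growth, expected_growth)]
    return mean(differences)

# Hypothetical fall and spring scores for three students.
fall = [200, 185, 215]
spring = [208, 190, 225]
observed = [growth(f, s) for f, s in zip(fall, spring)]
expected = [6, 6, 6]  # assumed norm-based expectation, for illustration only
print(value_added(observed, expected))
```

Real value-added models estimate the expectation statistically rather than taking it as given; the sketch only separates the two ideas.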
![Page 10: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/10.jpg)
Marcus Normal Growth Needed Growth
Marcus’ growth
College readiness standard
![Page 11: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/11.jpg)
Is the progress produced by this teacher dramatically different than teaching peers who deliver instruction to comparable students in comparable situations?
What question is being answered in support of using data in evaluating teachers?
![Page 12: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/12.jpg)
The Test
The Growth Metric
The Evaluation
The Rating
There are four key steps required to answer this question
![Page 13: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/13.jpg)
The purpose and design of the instrument is significant
• Many assessments are not designed to measure growth
• Others do not measure growth equally well for all students
![Page 14: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/14.jpg)
Both Status and Growth are important
[Figure: one student's scores (x) at Time 1 and Time 2 on a scale running from Beginning Literacy to Adult Reading, with the 5th Grade level marked; labels: Status, Growth]
Value Added = Teacher Contribution to Growth
![Page 15: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/15.jpg)
Teachers encounter a distribution of student performance
[Figure: a class of students (x) spread along a scale running from Beginning Literacy to Adult Reading, with the 5th Grade level marked; label: Grade Level Performance]
Norm = “Typical” for a reference population
![Page 16: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/16.jpg)
Traditional assessment uses items reflecting the grade level standards
[Figure: scale running from Beginning Literacy to Adult Reading with 4th, 5th, and 6th Grade marked; labels: Grade Level Standards, Traditional Assessment Item Bank]
![Page 17: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/17.jpg)
Traditional assessment uses items reflecting the grade level standards
[Figure: scale running from Beginning Literacy to Adult Reading with 4th, 5th, and 6th Grade marked, each with its own Grade Level Standards band]
Overlap allows linking and scale construction
![Page 18: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/18.jpg)
Adaptive testing works differently
Item bank can span full range of achievement
![Page 19: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/19.jpg)
Available item pool depth is crucial
[Figure: Est. RIT; legend: Correct, Incorrect]
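A rough sketch of why pool depth matters: an adaptive test moves its estimate up after a correct answer and down after an incorrect one, so each step needs an item near the current estimate. The code below is a simplified staircase procedure chosen for illustration. It is not NWEA's scoring algorithm, and the starting value, step size, and shrink factor are assumptions.

```python
def staircase_estimates(start, responses, step=10.0, shrink=0.5):
    """Return the running estimate after each response.

    responses: iterable of booleans, True for a correct answer.
    """
    estimate = start
    path = [estimate]
    for correct in responses:
        # Move up after a correct answer, down after an incorrect one.
        estimate += step if correct else -step
        # Take smaller steps as evidence accumulates.
        step *= shrink
        path.append(estimate)
    return path

# Hypothetical response pattern for one student.
print(staircase_estimates(200, [True, True, False, True]))
```

If the item pool has no items near one of these intermediate estimates, the test cannot follow the student, which is the sense in which depth is crucial.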
![Page 20: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/20.jpg)
Tests are not equally accurate for all students
[Figure panels: California STAR; NWEA MAP]
![Page 21: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/21.jpg)
These differences impact measurement error
[Figure: Information (.00 to .12) against Scale Score (165 to 245) for an Adaptive Test and a Traditional Test; performance bands: Academic Warning, Below, Meets, Exceeds; labels: 5th Grade Level Items, Significantly Different Error, 1st, 86th]
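The link between information and measurement error can be sketched numerically. This assumes the plotted quantity is test information in the item response theory sense, where the standard error of a score estimate is one over the square root of the information. The information values below are read loosely from the axis range and are illustrative only.

```python
import math

def standard_error(information):
    """Standard error of a score estimate given test information (IRT)."""
    if information <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(information)

# Illustrative values within the .00 to .12 axis range.
for info in (0.10, 0.04, 0.02):
    print(f"information={info:.2f} -> standard error of about "
          f"{standard_error(info):.1f} scale points")
```

Where a test provides little information, typically far from the difficulty of its items, the error band around a student's score widens.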
![Page 22: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/22.jpg)
• Think of a high stakes test – State Summative
– Designed to identify if a student is proficient or not
• Do they do that well?
– 93% correct on Proficiency determination
• Does it go off design well?
– 75% correct on Performance Levels determination
Error can change your life!
*Testing: Not an Exact Science, Education Policy Brief, Delaware Education Research & Development Center, May 2004, http://dspace.udel.edu:8080/dspace/handle/19716/244
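One plausible reading of the 93% versus 75% figures is that every additional cut score puts more students close to a boundary, where error matters most. The sketch below assumes normally distributed measurement error with a known standard error of measurement (SEM). The function name and the numbers are hypothetical.

```python
from statistics import NormalDist

def misclassification_probability(true_score, cut_score, sem):
    """Chance that the observed score lands on the other side of the cut.

    Assumes observed score = true score + normal error with std. dev. sem.
    """
    distance_in_sems = abs(true_score - cut_score) / sem
    return NormalDist().cdf(-distance_in_sems)

# Hypothetical cut score of 200 and SEM of 3 scale points.
for true_score in (201, 203, 209):
    p = misclassification_probability(true_score, cut_score=200, sem=3)
    print(f"true score {true_score}: misclassified with probability {p:.3f}")
```

A student one point above the cut is misclassified far more often than a student nine points above it, so a system with several cut scores has more students in the risky zone.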
![Page 23: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/23.jpg)
• Assessments must align with the teacher’s instructional responsibility
– Validity
• Is it assessing what you think it’s assessing?
– Reliability
• If we gave it again, would the results be consistent?
What is measured must be aligned to what is being taught
![Page 24: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/24.jpg)
Black, P. and Wiliam, D. (2007) 'Large-scale assessment systems: Design principles drawn from international comparisons', Measurement: Interdisciplinary Research & Perspective, 5: 1, 1–53
• …when science is defined in terms of knowledge of facts that are taught in school…(then) those students who have been taught the facts will know them, and those who have not will…not. A test that assesses these skills is likely to be highly sensitive to instruction.
The instrument must be able to detect instruction
![Page 25: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/25.jpg)
Black, P. and Wiliam, D. (2007) 'Large-scale assessment systems: Design principles drawn from international comparisons', Measurement: Interdisciplinary Research & Perspective, 5: 1, 1–53
• When ability in science is defined in terms of scientific reasoning…achievement will be less closely tied to age and exposure, and more closely related to general intelligence. In other words, science reasoning tasks are relatively insensitive to instruction.
The more complex, the harder to detect and attribute to one teacher
![Page 26: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/26.jpg)
• Security and Cheating
• Proctoring
• Procedures
Other issues
![Page 27: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/27.jpg)
Mean spring and fall test duration in minutes by school
[Figure: y-axis: Duration (min), 0 to 90; series: Spring term, Fall term]
![Page 28: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/28.jpg)
Ten minutes makes a difference ~ one RIT
[Figure: y-axis: Growth Index (RIT), -6 to 10; x-axis: 1 to 71; series: Students taking 10+ minutes longer spring than fall, All other students]
![Page 29: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/29.jpg)
Testing is complete . . . What is useful to answer our question?
The Test
The Growth Metric
The Evaluation
The Rating
![Page 30: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/30.jpg)
The metric matters - Let’s go underneath “Proficiency”
Difficulty of New York “Meets” Level
[Figure: y-axis: National Percentile, 0 to 100; x-axis: Grade 2 through Grade 8; series: Math, Reading; labels: College Readiness, Typical]
![Page 31: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/31.jpg)
What gets measured and attended to really does matter
One district’s change in 5th grade mathematics performance relative to the KY proficiency cut scores
[Figure: Mathematics; y-axis: Number of Students; x-axis: Fall RIT; series: No Change, Down, Up; labels: Proficiency, College Readiness]
![Page 32: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/32.jpg)
Changing from Proficiency to Growth means all kids matter
Number of 5th grade students meeting projected mathematics growth in the same district
[Figure: Mathematics; y-axis: Number of Students; x-axis: Student’s score in fall; series: Below projected growth, Met or above projected growth]
![Page 33: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/33.jpg)
How can we make it fair?
The Test
The Growth Metric
The Evaluation
The Rating
![Page 34: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/34.jpg)
• What if I skip this step?
– Comparison is likely against normative data so the comparison is to “typical kids in typical settings”
• How fair is it to disregard context?
– Good teacher – bad school
– Good teacher – challenging kids
How does your performance evaluation consider context?
Consider . . .
![Page 35: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/35.jpg)
• Value added models control for a variety of classroom, school level, and other conditions
– Over one hundred different value added models
– All attempt to minimize error
– Variables outside controls are assumed as random
• Results are not stable
– The use of multiple years of data is highly recommended
– Results are more likely to be stable at the extremes
Nothing is perfect
![Page 36: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/36.jpg)
Multiple years of data are necessary for some stability
Typical r values for measures of teaching effectiveness range between .30 and .60 (Brown Center on Education Policy, 2010)
Teachers with growth scores in lowest and highest quintile over two years using NWEA’s MAP (493 teachers)
[Figure: y-axis: Number of teachers, 0 to 120; x-axis: Lowest, Highest; series: Year 1, Year 2]
Vote – Year 2 above or below
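What a year-to-year correlation of .30 to .60 implies can be sketched with the usual regression-to-the-mean relationship. This assumes standardized scores (mean 0, standard deviation 1) and a linear relationship between years. The function name and the example teacher are hypothetical.

```python
def expected_year2_z(year1_z, r):
    """Expected standardized score in year 2 given year 1 and correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    return r * year1_z

# A hypothetical teacher 1.5 standard deviations below average in year 1.
for r in (0.30, 0.45, 0.60):
    print(f"r={r:.2f}: expected year 2 z-score {expected_year2_z(-1.5, r):.2f}")
```

Under these assumptions, a teacher who looks very low in one year is expected to look considerably closer to average the next, which is why a single year is a weak basis for a rating.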
![Page 37: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/37.jpg)
• Control for statistical error
– All models attempt to address this issue
• Error is compounded with combining two test events
– Nevertheless, many teachers’ value-added scores will fall within the range of statistical error
A variety of errors mean more stability only at the extremes
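The compounding of error across two test events can be sketched as follows, assuming the errors of the two tests are independent so that their variances add. The standard errors used are illustrative, not values from any particular assessment.

```python
import math

def growth_standard_error(se_test1, se_test2):
    """Standard error of a growth score (test 2 minus test 1).

    Assumes independent errors, so the variances add.
    """
    return math.hypot(se_test1, se_test2)

# Hypothetical standard errors of 3 scale points on each test.
se_growth = growth_standard_error(3.0, 3.0)
print(f"each test: 3.0, growth: {se_growth:.2f}")
```

The growth score is noisier than either test alone, while the growth itself is usually much smaller than the scores it is computed from.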
![Page 38: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/38.jpg)
[Figure: Mathematics Growth Index Distribution by Teacher - Validity Filtered; y-axis: Average Growth Index Score and Range, -12 to 12; teachers grouped by quintile, Q1 to Q5]
Each line in this display represents a single teacher. The graphic shows the average growth index score for each teacher (green line), plus or minus the standard error of the growth index estimate (black line). We removed students who had tests of questionable validity and teachers with fewer than 20 students.
Range of teacher value-added estimates
![Page 39: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/39.jpg)
With one teacher, error means a lot
![Page 40: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/40.jpg)
• Value-added models assume that variation is caused by randomness if not controlled for explicitly
– Young teachers are assigned disproportionate numbers of students with poor discipline records
– Parent requests for the “best” teachers are honored
– Sound educational reasons for placement are likely to be defensible
Assumption of randomness can have risk implications
![Page 41: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/41.jpg)
• Idiosyncratic cases
– In self-contained classrooms, one or two idiosyncratic cases can have a large effect on results
Lower numbers can significantly impact a teacher level analysis
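A small sketch of how much one unusual student moves a class average, depending on class size. The growth values are hypothetical and the function name is introduced only for this illustration.

```python
from statistics import mean

def shift_from_one_case(class_size, typical_growth, unusual_growth):
    """Change in the class mean when one student has an unusual result."""
    scores = [typical_growth] * (class_size - 1) + [unusual_growth]
    return mean(scores) - typical_growth

# Hypothetical: typical growth of 5 points, one student drops 15 points.
for n in (20, 100):
    print(f"class of {n}: mean shifts by {shift_from_one_case(n, 5, -15):.2f}")
```

The shift equals the size of the deviation divided by the number of students, so a self-contained classroom of 20 is five times as exposed as a group of 100.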
![Page 42: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/42.jpg)
How tests are used to evaluate teachers
The Test
The Growth Metric
The Evaluation
The Rating
![Page 43: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/43.jpg)
• How would you translate a rank order to a rating?
• Data can be provided
• Value judgment ultimately used to set cut scores for points or rating
Translation into ratings can be difficult to inform with data
![Page 44: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/44.jpg)
Decisions are value based, not empirical
• What is far below a district’s expectation is subjective
• What about
– Obligation to help teachers improve?
– Quality of replacement teachers?
![Page 45: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/45.jpg)
• System for combining elements and producing a rating is also a value based decision
– Multiple measures and principal judgment must be included
– Evaluate the extremes to make sure it makes sense
Even multiple measures need to be used well
![Page 46: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/46.jpg)
• Principal evaluation, state test, and local assessment scores are combined
– Rating and points generated separately for each category
– Principal has 60% of the evaluation
• What happens at the extremes
– Low end of Developing (not Ineffective) with test scores requires 98% rating by principal to not fall to Ineffective
• Effective needs 95%
– A highly effective teacher based on test scores needs 50% or higher on Principal evaluation to maintain rating
NY use of multiple measures provides an example
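The kind of check done at the extremes can be sketched with a simple point system. This assumes a 100-point composite in which the principal's evaluation contributes up to 60 points (the 60% on the slide) and the test-based measures the rest. The composite cut score of 65 and the 6 test points are hypothetical inputs, and the function name is introduced only for this illustration.

```python
def required_principal_share(composite_cut, test_points, principal_max=60):
    """Share of the principal's points needed to reach a composite cut score.

    A result above 1.0 means the cut cannot be reached at all.
    """
    needed_points = max(0, composite_cut - test_points)
    return needed_points / principal_max

# Hypothetical: 6 points earned from tests, composite cut score of 65.
share = required_principal_share(composite_cut=65, test_points=6)
print(f"principal rating needed: {share:.0%}")
```

Running the same calculation for each rating band shows whether a weighting scheme lets either component effectively override the other.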
![Page 47: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/47.jpg)
• Be thoughtful
• Involve variety of stakeholders
• Use multiple years of student achievement data
• Begin with pilots to understand the accuracy and unintended consequences
• Embrace the formative advantages of growth measurement as well as the summative
Recommendations
![Page 48: Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame](https://reader034.fdocuments.net/reader034/viewer/2022051609/547b1c59b4af9fb4588b457b/html5/thumbnails/48.jpg)
• Presentations and other recommended resources are available at:
– www.nwea.org
– www.kingsburycenter.org
• Contacting us:
NWEA Main Number 503-624-1951
E-mail: [email protected]
More information