
ISSUES & ANSWERS

U.S. Department of Education

Measuring how benchmark assessments affect student achievement

REL 2007 No. 039

At Education Development Center, Inc.


Measuring how benchmark assessments affect student achievement

December 2007

Prepared by

Susan Henderson, WestEd
Anthony Petrosino, WestEd
Sarah Guckenburg, WestEd
Stephen Hamilton, WestEd

At Education Development Center, Inc.

ISSUES & ANSWERS    REL 2007 No. 039

U.S. Department of Education


Issues & Answers is an ongoing series of reports from short-term Fast Response Projects conducted by the regional educational laboratories on current education issues of importance at local, state, and regional levels. Fast Response Project topics change to reflect new issues, as identified through lab outreach and requests for assistance from policymakers and educators at state and local levels and from communities, businesses, parents, families, and youth. All Issues & Answers reports meet Institute of Education Sciences standards for scientifically valid research.

December 2007

This report was prepared for the Institute of Education Sciences (IES) under Contract ED-06-CO-0025 by Regional Educational Laboratory Northeast and Islands administered by Education Development Center, Inc. The content of the publication does not necessarily reflect the views or policies of IES or the U.S. Department of Education nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.

This report is in the public domain. While permission to reprint this publication is not necessary, it should be cited as:

Henderson, S., Petrosino, A., Guckenburg, S., & Hamilton, S. (2007). Measuring how benchmark assessments affect student achievement (Issues & Answers Report, REL 2007-No. 039). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Northeast and Islands. Retrieved from http://ies.ed.gov/ncee/edlabs

This report is available on the regional educational laboratory web site at http://ies.ed.gov/ncee/edlabs.

[Map of the United States showing the regional educational laboratory regions. At Education Development Center, Inc.]


    Summary

This report examines a Massachusetts pilot program for quarterly benchmark exams in middle-school mathematics, finding that program schools do not show greater gains in student achievement after a year. But that finding might reflect limited data rather than ineffective benchmark assessments.

Benchmark assessments are used in many districts throughout the nation to raise student, school, and district achievement and to meet the requirements of the No Child Left Behind Act of 2001. This report details a study using a quasi-experimental design to examine whether schools using quarterly benchmark exams in middle-school mathematics under a Massachusetts pilot program show greater gains in student achievement than schools not in the program.

To measure the effects of benchmark assessments, the study matched comparison schools to the 22 schools in the Massachusetts pilot program on pre-implementation test scores and other variables. It examined descriptive statistics on the data and performed interrupted time series analysis to test causal inferences.

The study found no immediate statistically significant or substantively important difference between the program and comparison schools. That finding might, however, reflect limitations in the data rather than the ineffectiveness of benchmark assessments.

First, data are lacking on what benchmark assessment practices comparison schools may be using, because the study examined the impact of a particular structured benchmarking program. More than 70 percent of districts are doing some type of formative assessment, so it is possible that at least some of the comparison schools implemented their own version of benchmarking. Second, the study was underpowered. That means that a small but important treatment effect for benchmarking could have gone undetected because there were only 22 program schools and 44 comparison schools. Third, with only one year of post-implementation data, it may be too early to observe any impact from the intervention in the program schools.

Although the study did not find any immediate difference between schools employing benchmark assessments and those not doing so, it provides initial empirical data to inform state and local education agencies.

The report urges that researchers and policymakers continue to track achievement data in the program and comparison schools, to


reassess the initial findings in future years, and to provide additional data to local and state decisionmakers about the impact of this benchmark assessment practice.

Using student-level data rather than school-level data might help researchers examine the impact of benchmark assessments on important No Child Left Behind subgroups (such as minority students or students with disabilities). Some nontrivial effects for subgroups might be masked by comparing school mean scores. (At the onset of the study, only school-level data were available to researchers.)

Another useful follow-up would be disaggregating the school achievement data by mathematics content strand to see if there are any effects in particular standards. Because the quarterly assessments are broken out by mathematics content strand, doing so would connect logically with the benchmark assessment strategy. This refined data analysis might be more sensitive to the intervention and might also be linked to information provided to the Massachusetts Department of Education about which content strands schools focused on in their benchmark assessments.

Conversations with education decisionmakers support what seems to be common sense. Higher mathematics scores will come not because benchmarks exist but because of how a school's teachers and leaders use the assessment data. This kind of follow-up research, though difficult, is imperative to better understand the impact of benchmark assessments. A possible approach is to examine initial district progress reports for insight into school buy-in to the initiative, quality of leadership, challenges to implementation, particular standards that participating districts focus on, and how schools use the benchmark assessment data.

    December 2007


TABLE OF CONTENTS

Overview   1

Data on the effectiveness of benchmark assessments are limited   1

Few effects from benchmark assessments are evident after one program year   3
. . . using descriptive statistics   5
. . . or interrupted time series analysis   6

Why weren't effects evident after the first program year?   7

How to better understand the effects of benchmark assessments   7

Appendix A  Methodology   9

Appendix B  Construction of the study database   13

Appendix C  Identification of comparison schools   15

Appendix D  Interrupted time series analysis   22

Appendix E  Massachusetts Curriculum Frameworks for grade 8 mathematics (May 2004)   26

Notes   36

References   40

Boxes

1  Key terms used in the report   2

2  Methodology   4

Figures

1  Scaled eighth-grade mathematics scores for program and comparison schools in the Massachusetts Comprehensive Assessment System, 2001–06   6

2  Raw eighth-grade mathematics scores for program and comparison schools in the Massachusetts Comprehensive Assessment System, 2001–06   6

Tables

1  Scaled eighth-grade mathematics scores for program and comparison schools in the Massachusetts Comprehensive Assessment System, 2001–06   5

C1  Percentage of low-income students and Socio-Demographic Composite Index for five selected schools   16

C2  2005 Composite Performance Index mathematics score, Socio-Demographic Composite Index, and adjusted academic score for five selected schools   16

C3  Socio-Demographic Composite Index, adjusted academic score, and Mahalanobis Distance score for five selected schools   17

C4  Comparison of means and medians of initial program and comparison schools   18


C5  T-test for differences in pretest mathematics scores between initial program and comparison schools, 2001–05   19

C6  T-test for differences in pretest scaled mathematics scores, 2001–05 (Mahapick sample)   20

C7  Comparison of means and medians of final program and comparison schools   21

D1  Variables used in the analysis   2

D2  Baseline mean model, program schools only (N=22 schools, 115 observations)   2

D3  Baseline mean model, comparison schools only (N=44 schools, 20 observations)   2

D4  Baseline mean model, difference-in-difference estimate (N=66 schools, 5 observations)   2

D5  Baseline mean model, difference-in-difference estimate, with covariates (N=66 schools, 5 observations)   2

D6  Linear trend model, program schools only (N=22 schools, 115 observations)   2

D7  Linear trend model, comparison schools only (N=44 schools, 20 observations)   2

D8  Linear trend model, difference-in-difference estimate (N=66 schools, 5 observations)   25

D9  Linear trend model, difference-in-difference estimate, with covariates (N=66 schools, 5 observations)   25

E1  Scaled score ranges of the Massachusetts Comprehensive Assessment System by performance level and year   5


OVERVIEW

Benchmark assessments are used in many districts throughout the United States to raise student, school, and district achievement and to meet the requirements of the No Child Left Behind Act of 2001 (see box 1 on key terms). This report details a study using a quasi-experimental design to examine whether schools using quarterly benchmark exams in middle-school mathematics under a Massachusetts pilot program show greater gains in student achievement than schools not in the program.

To measure the effects of benchmark assessments, the study matched comparison schools to the 22 program schools in the Massachusetts pilot program on pre-implementation test scores and other variables. It examined descriptive statistics on the data and performed interrupted time series analysis to test causal inferences.

The study found no immediate statistically significant or substantively important difference between the program and comparison schools a year after the pilot began. That finding might, however, reflect limitations in the data rather than the ineffectiveness of benchmark assessments.

DATA ON THE EFFECTIVENESS OF BENCHMARK ASSESSMENTS ARE LIMITED

Benchmark assessments align with state standards, are generally administered three or four times a year, and provide educators and administrators with immediate student-level data connected to individual standards and content strands (Herman & Baker, 2005; Olson, 2005). Benchmark assessments are generally regarded as a promising practice. A U.S. Department of Education report (2007) notes that regardless of their specific mathematics programs, No Child Left Behind Blue Ribbon Schools . . . [all] emphasize alignment of the school's mathematics curriculum with state standards and conduct frequent benchmark assessments to determine student mastery of the standards.

By providing timely information to educators about student growth on standards, such


assessments allow instructional practices to be modified to better meet student needs. Benchmark assessments fill a gap left by annual state tests, which often provide data only months after they are administered and whose purpose is largely summative (Herman & Baker, 2005; Olson, 2005). A 2005 Education Week survey of superintendents found that approximately 70 percent reported using benchmark assessments in their districts (Olson). But there is little empirical evidence to determine whether and to what extent these aligned benchmark assessments affect student outcomes. This report provides evidence on the impact of a Massachusetts Department of Education benchmark assessment initiative targeting high-poverty middle schools.

BOX 1
Key terms used in the report

Benchmark assessment. A benchmark assessment is an interim assessment created by districts that can be used both formatively and summatively. It provides local accountability data on identified learning standards for district review after a defined instructional period and provides teachers with student outcome data to inform instructional practice and intervention before annual state summative assessments. In addition, a benchmark assessment allows educators to monitor the progress of students against the state standards and to predict performance on state exams.

Content strand. The Massachusetts Curriculum Frameworks contain five content strands that are assessed through the Massachusetts Comprehensive Assessment System: number sense and operation; patterns, relations, and algebra; geometry; measurement; and data analysis, statistics, and probability.

Effect size. An effect size of 0.0 means that the experimental group is performing, on average, about 0.0 of a standard deviation better than the comparison group (Valentine and Cooper, 2003). An effect size of 0.0 represents a roughly 20 percent improvement over the comparison group.

Formative assessment. In this study a formative assessment is an assessment whose data are used to inform instructional practice within a cycle of learning for the students assessed. In September 2007 the Formative Assessment for Students and Teachers study group of the Council of Chief State School Officers' Assessment for Learning further refined the definition of formative assessment as a process used by teachers and students during instruction that provides feedback to adjust ongoing teaching and learning to improve students' achievement of intended instructional outcomes (see http://www.ccsso.org/projects/scass/Projects/Formative_Assessment_for_Students_and_Teachers).

Interrupted time series analysis. An interrupted time series analysis is a series of observations made on one or more variables over time before and after the implementation of a program or treatment (Shadish, Cook, & Campbell, 2002).

Quasi-experimental design. A quasi-experimental design is an experimental design where units of study are not assigned to conditions randomly (Shadish, Cook, & Campbell, 2002).

Scaled scores. Scaled scores are constructed by converting students' raw scores (say, the number of questions correct) on a test to yield comparable results across students, test versions, or time.

Statistical power. Statistical power refers to the ability of the statistical test to detect a true treatment effect, if one exists. Although there are other design features that can influence the statistical power of a test, researchers are generally most concerned with sample size, because it is the component they have the most control over and can normally plan for.

Summative assessment. A summative assessment is designed to show the extent to which students understand the skills, objectives, and content of a program of study. The assessments are administered after the opportunity to learn subject matter has ended, such as at the end of a course, semester, or grade.

Underpowered study. A study is considered underpowered if, all else being equal, it lacks a sufficient sample size to detect a small but nontrivial treatment effect. An underpowered study would lead researchers to report that such a small but nontrivial difference was not statistically significant.
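The statistical power and underpowered study entries can be made concrete with a small calculation. The sketch below is illustrative only, not a computation from the report: it treats the effect size as Cohen's d (the difference in group means divided by the pooled standard deviation) and treats the 22 program and 44 comparison schools as independent observations in a simple two-group comparison, ignoring the time series structure of the actual analysis. It uses the Python statsmodels library; the names and thresholds here are ours.

# Illustrative power calculation (not from the report): the smallest
# standardized effect size (Cohen's d) detectable with 80 percent power at
# alpha = 0.05 in a two-group comparison of 22 versus 44 schools.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
mde = analysis.solve_power(effect_size=None, nobs1=22, ratio=44 / 22,
                           alpha=0.05, power=0.80, alternative='two-sided')
print(f"Minimum detectable effect size: d = {mde:.2f}")
# A small but nontrivial effect (say d = 0.20) falls well below this
# threshold, which is what it means for the study to be underpowered.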


Studies of benchmark assessments' effects on student outcomes are few. But the substantial literature on the effects of formative assessments more generally points consistently to the positive effects of formative assessment on student learning (Black & Wiliam, 1998a, 1998b; Bloom, 1984). Reviewing 250 studies of classroom formative assessments, Black and Wiliam (1998a, 1998b) find that formative assessments, broadly defined, are positively correlated with student learning, boosting performance 20–30 percent over that of comparison groups (with effect sizes from 0.40 to 0.70).1 Black and Wiliam note that these positive effects are even larger for low-achieving students than for the general student population. Other studies indicate that formative assessments can support students and teachers in identifying learning goals and the instructional strategies to achieve them (Boston, 2002). Whether these trends hold for benchmark assessments, however, has yet to be shown.

Making this report particularly timely are the widespread interest in the Northeast and Islands Region in formative assessment and systems to support it and the piloting of a benchmark assessment approach to mathematics in Massachusetts middle schools. State education agencies in New York, Vermont, and Connecticut are also working with federal assessment and accountability centers and regional comprehensive centers to pilot formative and benchmark assessment practices in select districts. And the large financial investment required for the data management systems to support this comprehensive approach underscores the need for independent data to inform state and district investment decisions.

The 2005 Massachusetts Comprehensive School Reform and the Technology Enhancement Competitive grant programs include priorities for participating schools and districts to develop and use benchmark assessments. As a result, eight Massachusetts school districts use a data management system supported by Assessment Technologies Incorporated to develop their own grade-level benchmark assessments in mathematics for about 10,000 middle-school students in 25 schools. The decision of the Massachusetts Department of Education to support the development of mathematics benchmark assessments in a limited number of middle schools provided an opportunity to study the effects on student achievement.

This report details a study on whether schools using quarterly benchmark exams in middle-school mathematics under the Massachusetts pilot program show greater gains in student achievement after one year than schools not in the program. The study looked at 44 comparison schools and 22 program schools using quarterly benchmark assessments aligned with Massachusetts Curriculum Frameworks Standards for mathematics in grade 8, with student achievement measured by the Massachusetts Comprehensive Assessment System (MCAS).

FEW EFFECTS FROM BENCHMARK ASSESSMENTS ARE EVIDENT AFTER ONE PROGRAM YEAR

The study was designed to determine whether there was any immediate, discernible effect on eighth-grade mathematics achievement from using benchmark assessments in middle schools receiving the Comprehensive School Reform grants. An advantage of the study's achievement data was that they went beyond a single pretest year and included scores from five prior annual administrations of the MCAS, yielding five pre-implementation years for eighth-grade mathematics scores. A disadvantage of the data was that they contain only one post-test year.2 Even so, the data could show whether there was any perceptible, immediate increase or decrease in scores due to the implementation of benchmark assessments.



BOX 2
Methodology

A quasi-experimental design, with program and matched comparison schools, was used to examine whether schools using quarterly benchmark exams in middle-school mathematics under the Massachusetts pilot program showed greater gains in student achievement in mathematics performance after one year than schools not in the program. Analyses were based on (mostly) publicly available,1 school achievement and demographic data maintained by the Massachusetts Department of Education. The primary outcome measure was eighth-grade mathematics achievement, as assessed by the Massachusetts Comprehensive Assessment System (MCAS).

Defining the program

The study defined benchmark assessments as assessments that align with the Massachusetts Curriculum Frameworks Standards, are administered quarterly at the school level, and yield student-level data, immediately available to school educators and administrators, aligned with individual standards and content strands. For the benchmark assessment initiative examined in the report, the Massachusetts Department of Education selected high-poverty middle schools under pressure to significantly improve their students' mathematics achievement, choosing 25 schools in eight districts to participate in the pilot initiative.

Constructing the study database and describing the variables

Data were collected from student- or school-level achievement and demographic data maintained by the Massachusetts Department of Education.2 The outcome variable was scaled eighth-grade MCAS mathematics scores over 2001–06. The MCAS, which fulfills the requirements of the No Child Left Behind Act of 2001 requiring annual assessments in reading and mathematics for students in grades 3–8 and in high school, tests all public school students in Massachusetts.

Other variables gathered for the study included the school name, location, grade structure, and enrollment; the race and ethnicity of students; and the proportion of limited English proficiency and low-income students.

Creating a comparison group

Only a well implemented randomization procedure controls for both known and unknown factors that could influence or bias the findings. But because the grants to implement the benchmark assessments were already distributed and the program was already administered to schools, random assignment was not possible. So, it was necessary to use other procedures to create a counterfactual: a set of schools that did not receive the program.

The study used covariate matching to create a set of comparison schools that was as similar as possible to the program schools (in the aggregate) on the chosen factors, meaning that any findings, whether positive or negative, would be unlikely to have been influenced by those factors. These variables included enrollment, percentage of students classified as low income, percentage of students classified as English language learners, and percentage of students categorized in different ethnic groups. Also included were each school's eighth-grade baseline (or pretest) mathematics score (based on an average of its 20005 eighth-grade mathematics scores) and the type of location it served.

Prior research guided the selection of the variables used as covariates in the matching. Bloom (2003) suggests that pretest scores are perhaps the most important variable to use in a matching procedure. There is also substantial research that identifies large gaps in academic achievement for racial minorities (Jencks & Phillips, 1998), low-income students (Hannaway, 2005), and English language learners (Abedi & Gandara, 2006). Although the research on the relationship between school size and academic achievement is somewhat conflicting (Cotton, 1996), the variability in school size resulted in total enrollment in the middle school being included in the matching procedure.

The eligibility pool for the comparison matches included the 89 Massachusetts middle schools that did not receive the Comprehensive School Reform grants. Statistical procedures were used to identify the two best matches for each program school from the eligibility pool. The covariate matching resulted in a final sample of 22 program schools and 44 comparison schools that were nearly identical on pretest academic scores. The project design achieved balance on nearly all school-level social and demographic characteristics, except that there were larger shares of African American and Pacific Islander students in program schools. These differences were controlled for statistically in the outcome analysis, with no change in the results (see appendix D).

Analyzing the data

After matching, descriptive statistics were used to examine the mean scores for all five pre-implementation years and one post-implementation year for the program and comparison schools. A comparative interrupted time series analysis was also used to more rigorously assess whether there was a statistically significant difference between the program and comparison schools in changes in mathematics performance (see Bloom, 2003; Cook & Campbell, 1979). The interrupted time series design was meant to determine whether there was any change in the trend because of the interruption (program implementation).

Notes

1. The 20010 achievement data were not publicly available and had to be requested from the Massachusetts Department of Education.
2. Student level data for 20010 had to be aggregated at the school level. The 20010 achievement data were provided by the Massachusetts Department of Education at the student level, but were not linked to the student level demographic data for the same years.


To attribute changes to benchmark assessments, more information than pretest and post-test scores from program schools was needed. Did similar schools, not implementing the benchmarking practice, fare better or worse than the 22 program schools? It could be that the program schools improved slightly, but that similar schools not implementing the benchmark assessment practice did much better or much worse. So achievement data were also examined from comparison schools, a set of similar Massachusetts schools that did not implement the program (for details on methodology, see box 2 and appendixes A and B; for details on selecting comparison schools, see box 2 and appendix C).

Researchers developed a set of comparison middle schools in Massachusetts that were very similar to the program schools (in the aggregate) on a number of variables. Most important, the comparison schools were nearly identical to the program schools on the pre-implementation scores. The comparison schools thus provided an opportunity to track the movement of eighth-grade mathematics scores over the period in the absence of the program.

. . . using descriptive statistics

Scaled scores for program and comparison schools from 2001 to 2006 did not show a large change in eighth-grade MCAS scores for either program or comparison schools (table 1). Note that scaled scores for both groups were distributed in the MCAS needs improvement category for all years, further evidence to support the validity of the matching procedure.

There appeared to be a very slight uptick in eighth-grade mathematics outcomes after the intervention in 2006. There was, however, a similar increase in 2004, before the intervention. And trends were similar for the program and comparison groups. In both, there was a very slight increase on the outcome measure, but similar increases occurred before the 2006 intervention (figure 1). So, the descriptive statistics showed no perceptible difference between the 22 program schools and the 44 comparison schools on their 2006 eighth-grade mathematics outcomes.


TABLE 1
Scaled eighth-grade mathematics scores for program and comparison schools in the Massachusetts Comprehensive Assessment System, 2001–06

Year    Program schools    Comparison schools
2001    224.80             226.31
2002    223.21             223.28
2003    224.81             224.09
2004    226.10             225.32
2005    225.62             225.23
2006    226.98             226.18

Note: Scaled scores are constructed by converting students' raw scores (say, the number of questions answered correctly) on a test to yield comparable results across students, test versions, or time. Scores for both groups are distributed in the Massachusetts Comprehensive Assessment System needs improvement category for all years.

Source: Authors' analysis based on data described in text.
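The descriptive comparison in table 1 is simply a year-by-group mean of school-level scaled scores. The sketch below is a hypothetical illustration of that computation, not the study's database or code; the pandas library, names, and values are ours.

# Illustrative sketch: build a table-1-style summary from hypothetical
# school-level scaled scores (all values are made up).
import pandas as pd

scores = pd.DataFrame({
    "school": ["P1", "P2", "C1", "C2", "P1", "P2", "C1", "C2"],
    "group":  ["program", "program", "comparison", "comparison"] * 2,
    "year":   [2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006],
    "scaled_score": [225.1, 226.2, 224.9, 225.6, 226.5, 227.3, 225.8, 226.7],
})
summary = scores.pivot_table(index="year", columns="group",
                             values="scaled_score", aggfunc="mean")
print(summary.round(2))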


The study also examined raw scores because it is possible that scaling test scores could mask effects over time. The range in raw scores was larger, and scores trended sharply higher in 2006. But again both program and comparison schools showed a similar trend, more sharply upward than that of the scaled scores (figure 2).

. . . or interrupted time series analysis

Relying on such strategies alone was not adequate to rigorously assess the impact of benchmark assessment. To assess the differences between program and comparison schools in changes in mathematics performance, the study used interrupted time series analysis, which established the pre-intervention trend in student performance and analyzed the post-intervention data to determine whether there was a departure from that trend (Bloom, 2003; see appendix D for details). Five years of annual pre-implementation data and a year of post-implementation data formed the time series. The program schools' implementation of the benchmark assessment practice in 2006 was the intervention, or interruption.

There was a small but statistically significant increase in the program schools in 2006. The program schools had slightly higher mean eighth-grade mathematics scores than what would have been expected without the program. But this small, statistically significant increase also occurred in the comparison schools, where mean mathematics scores were slightly above the predicted trend.

Difference-in-difference analysis underscored the similarity between the groups. The program effect was about 0.8 of a mathematics test point (see appendix D, table D4), but it was not statistically significant. The most likely interpretation is that the achievement of both groups was slightly increasing and that the difference between them could have been due to chance rather than to any program effect. So, though both groups of schools saw similar, (slightly) higher than expected increases in their eighth-grade mathematics scaled scores in 2006, the small increase for the program schools cannot be attributed to the benchmark assessments.
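The difference-in-difference logic can be written compactly. As a sketch in our own notation (not the report's): with observed 2006 group means and means projected from each group's 2001–05 baseline model, the program effect estimate is

\Delta_{\mathrm{DiD}} = \left(\bar{Y}^{\mathrm{program}}_{2006} - \hat{Y}^{\mathrm{program}}_{2006}\right) - \left(\bar{Y}^{\mathrm{comparison}}_{2006} - \hat{Y}^{\mathrm{comparison}}_{2006}\right)

where \bar{Y}_{2006} is the observed 2006 mean and \hat{Y}_{2006} is the value projected from the 2001–05 baseline. That is, the estimate is the deviation of program schools from their own projected trend, net of the same deviation among comparison schools; the roughly 0.8-point estimate reported above is of this form.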

FIGURE 1
Scaled eighth-grade mathematics scores for program and comparison schools in the Massachusetts Comprehensive Assessment System, 2001–06

[Line graph: scaled scores (vertical axis, 220 to 230) by year (2001 to 2006) for program and comparison schools, with the 2006 intervention marked; both series fall within the needs improvement score range.]

Note: Scaled scores are constructed by converting students' raw scores (say, the number of questions correct) on a test in order to yield comparable results across students, test versions, or time.

Source: Authors' analysis based on data described in text.

FIGURE 2
Raw eighth-grade mathematics scores for program and comparison schools in the Massachusetts Comprehensive Assessment System, 2001–06

[Line graph: raw scores (vertical axis, 20 to 30) by year (2001 to 2006) for program and comparison schools, with the 2006 intervention marked.]

Source: Authors' analysis based on data described in text.


WHY WEREN'T EFFECTS EVIDENT AFTER THE FIRST PROGRAM YEAR?

The study found no statistically significant or substantively important difference between schools in their first year implementing quarterly benchmark exams in middle-school mathematics and those not employing the practice. Why? The finding might be because of limitations in the data rather than the ineffectiveness of benchmark assessments.

First, data are lacking on what benchmark assessment practices comparison schools may be using, because the study examined the impact of a particular structured benchmarking program. More than 70 percent of districts are doing some type of formative assessment (Olson, 2005), so it is possible that at least some of the comparison schools implemented their own version of benchmarking. Given the prevalence of formative assessments under the No Child Left Behind Act, it is highly unlikely that a project with strictly controlled conditions could be implemented (that is, with schools using no formative assessment at all as the comparison group).

Second, the study was underpowered. That means that a small but important treatment effect for benchmarking could have gone undetected because there were only 22 program schools and 44 comparison schools.5 Unfortunately, the sample size for program schools could not be increased because only 25 schools in the eight districts initially received the state grants (three schools were later dropped). Increasing the comparison school sample alone (from 44 to 66, for example) would have brought little additional power.

Third, with only one year of post-implementation data, it may be too early to observe any impact from intervention in the program schools.

HOW TO BETTER UNDERSTAND THE EFFECTS OF BENCHMARK ASSESSMENTS

Although the study did not find any immediate difference between schools employing benchmark assessments and those not doing so, the report provides initial empirical data to inform state and local education agencies.

To understand the longer-term effects of benchmark assessments, it would be useful to continue to track achievement data in the program and comparison schools to reassess the initial findings beyond a single post-intervention year and to provide additional data to local and state decisionmakers about the impact of this benchmark assessment practice.

Using student-level data rather than school-level data might also help researchers examine the impact of benchmark assessment on important No Child Left Behind subgroups (such as minority students or students with disabilities). By comparing school mean scores, as in this study, some nontrivial effects for subgroups may be masked. At the onset of the study, only school-level data were available to researchers, but since then working relationships have been arranged with state education agencies for specific regional educational laboratory projects to use student-level data.

Another useful follow-up would be disaggregating the school achievement data by mathematics content strand to see if there are any effects on particular standards. As the quarterly assessments are broken out by mathematics content strand, doing so would connect logically with the benchmark assessment strategy. Such an approach could determine whether the intervention has affected particular subscales of mathematics in the Massachusetts Curriculum Frameworks. This more refined outcome data may be more sensitive to the intervention and might also provide information to the Massachusetts Department of Education about which content strands



schools focused on in their benchmark assessments.

Conversations with education decisionmakers support what seems to be common sense. Higher mathematics scores will come not because benchmarks exist but because of how the benchmark assessment data are used by a school's teachers and leaders. This kind of follow-up research is imperative to better understand the impact of benchmark assessment.

But the data sources to identify successful implementation in a fast-response project can be elusive. A possible solution is to examine initial district progress reports to the Massachusetts grant program. These data may provide insight into school buy-in to the initiative, quality of leadership, challenges to implementation, particular standards that participating districts focus on, and how schools use the benchmark assessment data. Researchers may ask whether and how the teachers and administrators used the benchmark data in instruction and whether and how intervention strategies were implemented for students not performing well on the benchmark exams. Based on the availability and quality of the data, the methodology for determining the impact of the intervention could be further refined.



APPENDIX A
METHODOLOGY

This appendix includes definitions of benchmark assessments and of the Massachusetts pilot program, an overview of the construction of the study database, the methodology for creating comparison groups, and a description of the data analysis strategy. Because implementation of benchmark testing was at the school level, the unit of analysis was the school. Choosing that unit also boosted the statistical power of the study because there were 25 program schools in the original design rather than eight program districts.6

A quasi-experimental design, with program and matched comparison schools, was used to examine whether schools using quarterly benchmark exams in middle-school mathematics under the Massachusetts pilot program showed greater gains in student achievement after a year than schools not in the program. The comparisons were between program schools and comparison schools on post-intervention changes in mathematics performance. All the analyses were based on (mostly) publicly available,7 school-level achievement and demographic data maintained by the Massachusetts Department of Education. The primary outcome measure was eighth-grade mathematics achievement, as assessed by the Massachusetts Comprehensive Assessment System (MCAS).

Defining the program

The study defined benchmark assessments as assessments that align with the Massachusetts Curriculum Frameworks Standards, are administered quarterly at the school level, and yield student-level data, quickly available to school-level educators and administrators, connected to individual standards and content strands.

The study examined a Massachusetts Department of Education program targeting middle schools. Because what constitutes a middle school differs from town to town, the study defined middle schools as those that include seventh and eighth grades. Other configurations (say, grades K–8, 5–9, 6–8, 6–9, 7–8, 7–9, or 7–12) were acceptable, provided that seventh and eighth grades were included.8

For its benchmark assessment initiative, the Massachusetts Department of Education selected high-poverty middle schools under pressure to significantly improve their students' mathematics achievement. To select schools, the Massachusetts Department of Education issued a request for proposals. The department prioritized funding for districts (or consortia of districts) with four or more schools in need of improvement, corrective action, or restructuring under the current adequate yearly progress status model. The four or more schools criterion was sometimes relaxed during selection. Applications were given priority based on the state's No Child Left Behind performance rating system:

Category 1 schools were rated critically low in mathematics.

Category 2 schools were rated very low in mathematics and did not meet improvement expectations for students in the aggregate.

Category 3 schools were rated very low in mathematics and did not meet improvement expectations for student subgroups.

Category 4 schools were rated very low in mathematics and did meet improvement expectations for all students.

Category 5 schools were rated low in mathematics and did not meet improvement expectations for students in the aggregate.

The Massachusetts Department of Education selected 25 schools representing eight districts to participate in the pilot initiative. The selection of program schools targeted high-poverty schools having the most difficulty in meeting goals for student mathematics performance, introducing a selection bias into the project. Unless important variables were controlled for by design and analysis


(for example, poverty and pretest or baseline mathematics scores), any results would be confounded by pre-existing differences between schools. In the study, balance was achieved between the program and comparison schools on poverty, pretest mathematics scores, and other school-level social and demographic variables. But because the study was based on a quasi-experimental design (without random assignment to conditions), it could not assess whether the participant and comparison groups were balanced on unobserved factors.

Constructing the study database

A master database was developed in SPSS to house all the necessary data. Data were collected from student- or school-level achievement and demographic data maintained by the Massachusetts Department of Education.9 The outcome variable was scaled eighth-grade MCAS mathematics scores for 2001–06.

The MCAS was implemented in response to the Massachusetts Education Reform Act of 1993 and fulfills the requirements of the federal No Child Left Behind Act of 2001, which requires annual assessments in reading and mathematics for students in grades 3–8 and in high school. The MCAS tests all public school students in Massachusetts, including students with disabilities and those with limited English proficiency. The MCAS is administered annually and measures student performance on the learning strands in the Massachusetts Curriculum Frameworks (see appendix E). In mathematics these strands include number sense and operations; patterns, relations, and algebra; geometry; measurement; and data analysis, statistics, and probability.

According to the Massachusetts Department of Education (2007), the purpose of the MCAS is to help educators, students, and parents to:

Follow student progress.

Identify strengths, weaknesses, and gaps in curriculum and instruction.

Fine-tune curriculum alignment with statewide standards.

Gather diagnostic information that can be used to improve student performance.

Identify students who may need additional support services or remediation.

The MCAS mathematics assessment contains multiple choice, short-answer, and open response questions. Results are reported for individual students and districts by four performance levels: advanced, proficient, needs improvement, and warning. Each category corresponds to a scaled score range (see appendix E, table E1). Although the scaled score was the primary outcome variable of interest, the corresponding raw score was also collected to determine if scaled scores might have masked program effects.

The MCAS mathematics portion, comprising two 60-minute sections, is administered in May in grades 3–8. Students completing eighth grade take the MCAS in the spring of the eighth-grade year. Preliminary results from the spring administration become available to districts the next August. Eighth graders who enter in the 2007/08 school year, for example, take MCAS mathematics in May 2008, and their preliminary results become available in August 2008.

Other variables gathered for the study included the school name, grade structure, and enrollment, the race and ethnicity of students, and the proportion of limited English proficiency and low-income students. Demographic data were transformed from total numbers to the percentage of students in a category enrolled at the school (for example, those defined as low income). Supplementary geographic location data were added from the National Center for Education Statistics' Common Core of Data to identify school location (urban, rural, and so on). A variable was also created to designate each school as a program or comparison school based on the results of the matching procedure. See appendix B for the specific steps in constructing the database.


Creating a comparison group

Only a well implemented randomization procedure controls for both known and unknown factors that could influence or bias findings. But because the grants to implement benchmark assessments were already distributed and the program was already assigned to schools, random assignment to conditions was not possible. So, it was necessary to use other procedures to create a counterfactual: a set of schools that did not receive the program.

The study used covariate matching to create a set of comparison schools (appendix C details the matching procedure). Using covariates in the matching process is a way to control for the influence of specific factors on the results. In other words, the comparison schools would be as similar as possible to the program schools (in the aggregate) on these factors, meaning that any findings, whether positive or negative, would be unlikely to have been influenced by these factors. The variables used in the matching procedure included a composite index of school-level social and demographic variables: enrollment, percentage of students classified as low income, percentage of students classified as English language learners, and percentage of students categorized in different ethnic groups. Also included in the matching procedure were each school's eighth-grade baseline (or pretest) mathematics score (based on an average of its 20005 eighth-grade mathematics scores) and the type of geographic location the school served (classified according to the National Center for Education Statistics' Common Core of Data survey).

Prior research guided the selection of the variables used as covariates in the matching. Bloom (2003) suggests that pretest scores are perhaps the most important variable to use in a matching procedure. Pretest-post-test correlations on tests like the MCAS can be very high, and it is important that the comparison group and program group are as similar as possible on pretest scores. By taking into account the 20005 average eighth-grade mathematics scores (also known as the Composite Performance Index), the report tried to ensure that the comparison schools are comparable on baseline mathematics scores.

There is substantial research that identifies large gaps in academic achievement for racial minorities (Jencks & Phillips, 1998), low-income students (Hannaway, 2005), and English language learners (Abedi & Gandara, 2006). Unless these influences were controlled for, any observed differences might have been due to the program or comparison schools having a higher share of students in these categories rather than to benchmarking. Although the research on the relationship between school size and academic achievement is somewhat conflicting (Cotton, 1996), the variability in school size led the report to introduce into the matching procedure the total enrollment in the middle school.

The eligibility pool for the comparison matches included the 89 Massachusetts middle schools that did not receive the Comprehensive School Reform grants. Statistical procedures were used to identify the two best matches for each program school from the eligibility pool. The covariate matching resulted in a final sample of 22 program schools and 44 comparison schools that were nearly identical on pretest academic scores.10 In addition, the project design achieved balance on nearly all school-level social and demographic characteristics, except that there were larger shares of African American and Pacific Islander students in program schools. These differences were controlled for statistically in the outcome analysis, with no change in the results (see appendix D).

Analyzing the data

After matching, descriptive statistics were used to examine the mean scores for all five pre-implementation years and one post-implementation year for the program and comparison schools. A comparative interrupted time series analysis was also used to more rigorously assess whether there was a statistically significant difference between

    19/47

    12 meaSuring hOw benchmark aSSeSSmentS aect Student achievement

the program and comparison schools in changes in mathematics performance (see Bloom, 2003; Cook & Campbell, 1979). The interrupted time series design was meant to determine whether there was any change in the trend because of the interruption (program implementation).

The method for short interrupted time series in Bloom (2003) was the analysis strategy. Bloom argues that the approach can measure the impact of a reform as the subsequent deviation from the past pattern of student performance for a specific grade (p. 5). The method establishes the trend in student performance over time and analyzes the post-intervention data to determine whether there was a departure from that trend. This is a tricky business, and trend departures can often be statistically significant. It is important to rule out other alternative explanations for any departure from the trend, such as change in principals, other school reform efforts, and so on. Although Bloom outlines the method for use in evaluating effects on a set of program schools alone, having a well matched group of comparison schools strengthens causal inferences.

To project post-implementation mathematics achievement for each school, both linear baseline trend models and baseline mean models (see Bloom, 2003) were estimated using scaled and raw test score data collected over five years before the intervention. Estimates of implementation effects then come from differences-in-differences in observed and predicted post-implementation test scores between program and comparison schools.
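The estimation strategy just described can be illustrated with a short sketch. This is not the study's code (the analyses were run in SPSS and Stata) and the data are simulated; the sketch fits the two baseline models to 2001–05 school means, projects 2006, and forms the difference-in-differences estimate. A full analysis would also estimate standard errors and add covariates, as in the appendix D tables.

# Illustrative sketch (not the study's SPSS/Stata code): fit a baseline mean
# model and a linear baseline trend model to 2001-05 school means, project
# 2006, and form a difference-in-differences estimate of the program effect.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
years = np.arange(2001, 2007)

def simulate_group(n_schools, label):
    # Fake scaled MCAS means in the "needs improvement" range, flat trend.
    rows = []
    for s in range(n_schools):
        base = rng.normal(225, 1.5)
        for y in years:
            rows.append({"school": f"{label}{s}", "group": label, "year": y,
                         "score": base + rng.normal(0, 0.8)})
    return pd.DataFrame(rows)

df = pd.concat([simulate_group(22, "program"), simulate_group(44, "comparison")])

def projected_2006(group_df, model="mean"):
    pre = group_df[group_df.year < 2006].groupby("year")["score"].mean()
    if model == "mean":                                       # baseline mean model
        return pre.mean()
    slope, intercept = np.polyfit(pre.index, pre.values, 1)   # linear trend model
    return slope * 2006 + intercept

effects = {}
for g in ["program", "comparison"]:
    gdf = df[df.group == g]
    observed = gdf[gdf.year == 2006]["score"].mean()
    effects[g] = observed - projected_2006(gdf, model="linear")

did_estimate = effects["program"] - effects["comparison"]
print(f"Difference-in-differences estimate: {did_estimate:.2f} scaled-score points")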


APPENDIX B
CONSTRUCTION OF THE STUDY DATABASE

The following outlines the specific steps taken to construct the study database:

1. Identify all the middle schools in Massachusetts.

2. Identify the 25 program schools using benchmark assessments in mathematics.

3. Collect the following variables from the Massachusetts Department of Education web site on each of the schools to proceed to the covariate matching exercise that will identify two matched comparison schools for each program school:

   a. School name.
      Source: http://profiles.doe.mass.edu/enrollmentbygrade.aspx.

   b. CSR implementation.

   c. School locale (urban, rural, and so on).
      Source: http://nces.ed.gov/ccd/districtsearch.

   d. Does the school have a 6th grade?
      Source: http://nces.ed.gov/ccd/districtsearch.

   e. Does the school have an eighth grade?
      Source: http://nces.ed.gov/ccd/districtsearch.

   f. Total enrollment.
      Source: http://profiles.doe.mass.edu/enrollmentbygrade.aspx.

   g. Race/ethnicity of student population.
      Source: http://profiles.doe.mass.edu/enrollmentbyracegender.aspx?mode=school&orderBy=&year=2006.

   h. Limited English proficiency.
      i. Number of limited English proficiency students.
         Source: http://profiles.doe.mass.edu/selectedpopulations.aspx?mode=school&orderBy=&year=2006.
      ii. Percentage of limited English proficiency students.
         Number of limited English proficiency students / total enrollment.

   i. Low income.
      i. Number of low-income students.
         Source: http://profiles.doe.mass.edu/selectedpopulations.aspx?mode=school&orderBy=&year=2006.
      ii. Percentage of low-income students.
         Number of low-income students / total enrollment.

   j. Mathematics baseline proficiency index.
      Source: http://www.doe.mass.edu/sda/ayp/cycleII.

Seven program schools had missing data for the mathematics baseline proficiency index, which was serving as the measure for academic performance for each school in the matching equation. Therefore, it was substituted with the 2005


mathematics Composite Proficiency Index (CPI) score to get an accurate academic measure for each school. The 2005 mathematics CPI score was taken from 1999–2006 AYP History Data for Schools, which can be found at http://www.doe.mass.edu/sda/ayp/cycleIV.

4. Charter and alternative schools were deleted from the master file because they would not have been eligible for the initial program and because their populations differ significantly in many cases from those of regular schools.

5. After the covariate matching was performed on this database, a new variable, STUDY GROUP, was created to determine if a school is defined as a program school, a comparison school, or an other school.

6. The following additional variables on achievement scores were collected for the schools that were either program or comparison schools (English scaled and raw scores were also collected and added to the database):

   a. 2001 mathematics achievement mean scaled score and mean raw score.

   b. 2002 mathematics achievement mean scaled score and mean raw score.

   c. 2003 mathematics achievement mean scaled score and mean raw score.

   d. 2004 mathematics achievement mean scaled score and mean raw score.

   e. 2005 mathematics achievement mean scaled score and mean raw score.

   f. 2006 mathematics achievement mean scaled score and mean raw score.
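Steps 3–6 amount to assembling one school-level table with derived percentage variables and a study-group flag. The sketch below is a hypothetical illustration in Python/pandas (the study's master database was built in SPSS); column names and values are ours, not the study's.

# Illustrative sketch of the database steps above (the study used SPSS).
# Column names and values are hypothetical.
import pandas as pd

schools = pd.DataFrame({
    "school_name": ["School A", "School B", "School C"],
    "total_enrollment": [480, 610, 350],
    "lep_students": [48, 20, 70],
    "low_income_students": [360, 90, 300],
    "program_school": [True, False, False],
    "matched_comparison": [False, True, False],
})

# Steps 3.h-3.i: convert counts to percentages of enrollment.
schools["pct_lep"] = schools["lep_students"] / schools["total_enrollment"] * 100
schools["pct_low_income"] = schools["low_income_students"] / schools["total_enrollment"] * 100

# Step 5: flag each school as program, comparison, or other.
def study_group(row):
    if row["program_school"]:
        return "program"
    if row["matched_comparison"]:
        return "comparison"
    return "other"

schools["STUDY_GROUP"] = schools.apply(study_group, axis=1)
print(schools[["school_name", "pct_lep", "pct_low_income", "STUDY_GROUP"]])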


APPENDIX C
IDENTIFICATION OF COMPARISON SCHOOLS

Random assignment to study the impact of benchmark assessment was not possible because selected districts and schools had already been awarded the Comprehensive School Reform grants. When members of the group have already been assigned, the research team must design procedures for developing a satisfactory comparison group. Fortunately, researchers have been developing such methods for matching or equating individuals, groups, schools, or other units for comparison in an evaluation study for many years. And as one might imagine, there are many such statistical approaches to creating comparison groups. All such approaches, however, have one limitation that a well implemented randomized study does not: the failure to control for the influence of unobserved (unmeasured or unknown) variables.

Of the many statistical equating techniques, one of the more reliable and frequently used is covariate matching. How does covariate matching work?

Let's say we have unit 1 already assigned to the program and wish to match another unit from a pool of observations to 1 to begin creating a comparison group that did not receive the program. When using covariate matching, the research team would first identify the known and measured factors that would be influential on the outcome (in this instance, academic performance) regardless of the program. Influential in this case means that less or more of that characteristic has been found, independent of the program, to influence scores on the outcome measure. These are known as covariates. To reduce or remove their influence on the outcome measure, one would select the unit that is closest to 1 on those covariate scores or characteristics. Let's say that X is the closest in the eligibility pool to 1. By matching X to 1, there would be two units that are very similar on the covariates. If this is done for each program unit, theoretically the influence of the covariates will be removed (or considerably reduced), the differences between the groups on important known factors will be ameliorated, and a potential explanation for observed results besides program effectiveness or ineffectiveness (that the groups were different before the program on important covariates) will be seriously countered (Rubin, 1980).

For the current project, covariate matching is used to create a set of comparison schools for the study.11 The original proposal was to match comparison schools using three factors: the Socio-Demographic Composite Index (SCI); the school's adjusted baseline academic performance, holding constant the school's SCI; and the type of geographic location the school sits in (urban, suburban, rural, and so on). Using two comparison schools for each program school was eventually chosen to counter the possibility of idiosyncratic matching.

The first order of business was to create the SCI. The SCI is simply the predicted mathematics score (using the school's 2005 average baseline mathematics score), using a multivariate regression analysis, for each school based on a series of social and demographic covariates. In other words, it is a prediction of students' 2005 mathematics score using important covariates such as school enrollment, percentage of low-income students, percentage of English language learners, and percentage of minority/ethnic groups.12 In short, multivariate regression was used to predict what the average mathematics score for the school is, given knowledge about the school's characteristics (how many kids are enrolled, the percentage of low-income students, the percentage of English language learners, and the percentage of minority students).13

    One o the advantages in using multivariateregression is that the actors comprising the SCIare weighted proportionately to how much o the


For example, if poverty is a more substantial factor in school success than enrollment, the regression will give more weight to the percentage of low-income students at a school than to the school's total enrollment. This is exactly what happened with the results. Table C1 provides the SCI for five middle schools with varying percentages of low-income students. As the percentage of low-income students increases, the SCI decreases.

The other covariate used in the matching procedure is the school's adjusted academic score. This is simply the actual score minus the predicted score. In other words, if the SCI is subtracted from the 2005 CPI mathematics score, the result is the adjusted value that was used as the other major covariate in the matching procedure. Table C2 shows this relationship (numbers do not add up perfectly because of rounding).

The multivariate regression was conducted in both Stata and SPSS and, as might be expected, there was perfect agreement on the results.
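To make these two steps concrete, the sketch below shows one way to compute the SCI and the adjusted score in Python; the authors used Stata and SPSS, and the file and column names here (schools_2005.csv, cpi_math_2005, pct_low_income, and so on) are illustrative placeholders rather than the variables in the original dataset.

    import pandas as pd
    import statsmodels.api as sm

    # Illustrative school-level file; column names are placeholders.
    schools = pd.read_csv("schools_2005.csv")

    covariates = ["enrollment", "pct_low_income", "pct_ell", "pct_minority"]
    X = sm.add_constant(schools[covariates])
    y = schools["cpi_math_2005"]  # 2005 CPI mathematics score

    # Multivariate regression of the 2005 mathematics score on the covariates.
    model = sm.OLS(y, X).fit()

    # SCI: the score predicted from the social and demographic covariates alone.
    schools["sci"] = model.fittedvalues

    # Adjusted academic score: actual score minus predicted score (SCI).
    schools["adjusted"] = schools["cpi_math_2005"] - schools["sci"]

The heavier a covariate's weight in the fitted regression, the more it shifts the predicted score, which is the proportional weighting described above.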

What next? Finding similar schools using Mahalanobis Distance measures

Variables like the SCI and the adjusted mathematics score form a multidimensional space in which each school can now be plotted.14 The middle of this multidimensional space is known as a centroid.15 The Mahalanobis Distance can be computed as the distance of a case or observation (such as a school) from the centroid in this multidimensional space. Schools with similar Mahalanobis Distance measures are considered to be close in this multidimensional space, and therefore more similar. One way to think about how Mahalanobis Distance measures are computed is shown in the figure at http://www.jennessent.com/images/graph_illustration_small_.gif.

Using a specialized software program (Stata), Mahalanobis Distance measures were computed for each of the eligible middle schools in the state.16 Table C3 provides the SCI, the adjusted mathematics achievement score, and the Mahalanobis Distance measure for five schools. If school 1 were a program school and the remaining schools formed the pool of potential comparison schools, the best match according to the analysis would be school 2, because the Mahalanobis Distance measure for school 2 is closer to school 1's than that of any of the other potential schools.
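The distance-from-centroid calculation itself is straightforward; the following is a minimal Python sketch of the idea (the authors used Stata), seeded with the five illustrative schools from table C3. It will not reproduce the distances in that table, which were computed over the full pool of eligible schools.

    import numpy as np
    import pandas as pd

    def mahalanobis_from_centroid(X):
        """Distance of each row (one school) from the centroid of all rows."""
        centroid = X.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
        diffs = X - centroid
        # d_i = sqrt((x_i - centroid)' S^-1 (x_i - centroid))
        return np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))

    # Matching covariates per school: the SCI and the adjusted mathematics score.
    schools = pd.DataFrame({"sci": [86, 77, 68, 54, 47],
                            "adjusted": [4, 3, -10, -3, 2]})
    schools["distance"] = mahalanobis_from_centroid(schools.to_numpy(dtype=float))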

Note that the study plan required that two schools be matched to each program school. The 23 schools remaining in the program group therefore required 46 comparison schools. To complete the matching process, a list was printed of the 23 program schools with each school's Mahalanobis Distance measure and its population geographic area code (that is, the type of geographic location the school serves, such as small urban or suburban), as schools were classified according to the National Center for Education Statistics Common Core of Data.

Table C1
Percentage of low-income students and Socio-Demographic Composite Index for five middle schools

School        Percentage of low-income students    Socio-Demographic Composite Index
School 1       0                                   86
School 2      10                                   77
School 3      25                                   68
School 4      75                                   54
School 5      95                                   47

Source: Authors' analysis based on data described in text.

Table C2
2005 Composite Performance Index mathematics score, Socio-Demographic Composite Index, and adjusted academic score for five middle schools

School        2005 Composite Performance Index score    Socio-Demographic Composite Index    Adjusted score
School 1      90                                        86                                     4
School 2      80                                        77                                     3
School 3      59                                        68                                   -10
School 4      51                                        54                                    -3
School 5      50                                        47                                     2

Source: Authors' analysis based on data described in text.


A similar list was printed of the remaining middle schools that comprised the pool of potential comparison schools. The matching was conducted by simply selecting the two schools with the closest Mahalanobis Distance measures and the same or a very similar population geographic area code. This is also known as nearest neighbor matching. The initial matching resulted in 46 comparison schools matched to 23 program schools. The findings are presented in table C4.
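A minimal sketch of this nearest neighbor step (in Python, with illustrative column names; the report describes doing this from printed, sorted lists): for each program school, take the two not-yet-used pool schools in the same locale category whose distance scores are closest.

    import pandas as pd

    def nearest_neighbor_matches(program, pool, n_matches=2):
        """Two pool schools per program school, closest on the distance score,
        drawn from the same geographic locale and each used only once."""
        available = pool.copy()
        matches = {}
        for school_id, row in program.iterrows():
            same_locale = available[available["locale"] == row["locale"]]
            gap = (same_locale["distance"] - row["distance"]).abs()
            chosen = list(gap.nsmallest(n_matches).index)
            matches[school_id] = chosen
            available = available.drop(chosen)  # a comparison school is used once
        return matches

Here program and pool are dataframes indexed by a school identifier, each carrying a locale code and the Mahalanobis Distance score for the school.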

Because there are a variety of matching methods, and some variations on using Mahalanobis Distance measures for matching, a replication of the findings was initiated.17

David Kantor developed a different procedure for using Mahalanobis Distances to form a comparison group in Stata, called Mahapick.18 Rather than compute Mahalanobis Distance measures from the centroid of a multidimensional space, as in the earlier procedure, Mahapick creates a measure based on the distance from a treated observation to every other observation. It then chooses the best matches for that treated observation and makes a record of these matches. It then drops the measure and goes on to repeat the process on the next treated observation.
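The contrast with the centroid-based approach can be sketched as follows (a simplified Python analogue of the idea, not Kantor's Stata code): distances are measured from each treated school to every candidate in the pool, and the closest candidates are recorded before moving on to the next treated school.

    import numpy as np

    def pairwise_matches(treated, pool, n_matches=2):
        """For each treated row, rank pool rows by Mahalanobis distance to it."""
        combined = np.vstack([treated, pool])
        cov_inv = np.linalg.inv(np.cov(combined, rowvar=False))
        matches = []
        for t in treated:
            diffs = pool - t  # distances are taken from this treated school
            d = np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))
            matches.append(np.argsort(d)[:n_matches])  # closest candidates
        return matches

Because each treated school is handled independently, nothing prevents the same pool school from being chosen for two different treated schools, which is the nonexclusive-match behavior discussed below.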

Because Mahapick uses a different method and produces a different Mahalanobis Distance score, the goal was not to confirm whether the scores were identical. The goal was to see if a similar set of schools would be constructed using a different matching method.19

One problem with using Mahapick is that the computation does not produce exclusive matches. The procedure selected 12 duplicate matches. Nine schools, however, were the result of exact matches in both procedures. Nearly half of the schools were selected by both (22 of 46), and this number might have been higher had Mahapick selected exclusive matches, that is, if it had not matched one comparison school to more than one program school.

Both methods produced a large number of mid-sized urban schools with Mahalanobis Distance scores that clustered very closely together. This is not surprising, as the Mahalanobis Distance measure scores (using the initial Stata procedure) for the program schools clustered between 12.25 and 1.56. This meant that the comparison group schools would likely be drawn from schools in the eligible pool whose distance measure scores also fell into this range; 166 schools in the eligibility pool had Mahalanobis Distance measure scores in that range.

Because the two procedures did not produce the exact same 46 comparison schools, a combination of the results from the initial Stata procedure and Mahapick was used to select the next iteration. Putting aside the nine exact matches produced by both procedures, 37 comparison schools were left to identify. The selected matches for each program school provided by the initial Stata procedure and by Mahapick were examined. Once two comparison schools for a program school were selected, they were removed from consideration. In those cases in which more than two schools were identified by the two matching procedures, the one with the higher adjusted 2005 mathematics score was selected. This decision is a conservative one and initially presents a bias favoring the comparison group. The results of using both procedures to perform the matching are provided in table C6.

Table C3
Socio-Demographic Composite Index, adjusted academic score, and Mahalanobis Distance measure for five middle schools

School        Socio-Demographic Composite Index    Adjusted score    Mahalanobis Distance score
School 1      86                                     4               49.18
School 2      77                                     3               38.54
School 3      68                                   -10               32.30
School 4      54                                    -3               19.15
School 5      47                                     2               14.87

Source: Authors' analysis based on data described in text.


Following this matching procedure, the pretest achievement data files from 2001–05 were added. During this process, it became known that 1 of the original 25 schools, because it is a reconfigured school, did not have pretest data for eighth graders. This school was dropped from the program sample (along with the corresponding two comparison schools). In addition, another school from the original matching set was dropped because it too did not have eighth-grade pretest data. It was replaced with another comparable school.

    How well did the matching procedure work?

To test the effectiveness of the matching procedure, the 22 program schools and the comparison schools were compared across the variables in the school-level dataset. Table C4 presents the results from that comparison. In summary, the equating or matching process resulted in certain variables favoring the comparison schools (for example, higher baseline mathematics and English language arts scores). Some of this might be due to the matching procedure, as comparison schools with a higher adjusted 2005 mathematics score were selected when there was more than one possible school to pick from for a match. Two of the variables were statistically significant (2005 baseline mathematics scores and percentage of African American students).

Both of these differences were troubling, especially given that these variables were included in the covariate matching process. To investigate further whether there were systematic differences between the program and comparison schools across the pretest achievement years, analyses of scaled eighth-grade mathematics scores for each year of available pretest data were examined.

Table C4
Comparison of means and medians between program and comparison schools

                                                           Program schools (n=22)    Comparison schools (n=44)
Characteristic                                             Mean       Median         Mean       Median
2005 Mathematics Composite Performance Index               53.10*     53.80*         60.56*     59.85*
2005 English language arts Composite Performance Index     69.51      69.50          74.17      76.90
Enrollment                                                 620.23     635.00         562.95     578.50
Low-income students (%)                                     61.32      64.60          54.71      51.70
English language learners (%)                               16.74      13.90          11.23       7.80
African American (%)                                         7.45*      6.10*         14.99*     10.05*
Hispanic (%)                                                29.81      23.00          26.40      14.10
Asian (%)                                                    9.60       4.00           6.60       5.20
White (%)                                                   51.80      55.20          49.80      47.60
Native American (%)                                          0.26       0.20           0.41       0.25
Hawaiian/Pacific Islander (%)                                0.11       0.00           0.08       0.00
Multirace non-Hispanic (%)                                   1.00       0.70           1.73       1.30
Race-ethnicity composite (%)                                48.20      44.90          50.20      52.40
Highly qualified teachers (%)                               90.10      91.90          89.40      94.70

School geographic location                                 n          Percent        n          Percent
Mid-size city                                              17         77.30          34         77.30
Urban fringe of large city                                  3         13.60           6         13.60
Urban fringe of mid-size city                                2          9.10           4          9.10

* Statistically significant at the 0.05 level using a t-test (two-tailed).
Note: Race-ethnicity composite is the sum of African American, Hispanic, Asian, Native American, Hawaiian, Pacific Islander, and multirace non-Hispanic.
Source: Authors' analysis based on data described in text.


The differences between the two groups for each year of pretest data were statistically significant and favored the comparison schools. Table C5 presents the results of the t-tests. Rerunning the t-tests using raw scores did not change the results.
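These equivalence checks are ordinary independent samples t-tests, reported with and without the equal-variance assumption as in table C5. A minimal, self-contained sketch with made-up scores (the report's tests were run in a statistical package on the actual school means):

    import numpy as np
    from scipy import stats

    # One mean scaled mathematics score per school (illustrative values only).
    program_schools = np.array([224.1, 226.3, 221.8, 225.0, 223.4])
    comparison_schools = np.array([227.5, 225.9, 229.0, 226.2, 228.4, 224.7])

    # The two rows per year in table C5: equal variances assumed and not assumed.
    print(stats.ttest_ind(program_schools, comparison_schools, equal_var=True))
    print(stats.ttest_ind(program_schools, comparison_schools, equal_var=False))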

    Resolving matching problems

Revisiting the conservative tie-breaker. Note that when there were multiple schools eligible for the matching, the school that had higher achievement scores (using the 2005 CPI baseline measure) was selected. This might have explained the lack of equivalence on the pretest baseline mathematics scores, with the procedure inflating these pretest scores. Because achievement scores are highly correlated and the 2005 CPI actually represents an average of the 2004 and 2005 MCAS mathematics scores, it was reasonable to assume that this conservative decision was responsible for the lack of equivalence across all years.

Surprisingly, however, when the data were closely examined, the equivalence problem was not limited to schools where the conservative tie-breaker decision had been made. The higher pretest scores for the comparison schools were consistent across most of the program schools, including those where such choices were not made. This led to further investigations.

Revisiting Mahapick. Mahapick does not permit exclusive matches, so running the procedure results in the same schools getting selected as comparisons for more than one program school. This happened with eleven schools. One attempt was made to remedy this problem.

Table C5
T-test of equivalence on scaled mathematics test scores for program and comparison schools, 2001–05

                                 t        df       Sig. (two-tailed)   Mean difference   Std. error difference   95% CI lower   95% CI upper
MCAS 2001 grade 8 mathematics scaled score
  Equal variances assumed       -2.161   50.000    0.036               -3.72213          1.72237                 -7.18162       -0.26265
  Equal variances not assumed   -2.480   44.996    0.017               -3.72213          1.50087                 -6.74506       -0.69921
MCAS 2002 grade 8 mathematics scaled score
  Equal variances assumed       -2.431   51.000    0.019               -4.33164          1.78158                 -7.90832       -0.75496
  Equal variances not assumed   -2.918   48.594    0.005               -4.33164          1.48465                 -7.31579       -1.34749
MCAS 2003 grade 8 mathematics scaled score
  Equal variances assumed       -2.043   55.000    0.046               -3.11489          1.52486                 -6.17077       -0.05900
  Equal variances not assumed   -2.362   47.665    0.022               -3.11489          1.31898                 -5.76736       -0.46241
MCAS 2004 grade 8 mathematics scaled score
  Equal variances assumed       -2.070   59.000    0.043               -3.04558          1.47100                 -5.98904       -0.10212
  Equal variances not assumed   -2.369   48.800    0.022               -3.04558          1.28561                 -5.62938       -0.46179
MCAS 2005 grade 8 mathematics scaled score
  Equal variances assumed       -2.342   64.000    0.022               -3.24972          1.38767                 -6.02191       -0.47753
  Equal variances not assumed   -2.723   60.799    0.008               -3.24972          1.19360                 -5.63664       -0.86281
MCAS 2006 grade 8 mathematics scaled score
  Equal variances assumed       -1.945   64.000    0.056               -2.91159          1.49668                 -5.90155        0.07837
  Equal variances not assumed   -2.103   51.800    0.040               -2.91159          1.38451                 -5.69006       -0.13312

Source: Authors' analysis based on data described in text.


Mahapick creates the best match for each program observation (in this case, the program school), beginning with the very first case. By removing the two matched comparison schools after each match is made, the problem of nonexclusive selections is eliminated.

Mahapick was therefore used in this way, running 22 separate analyses, one for each program school. As the two comparison matches were made, they were eliminated from the next run, and so on, until the Mahapick procedure had selected the unique comparison schools that were needed. Although this procedure produced a set of schools that were closer on the pretest achievement measures (differences were not significant), the measures were still higher for comparison schools than for program schools during 2001 and 2002 and close to significant (table C6).
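That run-and-remove sequence can be sketched as a loop (again a Python stand-in for the repeated Mahapick runs, with illustrative column names): after a program school is matched, its two comparison schools leave the pool before the next program school is processed.

    import numpy as np
    import pandas as pd

    def sequential_exclusive_matches(program, pool, cols=("sci", "adjusted"),
                                     n_matches=2):
        """Match program schools one at a time, removing each chosen
        comparison school from the pool so that no school is reused."""
        cols = list(cols)
        cov_inv = np.linalg.inv(np.cov(pd.concat([program, pool])[cols],
                                       rowvar=False))
        remaining = pool.copy()
        matches = {}
        for school_id, row in program.iterrows():
            diffs = remaining[cols].to_numpy(float) - row[cols].to_numpy(float)
            d = np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))
            chosen = list(remaining.index[np.argsort(d)[:n_matches]])
            matches[school_id] = chosen
            remaining = remaining.drop(chosen)  # exclusive: no comparison reused
        return matches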

Adding the 2005 CPI mathematics baseline score as an additional sort variable. Finally, it was determined that the best way to create equivalence on the pretest achievement measures was to redo the sort and match again. This time, instead of sorting solely on the Mahalanobis Distance measure score and geographic location, the 2005 CPI baseline mathematics score was included. Printouts of both program and potentially eligible comparison schools sorted on these three variables were prepared. The priority was to ensure that schools were as close as possible on the 2005 CPI baseline mathematics score and the distance measure within each geographic location category. The use of the 2005 CPI baseline mathematics score together with the distance measure resulted in a new set of comparison schools. Note that one new comparison school did not have any pretest achievement data and was replaced with a similar school.

Table C6
T-test of equivalence on scaled mathematics test scores, 2001–05, Mahapick matching

                                 t        df       Sig. (two-tailed)   Mean difference   Std. error difference   95% CI lower   95% CI upper
MCAS 2001 grade 8 mathematics scaled score
  Equal variances assumed       -1.428   46.000    0.160               -2.55277          1.78826                 -6.15235        1.04681
  Equal variances not assumed   -1.616   44.688    0.113               -2.55277          1.57942                 -5.73450        0.62895
MCAS 2002 grade 8 mathematics scaled score
  Equal variances assumed       -1.514   47.000    0.137               -2.41042          1.59251                 -5.61413        0.79329
  Equal variances not assumed   -1.704   44.217    0.095               -2.41042          1.41459                 -5.26095        0.44011
MCAS 2003 grade 8 mathematics scaled score
  Equal variances assumed       -0.803   50.000    0.426               -1.15321          1.43681                 -4.03912        1.73270
  Equal variances not assumed   -0.884   44.829    0.382               -1.15321          1.30506                 -3.78200        1.47559
MCAS 2004 grade 8 mathematics scaled score
  Equal variances assumed       -0.167   58.000    0.868               -0.23275          1.39500                 -3.02515        2.55966
  Equal variances not assumed   -0.186   46.255    0.853               -0.23275          1.25220                 -2.75292        2.28743
MCAS 2005 grade 8 mathematics scaled score
  Equal variances assumed       -0.303   63.000    0.763               -0.34596          1.14138                 -2.62682        1.93491
  Equal variances not assumed   -0.326   51.785    0.745               -0.34596          1.05996                 -2.47313        1.78122
MCAS 2006 grade 8 mathematics scaled score
  Equal variances assumed       -0.261   64.000    0.795               -0.36091          1.38324                 -3.12426        2.40244
  Equal variances not assumed   -0.272   47.280    0.786               -0.36091          1.32468                 -3.02541        2.30359

Source: Authors' analysis based on data described in text.


Although there was considerable overlap with the earlier listings of schools, the t-tests of equivalence showed a near perfect balance on the pretest achievement measures (table C7).

The one variable that remained statistically significant was the difference in the percentage of African American students enrolled in the school. One possible reason for this imbalance is that the state grants were provided to a rather constricted range of middle schools in Massachusetts. These were mostly small-city urban schools with diverse populations, and their average 2005 CPI mathematics scores clustered in the lower to middle part of the distribution. With all these factors in play, there was a limited pool of comparison schools for achieving perfect balance on all pre-existing variables. Taylor (198) notes that matching on some variables may result in mismatching on others.

The final set of matches represents the most rigorous set of comparison schools that could have been selected given the limited eligibility pool. Although the imbalance in the percentage of African American students enrolled at the schools remains troubling,20 the variable (AFAM) was introduced as a covariate in the final time series analysis. It makes no difference in the results (see appendix D).

Table C7
Comparison of means and medians for program and comparison schools

                                                           Program schools (n=22)    Comparison schools (n=44)
Characteristic                                             Mean       Median         Mean       Median
2005 Mathematics Composite Performance Index               53.10      53.80          52.82      54.25
2005 English language arts Composite Performance Index     77.15      77.05          73.86      74.05
Enrollment                                                 620.23     635.00         547.73     577.50
Low-income students (%)                                     61.32      64.60          62.92      65.50
English language learners (%)                               16.74      13.90          11.02       9.90
African American (%)                                         7.45*      6.10*         15.36*     11.30*
Hispanic (%)                                                29.81      23.00          33.93      26.65
Asian (%)                                                    9.58       4.00           5.24       2.85
White (%)                                                   51.77      55.15          43.48      37.95
Native American (%)                                          0.26       0.20           0.36       0.25
Hawaiian/Pacific Islander (%)                                0.11*      0.00*          0.02*      0.00*
Multirace non-Hispanic (%)                                   1.00       0.70           1.60       1.25
Race-ethnicity composite (%)                                48.22      44.90          56.52      62.05
Highly qualified teachers (%)                               90.08      91.90          86.42      88.55

School geographic location                                 n          Percent        n          Percent
Mid-size city                                              17         77.3           34         77.3
Urban fringe of large city                                  3         13.6            6         13.6
Urban fringe of mid-size city                                2          9.1            4          9.1

* Statistically significant at the 0.05 level using a t-test (two-tailed).
Note: Race-ethnicity composite is the sum of African American, Hispanic, Asian, Native American, Hawaiian, Pacific Islander, and multirace non-Hispanic.
Source: Authors' analysis based on data described in text.


APPENDIX D

INTERRUPTED TIME SERIES ANALYSIS

Conventional interrupted time series analysis generally requires multiple data points before and after an intervention (or interruption) and the use of administrative or other data that are regularly and uniformly collected over time. It is common, for example, in interrupted time series analyses of the effects of laws or policies on crime for researchers to use monthly or weekly crime data to create more pretest and post-test points. All things being equal, the more data points in a time series, the more stable the analysis.

In education, some commonly used achievement outcomes (such as standardized mathematics test scores) are administered only once per year. Thus, the multiple pretests and post-tests that are common to conventional time series analyses may not be available when evaluating the impact of school innovations on student achievement. In this report only five years of annual mathematics test score data and one post-test administration were available. Clearly, with only six data points, it would not be possible to conduct a conventional time series analysis.

Bloom's (2003) method for short interrupted time series, outlined in an evaluation of Accelerated Schools by MDRC, served as the analysis strategy. In short, Bloom (2003) argues that his approach can ". . . measure the impact of a reform as the subsequent deviation from the past pattern of student performance for a specific grade" (p. 5). Bloom's method establishes the trend in student performance over time and then analyzes the post-program data to determine whether there is a departure from that trend. As noted in the report, this is a tricky business, and trend departures can often be statistically significant. Although Bloom (2003) outlines his approach for use in evaluating the impact on a set of program schools alone, he recognizes the importance of having a well matched comparison group of schools to strengthen causal inferences.

Note, however, that Bloom's (2003) paper describes an evaluation that had five full years of pretest student-level test score data and five full years of post-test student-level data. In this report only school-level means were available, with a single post-intervention year. Bloom (2003) might argue that this is not a fair test, as one year does not allow the school reform to be implemented to its fullest strength. Nonetheless, this should be viewed as a valuable foundational effort in the Regional Educational Laboratory's research on the impact of benchmark assessment.

    Reconstructing the database

The first order of business was to convert the database from one in which each row represented all the data for each school (66 rows of data) to one in which each row represented a different year of either pretest or post-test information. For example, the 44 comparison group schools represented 230 unique rows of data, and the 22 program schools represented 115 distinct rows of pretest or post-test information. The database, after reconstruction, consisted of 345 total rows of data (rather than 66). Variables were also renamed and reordered in the database to ease analysis.21
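A sketch of that wide-to-long restructuring (in Python/pandas rather than the package the authors used; the file and column names are illustrative placeholders):

    import pandas as pd

    # Wide layout: one row per school, one column per year of mean scaled scores.
    wide = pd.read_csv("schools_wide.csv")  # school_id, treat, math2001 ... math2006

    long = wide.melt(id_vars=["school_id", "treat"],
                     value_vars=[f"math{y}" for y in range(2001, 2007)],
                     var_name="year", value_name="math_scaled")
    long["year"] = long["year"].str.replace("math", "").astype(int)
    long["y2006"] = (long["year"] == 2006).astype(int)  # post-intervention indicator
    long = long.dropna(subset=["math_scaled"])
    # 66 schools with up to six years each yields the 345 school-year rows
    # analyzed in the models described below.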

A series of models analogous to Bloom's (2003) recommendations was then run to determine, using more advanced statistical analysis, whether there was any observed program impact on eighth-grade mathematics outcomes. Bloom's paper provides three potential time series models to take into account when constructing statistical analyses, and each has different implications for how the analysis is done. Bloom argues that the type of statistical model must take into account the type of trend line for the data. The three models are the linear trend model (in which the outcome variable increases incrementally over time), the baseline mean model (in which the outcome variable appears to be a flat line over time with no discernible increase or decrease), and the nonlinear baseline trend model (in which the outcome scores may be moving in a curvilinear or other pattern).


The outcome data from the pretests clearly showed that the most applicable model was likely the baseline mean model, but given that there was a slight increase over time (from 2001 to 2006), the linear trend model could not be ruled out. So the analyses were run using both models.

For each of the two models, analyses were run to determine whether there was an effect in 2006 for program and comparison schools separately; a difference-in-difference effect (or effect between program and comparison schools) was then estimated; and finally covariates were introduced to determine whether any of the estimates for time or for program impact changed when variables such as the percentage of African American students enrolled at the schools were added.22 Variables used in the analysis are described in table D1.

First, using a baseline mean model as described by Bloom (2003), the researchers investigated whether there was a perceptible immediate change from 2001–05 to 2006. This was done for comparison schools and program schools separately. When looking at program schools alone in table D2, there appears to be a significant increase in 2006. This increase (Y2006) represents a 1.86 test point improvement over what would have been expected in the absence of the program.

It would have been possible to conclude from the results in table D2 that benchmark assessment had a statistically significant and positive impact on implementation-year mathematics outcomes. But table D3 highlights the importance of the comparison group. The results for comparison schools also show a significant increase in 2006. The increase is modest (1.48 test points) but also statistically significant. Thus, both program and comparison schools experienced significant and positive upward movement in 2006 that departed from past performance.

The difference-in-difference test, which is the most critical because it provides a direct comparison between the program and comparison schools, shows no significant difference, as highlighted in table D4. There is a significant increase in 2006, as expected, because this analysis combines the program and comparison schools (IY2006_1).

Table D1
Variables used in the analysis

Variable       Description
afam           Percentage of students at the school who are African American.
asian          Percentage of students at the school who are Asian.
hisp           Percentage of students at the school who are Hispanic.
hqt            Percentage of highly qualified teachers at the school.
Itreat_1       Equal to one for program schools.
Intercept      Mean scaled score.
IY2006_1       School year is 2006, for program and comparison schools.
Iy20xtre~1     Interaction of year 2006 and whether the school is a program school.
lep            Percentage of students at the school classified as English language learners.
lowinc         Percentage of students at the school classified as low income.
total          Number of students at the school.
white          Percentage of students at the school who are White.
Y2006          School year is 2006.

Table D2
Baseline mean model, program schools (n=22 schools, 115 data points)

Variable      Coefficient    Standard error    Probability
Intercept     225.11         0.822             0.000
Y2006           1.86         0.556             0.001

Source: Authors' analysis based on data described in text.

Table D3
Baseline mean model, comparison schools (n=44 schools, 230 data points)

Variable      Coefficient    Standard error    Probability
Intercept     224.69         0.781             0.000
Y2006           1.48         0.57              0.009

Source: Authors' analysis based on data described in text.


Whether a school was in the program or comparison group did not appear to have any impact (Itreat_1). But the key variable is the interaction between year 2006 and whether a school was in the program group, as represented by Iy20xtre~1. The program effect is about 0.38 of a mathematics test point, but it is not significant and could have occurred by chance alone. The most accurate interpretation is that both groups are increasing slightly, but the difference between them is negligible. Therefore, any observable increase cannot be attributed to the program.
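A sketch of this baseline mean difference-in-difference specification on the reshaped file (Python/statsmodels as a stand-in for the authors' software; the dataframe long and the columns treat, y2006, and math_scaled are the illustrative names from the reshaping sketch above):

    import statsmodels.formula.api as smf

    # long is the reshaped school-year dataframe sketched earlier.
    # Baseline mean model: a flat pre-program mean plus a 2006 shift.
    # The y2006:treat interaction is the difference-in-difference estimate
    # (about 0.38 scaled points in table D4, not statistically significant).
    did = smf.ols("math_scaled ~ y2006 + treat + y2006:treat", data=long).fit()
    print(did.summary())

    # Fitting the same model separately by group gives the single-group
    # estimates reported in tables D2 and D3.
    program_only = smf.ols("math_scaled ~ y2006", data=long[long["treat"] == 1]).fit()

Adding school-level covariates, as in table D5, is a matter of appending terms for the variables listed in table D1 to the right-hand side of the formula.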

Although covariates should have been controlled for by the matching procedure, the analyses in appendix C showed that there were some differences on racial/ethnic variables. These and other covariates were introduced into the difference-in-difference analyses to see whether the estimate for program effects would change. As the reader can see from table D5, the introduction of a number of variables into the regression did not change the estimate for program impact.

The same analyses described above were repeated, but the linear trend model outlined in Bloom (2003) was now assumed instead of the baseline mean model. Assuming different models simply means that different statistical formulas were used to conduct the analyses. Table D6 presents the data for program schools alone. The table shows that when time is controlled in the analysis, the statistically significant effect for Y2006 (for the program schools separately) disappears.
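The linear trend variant simply adds a time term to the same regression; a minimal sketch under the same illustrative names:

    import statsmodels.formula.api as smf

    # Linear trend model: allow scores to drift across the pretest years.
    # long is the reshaped school-year dataframe sketched earlier;
    # time counts years since 2001.
    long["time"] = long["year"] - 2001
    trend = smf.ols("math_scaled ~ time + y2006 + treat + y2006:treat",
                    data=long).fit()

When the time term absorbs the gradual upward drift, the 2006 shift for either group alone is no longer distinguishable from that trend, which is the pattern seen in tables D6 and D7.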

The analysis for comparison schools alone was repeated in table D7. Again, the findings that were statistically significant under the baseline mean model drop out when the linear trend model is assumed.

Table D6
Linear trend model, program schools (n=22 schools, 115 data points)

Variable      Coefficient    Standard error    Probability
Intercept     224.370        0.825             0.000
Y2006           0.975        0.741             0.180
Time            0.325        0.174             0.060

Source: Authors' analysis based on data described in text.

Table D7
Linear trend model, comparison schools (n=44 schools, 230 data points)

Variable      Coefficient    Standard error    Probability
Intercept     224.260        0.887             0.000
Y2006           0.981        0.749             0.190
Time            0.187        0.180             0.300

Source: Authors' analysis based on data described in text.

Table D4
Baseline mean model, difference-in-difference estimates (n=66 schools, 345 data points)

Variable      Coefficient    Standard error    Probability
Intercept     224.690        0.722             0.000
IY2006_1        1.480        0.517             0.004
Itreat_1        0.421        1.250             0.340
Iy20xtre~1      0.379        0.899             0.420

Source: Authors' analysis based on data described in text.

Table D5
Baseline mean model, difference-in-difference estimates, with covariates introduced (n=66 schools, 345 data points)

Variable      Coefficient    Standard error    Probability
Intercept     242.140        22.290            0.000
IY2006_1        1.540        0.518             0.003
Itreat_1        0.159        0.948             0.860
Iy20xtre~1      0.416        0.900             0.640
afam            0.067        0.236             0.770
asian           0.043        0.248             0.860
hisp            0.087        0.228             0.700
white           0.145        0.230             0.520
total           0.001        0.001             0.440
lep             0.049        0.070             0.480
lowinc          0.236        0.036             0.000
hqt             0.076        0.039             0.050

Source: Authors' analysis based on data described in text.


Assuming the linear trend model, the difference-in-difference estimates in table D8 nearly replicate the results from the baseline mean model. Again, the program impact is about 0.37 of a scaled point on the mathematics test, but this difference is again not significant and could easily have occurred by chance.

In table D9 covariates were again introduced into the difference-in-difference analysis. The results are similar to the baseline mean model except that the time variable is also introduced into the analysis.

Finally, given that Massachusetts Comprehensive Assessment System (MCAS) mathematics scaled scores are transformations of raw scores,23 the researchers also examined the raw scores, which represent the actual numeric score that students received on the MCAS. The results were nearly identical, for both the baseline mean and linear trend models, to the analyses reported above. These analyses are available upon request.

Table D8
Linear trend model, difference-in-difference estimates (n=66 schools, 345 data points)

Variable      Coefficient    Standard error    Probability
Intercept     224.150        0.787             0.000
Time            0.234        0.132             0.070
IY2006_1        0.855        0.626             0.170
Itreat_1        0.432        1.260             0.730
Iy20xtre~1      0.368        0.894             0.410

Source: Authors' analysis based on data described in text.

Table D9
Linear trend model, difference-in-difference estimates, with covariates introduced (n=66 schools, 345 data points)

Variable      Coefficient    Standard error    Probability
Intercept     241.430        21.470            0.000
Time            0.274        0.133             0.040
IY2006_1        0.804        0.632             0.200
Itreat_1        0.187        0.915             0.830
Iy20xtre~1      0.410        0.902             0.640
afam            0.069        0.228             0.760
asian           0.050        0.239             0.830
hisp            0.091        0.219             0.670
white           0.145        0.222             0.510
total           0.001        0.001             0.380
lep             0.053        0.068             0.430
lowinc          0.234        0.035             0.000
hqt             0.077        0.038             0.040

Source: Authors' analysis based on data described in text.


APPENDIX E

MASSACHUSETTS CURRICULUM FRAMEWORKS FOR GRADE 8 MATHEMATICS (MAY 2004)

Number sense and operations strand

Topic 1: Numbers

Grades 7–8:

8.N.1. Compare, order, estimate, and translate among integers, fractions and mixed numbers (rational numbers), decimals, and percents.

8.N.2. Define, compare, order, and apply frequently used irrational numbers, such as √2 and π.

8.N.3. Use ratios and proportions in the solution of problems, in particular, problems involving unit rates, scale factors, and rate of change.

8.N.4. Represent numbers in scientific notation, and use them in calculations and problem situations.

8.N.5. Apply number theory concepts, including prime factorization and relatively prime numbers, to the solution of problems.

    Grade (All