Translating the statistical representation of the effects of education interventions



NCSER 2013-3000

    U.S. DEPARTMENT OF EDUCATION

Translating the Statistical Representation of the Effects of Education Interventions Into More Readily Interpretable Forms

    NOVEMBER 2012

    Mark W. Lipsey, Kelly Puzio, Cathy Yun, Michael A. Hebert, Kasia Steinka-Fry,

    Mikel W. Cole, Megan Roberts, Karen S. Anthony, and Matthew D. Busick


Translating the Statistical Representation of the Effects of Education Interventions Into More Readily Interpretable Forms

    NOVEMBER 2012

Mark W. Lipsey, Peabody Research Institute, Vanderbilt University

Kelly Puzio, Department of Teaching and Learning, Washington State University

Cathy Yun, Vanderbilt University

Michael A. Hebert, Department of Special Education and Communication Disorders, University of Nebraska-Lincoln

Kasia Steinka-Fry, Peabody Research Institute, Vanderbilt University

Mikel W. Cole, Eugene T. Moore School of Education, Clemson University

Megan Roberts, Hearing & Speech Sciences Department, Vanderbilt University

Karen S. Anthony, Vanderbilt University

    and

    Matthew D. Busick

    Vanderbilt University

    NCSER 2013-3000

    U.S. DEPARTMENT OF EDUCATION


This report was prepared for the National Center for Special Education Research, Institute of Education Sciences under Contract ED-IES-09-C-0021.

    Disclaimer

The Institute of Education Sciences (IES) at the U.S. Department of Education contracted with Command Decisions Systems & Solutions to develop a report that assists with the translation of effect size statistics into more readily interpretable forms for practitioners, policymakers, and researchers. The views expressed in this report are those of the authors, and they do not necessarily represent the opinions and positions of the Institute of Education Sciences or the U.S. Department of Education.

U.S. Department of Education
Arne Duncan, Secretary

Institute of Education Sciences
John Q. Easton, Director

National Center for Special Education Research
Deborah Speece, Commissioner

    November 2012

This report is in the public domain. While permission to reprint this publication is not necessary, the citation should be: Lipsey, M.W., Puzio, K., Yun, C., Hebert, M.A., Steinka-Fry, K., Cole, M.W., Roberts, M., Anthony, K.S., & Busick, M.D. (2012). Translating the Statistical Representation of the Effects of Education Interventions Into More Readily Interpretable Forms (NCSER 2013-3000). Washington, DC: National Center for Special Education Research, Institute of Education Sciences, U.S. Department of Education. This report is available on the IES website at http://ies.ed.gov/ncser/.

    Alternate Formats

Upon request, this report is available in alternate formats such as Braille, large print, audiotape, or computer diskette. For more information, please contact the Department's Alternate Format Center at 202-260-9895 or 202-205-8113.

Disclosure of Potential Conflicts of Interest

There are nine authors for this report with whom IES contracted to develop the discussion of the issues presented. Mark W. Lipsey, Cathy Yun, Kasia Steinka-Fry, Megan Roberts, Karen S. Anthony, and Matthew D. Busick are employees or graduate students at Vanderbilt University; Kelly Puzio is an employee at Washington State University; Michael A. Hebert is an employee at University of Nebraska-Lincoln; and Mikel W. Cole is an employee at Clemson University. The authors do not have financial interests that could be affected by the content in this report.


    Contents

List of Tables
List of Figures
Introduction
Organization and Key Themes
Inappropriate and Misleading Characterizations of the Magnitude of Intervention Effects
Representing Effects Descriptively
Configuring the Initial Statistics that Describe an Intervention Effect to Support Alternative Descriptive Representations
Covariate Adjustments to the Means on the Outcome Variable
Identifying or Obtaining Appropriate Effect Size Statistics
Descriptive Representations of Intervention Effects
Representation in Terms of the Original Metric
Standard Scores and Normal Curve Equivalents (NCE)
Grade Equivalent Scores
Assessing the Practical Significance of Intervention Effects
Benchmarking Against Normative Expectations for Academic Growth
Benchmarking Against Policy-Relevant Performance Gaps
Benchmarking Against Differences Among Students
Benchmarking Against Differences Among Schools
Benchmarking Against the Observed Effect Sizes for Similar Interventions
Benchmarking Effects Relative to Cost
Calculating Total Cost
Cost-effectiveness
Cost-benefit
References


    List of Tables

1. Pre-post change differentials that result in the same posttest difference

2. Upper percentiles for selected differences or gains from a lower percentile

3. Proportion of intervention cases above the mean of the control distribution

4. Relationship of the effect size and correlation coefficient to the BESD

5. Annual achievement gain: Mean effect sizes across seven nationally-normed tests

6. Demographic performance gaps on mean NAEP scores as effect sizes

7. Demographic performance gaps on SAT 9 scores in a large urban school district as effect sizes

8. Performance gaps between average and weak schools as effect sizes

9. Achievement effect sizes from randomized studies broken out by type of test and grade level

10. Achievement effect sizes from randomized studies broken out by type of intervention and target recipients

11. Estimated costs of two fictional high school interventions

12. Cost-effectiveness estimates for two fictional high school interventions


    List of Figures

1. Pre-post change for the three scenarios with the same posttest difference

2. Intervention and control distributions on an outcome variable

3. Percentile values on the control distribution of the means of the control and intervention groups

4. Proportion of the control and intervention distributions scoring above an externally defined proficiency threshold score

5. Binomial effect size display: Proportion of cases above and below the grand median

6. Mean reading grade equivalent (GE) scores of Success for All and control samples [Adapted from Slavin et al. 1996]


    Introduction

The superintendent of an urban school district reads an evaluation of the effects of a vocabulary building program on the reading ability of fifth graders in which the primary outcome measure was the CAT/5 reading achievement test. The mean posttest score for the intervention sample was 718 compared to 703 for the control sample. The vocabulary building program thus increased reading ability, on average, by 15 points on the CAT/5. According to the report, this difference is statistically significant, but is this a big effect or a trivial one? Do the students who participated in the program read a lot better now, or just a little better? If they were poor readers before, is this a big enough effect to now make them proficient readers? If they were behind their peers, have they now caught up?

Knowing that this intervention produced a statistically significant positive effect is not particularly helpful to the superintendent in our story. Someone intimately familiar with the CAT/5 (California Achievement Test, 5th edition; CTB/McGraw-Hill 1996) and its scoring may be able to look at these means and understand the magnitude of the effect in practical terms but, for most of us, these numbers have little inherent meaning. This situation is not unusual: the native statistical representations of the findings of studies of intervention effects often provide little insight into the practical magnitude and meaning of those effects. To communicate that important information to researchers, practitioners, and policymakers, those statistical representations must be translated into some form that makes their practical significance easier to infer. Even better would be some framework for directly assessing their practical significance.

This paper is directed to researchers who conduct and report education intervention studies. Its purpose is to stimulate and guide them to go a step beyond reporting the statistics that emerge from their analysis of the differences between experimental groups on the respective outcome variables. With what is often very minimal additional effort, those statistical representations can be translated into forms that allow their magnitude and practical significance to be more readily understood by the practitioners, policymakers, and even other researchers who are interested in the intervention that was evaluated.

    Organization and Key Themes

The primary purpose of this paper is to provide suggestions to researchers about ways to present statistical findings about the effects of educational interventions that might make the nature and magnitude of those effects easier to understand. These suggestions and the related discussion are framed within the context of studies that use experimental designs to compare measured outcomes for two groups of participants, one in an intervention condition and the other in a control condition. Though this is a common and, in many ways, prototypical form for studies of intervention effects, there are other important forms. Though not addressed directly, much of what is suggested here can be applied with modest adaptation to experimental studies that compare outcomes for more than two groups or compare conditions that do not include a control (e.g., compare different interventions), and to quasi-experiments that compare outcomes for nonrandomized groups. Other kinds of intervention studies that appear in educational research are beyond the scope of this paper. Most notable among those other kinds are observational studies, e.g., multivariate


analysis of the relationship across schools between natural variation in per pupil funding and student achievement, and single case research designs such as those often used in special education contexts to investigate the effects of interventions for children with low-incidence disabilities.

The discussion in the remainder of this paper is divided into three main sections, each addressing a relatively distinct aspect of the issue. The first section examines two common, but inappropriate and misleading, ways to characterize the magnitude of intervention effects. Its purpose is to caution researchers about the problems with these approaches and provide some context for consideration of better alternatives.

The second section reviews a number of ways to represent intervention effects descriptively. The focus there is on how to better communicate the nature and magnitude of the effect represented by the difference on an outcome variable between the intervention and control samples. For example, it may be possible to express that difference in terms of percentiles or the contrasting proportions of intervention and control participants scoring above a meaningful threshold value. Represented in terms such as those, the nature and magnitude of the intervention effect may be more easily understood and appreciated than when presented as means, regression coefficients, p-values, standard errors, and the like.

The point of departure for descriptive representations of an intervention effect is the set of statistics generated by whatever analysis the researcher uses to estimate that effect. Most relevant are the means and, for some purposes, the standard deviations on the outcome variable for the intervention and control groups. Alternatively, the point of departure might be the effect size estimate, which combines information from the group means and standard deviations and is an increasingly common and frequently recommended way to report intervention effects. However, not every analysis routine automatically generates the statistics that are most appropriate for directly deriving alternative descriptive representations or for computing the effect size statistic as an intermediate step in deriving such representations. This second section of the paper, therefore, begins with a subsection that provides advice about obtaining the basic statistics that support the various representations of intervention effects that are described in the subsections that follow it.

The third section of this paper sketches some approaches that might be used to go beyond descriptive representations to more directly reveal the practical significance of an intervention effect. To accomplish that, the observed effect must be assessed in relationship to some externally defined standard, target, or frame of reference that carries information about what constitutes practical significance in the respective intervention domain. Covered in that section are approaches that benchmark effects within such frameworks as normative growth, differences between students and schools with recognized practical significance, the effects found for other similar interventions, and cost.


    Inappropriate and Misleading Characterizations of

    the Magnitude of Intervention Effects

Some of the most common ways to characterize the effects found in studies of educational interventions are inappropriate or misleading and thus best avoided. The statistical tests routinely applied to the difference between the means on outcome variables for intervention and control samples, for instance, yield a p-value: the estimated probability that a difference that large would be found when, in fact, there was no difference in the population from which the samples were drawn. Very significant differences with, say, p < .001 are often trumpeted as if they were indicative of especially large and important effects, ones that are more significant than if p were only marginally significant (e.g., p = .10) or just conventionally significant (e.g., p = .05). Such interpretations are quite inappropriate. The p-values characterize only statistical significance, which bears no necessary relationship to practical significance or even to the statistical magnitude of the effect. Statistical significance is a function of the magnitude of the difference between the means, to be sure, but it is also heavily influenced by the sample size, the within-samples variance on the outcome variable, the covariates included in the analysis, and the type of statistical test applied. None of the latter is related in any way to the magnitude or importance of the effect.
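The dependence of statistical significance on sample size is easy to demonstrate numerically. The following sketch, which is illustrative and not from the report, uses a large-sample normal approximation to the two-sample t test to show the same standardized effect crossing the conventional .05 threshold purely as a function of how many students are in each group:

```python
from statistics import NormalDist

def approx_p_value(d, n_per_group):
    """Two-sided p-value for a standardized mean difference d between two
    equal-sized groups, using a large-sample normal approximation to the t test."""
    z = abs(d) * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(z))

# The identical d = .25 effect moves from "nonsignificant" to "highly
# significant" purely as the number of students per group grows.
for n in (50, 200, 500):
    print(n, round(approx_p_value(0.25, n), 4))
```

With 50 students per group this effect is nonsignificant; with 500 per group it is significant well beyond p < .001, even though the effect itself has not changed at all.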

When researchers go beyond simply presenting the intervention and control group means and the p-value for the significance test of their difference, the most common way to represent the effect is with a standardized effect size statistic. For continuous outcome variables, this is almost always the standardized mean difference effect size: the difference between the means on an outcome variable represented in standard deviation units. For example, a 10-point difference between the intervention and control on a reading achievement test with a pooled standard deviation of 40 for those two samples is .25 standard deviation units, that is, an effect size of .25.
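That arithmetic can be sketched as follows. The group means and sample sizes below are illustrative, not taken from the report; with equal group standard deviations, the sample sizes do not change the pooled value.

```python
import math

def standardized_mean_difference(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Difference between group means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

# The paper's example: a 10-point difference with a pooled SD of 40.
print(standardized_mean_difference(713, 703, 40, 40, 100, 100))  # 0.25
```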

Standardized mean difference effect sizes are a useful way to characterize intervention effects for some purposes. This effect size metric, however, has very little more inherent meaning than the simple difference between means; it simply transforms that difference into standard deviation units. Interpreting the magnitude or practical significance of an effect size requires that it be compared with appropriate criterion values or standards that are relevant and meaningful for the nature of the outcome variable, sample, and intervention condition on which it is based. We will have more to say about effect sizes and their interpretation later. We raise this matter now only to highlight a widely used but, nonetheless, misleading standard for assessing effect sizes and, at least by implication, their practical significance.

In his landmark book on statistical power, Cohen (1977, 1988) drew on his general impression of the range of effect sizes found in social and behavioral research in order to create examples of power analysis for detecting smaller and larger effects. In that context, he dubbed .20 as "small," .50 as "medium," and .80 as "large." Ever since, these values have been widely cited as standards for assessing the magnitude of the effects found in intervention research despite Cohen's own cautions about their inappropriateness for such general use. Cohen was attempting, in an unsystematic way, to describe the distribution of effect sizes one might find if one piled up all the effect sizes on all the different outcome measures for all the different interventions


targeting individual participants that were reported across the social and behavioral sciences. At that level of generality, one could take any given effect size and say it was in the low, middle, or high range of that distribution.

The problem with Cohen's broad normative distribution for assessing effect sizes is not the idea of comparing an effect size with such norms. Later in this paper we will present some norms for effect sizes from educational interventions and suggest doing just that. The problem is that the normative distribution used as a basis for comparison must be appropriate for the outcome variables, interventions, and participant samples on which the effect size at issue is based. Cohen's broad categories of small, medium, and large are clearly not tailored to the effects of intervention studies in education, much less any specific domain of education interventions, outcomes, and samples. Using those categories to characterize effect sizes from education studies, therefore, can be quite misleading. It is rather like characterizing a child's height as small, medium, or large, not by reference to the distribution of values for children of similar age and gender, but by reference to a distribution for all vertebrate mammals.

McCartney and Rosenthal (2000), for example, have shown that in intervention areas that involve hard-to-change, low-baserate outcomes, such as the incidence of heart attacks, the most impressively large effect sizes found to date fall well below the .20 that Cohen characterized as small. Those "small" effects correspond to reducing the incidence of heart attacks by about half, an effect of enormous practical significance. Analogous examples are easily found in education. For instance, many education intervention studies investigate effects on academic performance and measure those effects with standardized reading or math achievement tests. As we show later in this paper, the effect sizes on such measures across a wide range of interventions are rarely as large as .30. By appropriate norms, that is, norms based on empirical distributions of effect sizes from comparable studies, an effect size of .25 on such outcome measures is large, and an effect size of .50, which would be only medium on Cohen's all-encompassing distribution, would be more like huge.

In short, comparisons of effect sizes in educational research with normative distributions of effect sizes to assess whether they are small, middling, or large relative to those norms should use appropriate norms. Appropriate norms are those based on distributions of effect sizes for comparable outcome measures from comparable interventions targeted on comparable samples. Characterizing the magnitude of effect sizes relative to some other normative distribution is inappropriate and potentially misleading. The widespread indiscriminate use of Cohen's generic small, medium, and large effect size values to characterize effect sizes in domains to which his normative values do not apply is thus likewise inappropriate and misleading.


    Representing Effects Descriptively

The starting point for descriptive representations of the effects of an educational intervention is the set of native statistics generated by whatever analysis scheme has been used to compare outcomes for the participants in the intervention and control conditions. Those statistics may or may not provide a valid estimate of the intervention effect. The quality of that estimate will depend on the research design, sample size, attrition, reliability of the outcome measure, and a host of other such considerations. For purposes of this discussion, we assume that the researcher begins with a credible estimate of the intervention effect and consider only alternate representations or translations of the native statistics that initially describe that effect.

A closely related alternative starting point for a descriptive representation of an intervention effect is the effect size estimate. Although the effect size statistic is not itself much easier to interpret in practical terms than the native statistics on which it is based, it is useful for other purposes. Most notably, its standardized form (i.e., representing effects in standard deviation units) allows comparison of the magnitude of effects on different outcome variables and across different studies. It is thus well worth computing and reporting in intervention studies but, for present purposes, we include it among the initial statistics for which an alternative representation would be more interpretable by most users.

In the following parts of this section of the paper, we first provide advice for configuring the native statistics generated by common analyses in a form appropriate for supporting alternate descriptive representations. We include in that discussion advice for configuring the effect size statistic as well in a few selected situations that often cause confusion.

Configuring the Initial Statistics that Describe an Intervention Effect to Support Alternative Descriptive Representations

    Covariate Adjustments to the Means on the Outcome Variable

Several of the descriptive representations of intervention effects described later are derived directly from the means and perhaps the standard deviations on the outcome variable for the intervention and control groups. However, the observed means for the intervention and control groups may not be the best choice for representing an intervention effect. The difference between those means reflects the effect of the intervention, to be sure, but it may also reflect the influence of any initial baseline differences between the intervention and control groups. The value of random assignment to conditions, of course, is that it permits only chance differences at baseline, but this does not mean there will be no differences, especially if the samples are not large. Moreover, attrition from posttest measurement undermines the initial randomization so that estimates of effects may be based on subsets of the intervention and control samples that are not fully equivalent on their respective baseline characteristics even if the original samples were.

Researchers often attempt to adjust for such baseline differences by including the respective baseline values as covariates in the analysis. The most common and useful covariate is the pretest for the outcome measure, along with basic demographic variables such as age, gender, ethnicity, socioeconomic status, and the like.


Indeed, even when there are no baseline differences to account for, the value of such covariates (especially the pretest) for increasing statistical power is so great that it is advisable to routinely include any covariates that have substantial correlations with the posttest (Rausch, Maxwell, and Kelley 2003). With covariates included in the analysis, the estimation of the intervention effect is the difference between the covariate-adjusted means of the intervention and control samples. These adjusted values better estimate the actual intervention effect by reducing any bias from the baseline differences and thus are the best choices for use in any descriptive representation of that effect. When that representation involves the standard deviations, however, their values should not be adjusted for the influence of the covariates. In virtually all such instances, the standard deviations are used as estimates of the corresponding population standard deviations on the outcome variables without consideration for the particular covariates that may have been used in estimating the difference on the means.

When the analysis is conducted in analysis of covariance format (ANCOVA), most statistical software has an option for generating the covariate-adjusted means. When the analysis is conducted in multiple regression format, the unstandardized regression coefficient for the intervention dummy code (intervention=1, control=0; or +0.5 vs. -0.5) is the difference between the covariate-adjusted means. In education, analyses of intervention effects are often multilevel when the outcome of interest is for students or teachers who, in turn, are nested within classrooms, schools, or districts. Using multilevel regression analysis, e.g., HLM, does not change the situation with regard to the estimate of the difference between the covariate-adjusted means: it is still the unstandardized regression coefficient on the intervention dummy code. The unadjusted standard deviations for the intervention and control groups, in turn, can be generated directly by most statistical programs, though that option may not be available within the ANCOVA, multiple regression, or HLM routine itself.

For binary outcomes, such as whether students are retained in grade, placed in special education status, or pass an exam, the analytic model is most often logistic regression, a specialized variant of multiple regression for binary dependent variables. The regression coefficient (β) in a logistic regression for the dummy-coded variable representing the experimental condition (e.g., 1 = intervention, 0 = control) is a covariate-adjusted log odds ratio representing the intervention effect (Crichton 2001). Unlogging it (exp(β)) produces the covariate-adjusted odds ratio for the intervention effect, which can then be converted back into the terms of the original metric.

For example, an intervention designed to improve the passing rate on an algebra exam might produce the results shown below. The odds of passing for a given group are defined as the ratio of the number (or proportion) who pass to the number (or proportion) who fail. For the intervention group, therefore, the odds of passing are 45/15 = 3.0 and, for the control group, the odds are 30/30 = 1.0. The odds ratio characterizing the intervention effect is the ratio of these two values, that is, 3/1 = 3, and indicates that the odds of passing are three times greater for a student in the intervention group than for one in the control group.


               Passed   Failed
Intervention     45       15
Control          30       30
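The odds arithmetic for this table is simple enough to reproduce directly; this short sketch uses the counts from the example:

```python
# Odds and odds ratio from the 2x2 table above.
treat_pass, treat_fail = 45, 15
control_pass, control_fail = 30, 30

treat_odds = treat_pass / treat_fail        # 45/15 = 3.0
control_odds = control_pass / control_fail  # 30/30 = 1.0
odds_ratio = treat_odds / control_odds
print(odds_ratio)  # odds of passing are 3 times greater with the intervention
```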

Suppose the researcher analyzes these outcomes in a logistic regression model with race, gender, and prior math achievement scores included as covariates to control for initial differences between the two groups. If the coefficient on the intervention variable in that analysis, converted to a covariate-adjusted odds ratio, turns out to be 2.53, it indicates that the unadjusted odds ratio overestimated the intervention effect because of baseline differences that favored the intervention group. With this information, the researcher can construct a covariate-adjusted version of the original 2x2 table that estimates the proportions of students passing in each condition when the baseline differences are taken into account. To do this, the frequencies for the control sample and the total N for the intervention sample are taken as given. We then want to know what passing frequency, p, for the intervention group allows the odds ratio, (p × 30)/((60 − p) × 30), to equal 2.53. Solving for p reveals that it must be 43. The covariate-adjusted results, therefore, are as shown below. Described as simple percentages, the covariate-adjusted estimate is that the intervention increased the 50% pass rate of the control condition to 72% (43/60) in the intervention condition.

               Passed   Failed
Intervention     43       17
Control          30       30
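The back-calculation of p can be sketched in a few lines of code. This is a hypothetical helper, with the counts and the 2.53 adjusted odds ratio taken from the running example:

```python
def adjusted_pass_count(adj_odds_ratio, n_treat, control_pass, control_fail):
    """Solve (p / (n_treat - p)) / (control_pass / control_fail) = adj_odds_ratio for p."""
    target_odds = adj_odds_ratio * (control_pass / control_fail)
    return n_treat * target_odds / (1 + target_odds)

p = adjusted_pass_count(2.53, n_treat=60, control_pass=30, control_fail=30)
print(round(p))          # adjusted passing frequency for the intervention group
print(round(p / 60, 2))  # adjusted pass rate
```

With these inputs the solver returns approximately 43, matching the covariate-adjusted table above.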

    Identifying or Obtaining Appropriate Effect Size Statistics

A number of the ways of representing intervention effects and assessing their practical significance described later in this paper can be derived directly from the standardized mean difference effect size statistic, commonly referred to simply as the effect size. This effect size is defined as the difference between the mean of the intervention group and the mean of the control group on a given outcome measure divided by the pooled standard deviations for those two groups, as follows:

    ES = (X̄_T − X̄_C) / s_p

where X̄_T is the mean of the intervention sample on an outcome variable, X̄_C is the mean of the control sample on that variable, and s_p is the pooled standard deviation. The pooled standard deviation is obtained as the square root of the weighted mean of the two variances, defined as:

    s_p = √[ ((n_T − 1)s_T² + (n_C − 1)s_C²) / (n_T + n_C − 2) ]

where n_T and n_C are the number of respondents in the intervention and control groups, and s_T and s_C are the respective standard deviations on the outcome variable for the intervention and control groups.
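As a sketch, the two formulas can be computed directly from raw scores. The score lists here are invented for illustration:

```python
from math import sqrt
from statistics import mean, stdev

def pooled_sd(treat, control):
    """Square root of the weighted mean of the two sample variances."""
    n_t, n_c = len(treat), len(control)
    return sqrt(((n_t - 1) * stdev(treat) ** 2 + (n_c - 1) * stdev(control) ** 2)
                / (n_t + n_c - 2))

def effect_size(treat, control):
    """Standardized mean difference: (mean_T - mean_C) / pooled SD."""
    return (mean(treat) - mean(control)) / pooled_sd(treat, control)

print(round(effect_size([12, 15, 14, 18, 16], [10, 13, 11, 14, 12]), 2))
```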


The effect size is typically reported to two decimal places and, by convention, has a positive value when the intervention group does better on the outcome measure than the control group and a negative sign when it does worse. Note that this may not be the same sign that results from subtraction of the control mean from the intervention mean. For example, if low scores represent better performance, e.g., as with a measure of the number of errors made, then subtraction will yield a negative value when the intervention group performs better than the control, but the effect size typically would be given a positive sign to indicate the better performance of the intervention group.

Effect sizes can be computed or estimated from many different kinds of statistics generated in intervention studies. Informative sources for such procedures include the What Works Clearinghouse Procedures and Standards Handbook (2011; Appendix B) and Lipsey and Wilson (2001; Appendix B). Here we will only highlight a few features that may help researchers identify or configure appropriate effect sizes for use in deriving alternative representations of intervention effects. Moreover, many of these features have implications for statistics other than the effect size that are involved in some representations of intervention effects.

A clear understanding of what the numerator and denominator of the standardized mean difference effect size represent will allow many common mistakes and confusions in the computation and interpretation of effect sizes to be avoided. The numerator of the effect size estimates the difference between the experimental groups on the means of the outcome variable that is attributable to the intervention. That is, the numerator should be the best estimate available of the mean intervention effect estimated in the units of the original metric. As described in the previous subsection, when researchers include baseline covariates in the analysis, the best estimate of the intervention effect is the difference between the covariate-adjusted means on the outcome variable, not the difference between the unadjusted means.

The purpose of the denominator of the effect size is to standardize the difference between the outcome means in the numerator into metric-free standard deviation units. The concept of standardization is critical here. Standardization means that each effect size is represented in the same way, i.e., in a standard way, irrespective of the outcome construct, the way it is measured, or the way it is analyzed. The sample standard deviations used for this purpose estimate the corresponding population standard deviations on the outcome measure. As such, the standard deviations should not be adjusted by any covariates that happened to be used in the design or analysis of the particular study. Such adjustments would not have general applicability to other designs and measures and thus would compromise the standardization that is the point of representing the intervention effect in standard deviation units. This means that the raw standard deviations for the intervention and control samples should be pooled into the effect size denominator, even when multilevel analysis models with complex variance structures are used.

Pooling the sample standard deviations for the intervention and control groups is intended to provide the best possible estimate of the respective population standard deviation by using all the data available. This procedure assumes that both those standard deviations estimate a common population standard deviation. This is the homogeneity of variance assumption typically made in the statistical analysis of intervention effects. If homogeneity of variance cannot be assumed, then consideration has to be given to the reason why


the intervention and control group variances differ. In a randomized experiment, this should not occur on outcome variables unless the intervention itself affects the variance in the intervention condition. In that case, the better estimate may be the standard deviation of the control group even though it is estimated on a smaller sample than the pooled version.

In the multilevel situations common in education research, a related matter has to do with the population that is relevant for purposes of standardizing the intervention effect. Consider, for example, outcomes on an achievement test that is, or could be, used nationally. The variance for the national population of students can be partitioned into between and within components according to the different units represented at different levels. Because state education systems differ, we might first distinguish between-state and within-state variance. Within states, there would be variation between districts; within districts, there would be variation between schools; within schools, there would be variation between classrooms; and within classrooms, there would be variation between students. The total variance for the national population can thus be decomposed as follows (Hedges 2007):

    σ²_total = σ²_between-states + σ²_between-districts + σ²_between-schools + σ²_between-classrooms + σ²_between-students

In an intervention study using a national sample, the sample estimate of the standard deviation includes all these components. Any effect size computed with that standard deviation is thus standardizing the effect size with the national population variance as the reference value. The standard deviation computed in a study using a sample of students from a single classroom, on the other hand, estimates only the variance of the population of students who might be in that classroom in that school in that district in that state. In other words, this standard deviation does not include the between-classroom, between-school, between-district, and between-state components that would be included in the estimate from a national sample. Similarly, an intervention study that draws its sample from one school, or one district, will yield a standard deviation estimate that is implicitly using a narrower population as the basis for standardization than a study with a broader sample. This will not matter if there are no systematic differences on the respective outcome measure between students in different states, districts, schools, and classrooms, i.e., if those variance components are zero. With student achievement measures, we know this is generally not the case (e.g., Hedges and Hedberg 2007). Less evidence is available for other measures used in education intervention studies, but it is likely that most of them also show nontrivial differences between these different units and levels.

Any researcher computing effect sizes for an intervention study or using them as a basis for alternative representations of intervention effects should be aware of this issue. Effect sizes based on samples of narrower populations will be larger than effect sizes based on broader samples even when the actual magnitudes of the intervention effects are identical. And that difference will be carried through to any other representation of the intervention effect that is based on the effect size. Compensating for that difference, if appropriate, will require adding or subtracting estimates of the discrepant variance components, with the possibility that those components will have to be estimated from sources outside the research sample itself.


The discussion above assumes that the units on which the sample means and standard deviations are computed for an outcome variable are individuals, e.g., students. The nested data structures common in education intervention studies, however, provide different units on which means and standard deviations can be computed, e.g., students, clusters of students in classrooms, and clusters of classrooms in schools. For instance, in a study of a whole school intervention aimed at improving student achievement, with some schools assigned to the intervention condition and others to the control, there are two effect sizes the researcher could estimate. The conventional effect size would standardize the intervention effect estimated on student scores using the pooled student-level standard deviations. Alternatively, the student-level scores might be aggregated to the school level and the school-level means could be used to compute an effect size. That effect size would represent the intervention effect in standard deviation units that reflect the variance between schools, not that between students. The result is a legitimate effect size, but the school units on which it is based make this effect size different from the more conventional effect size that is standardized on variation between individuals.

The numerators of these two effect sizes would not necessarily differ greatly. The respective means of the student scores in the intervention and control groups would be similar to the means of the school-level means for those same students unless the number of students in each school differs greatly and is correlated with the school means. However, the standard deviations will be quite different because the variance between schools is only one component of the total variance between students. Between-school variance on achievement test scores is typically around 20-25% of the total variance, the intraclass correlation coefficient (ICC) for schools (Hedges and Hedberg 2007). The between-schools standard deviation thus will be about √.25 = .50 or less of the student-level standard deviation, and the effect size based on school units will be about twice as large as the effect size based on students as the units even though both describe the same intervention effect.
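A minimal sketch of that scaling, assuming an illustrative ICC of .25 and identical numerators for the two effect sizes:

```python
from math import sqrt

def school_level_es(student_level_es, icc):
    """Rescale a student-level effect size to school-level standard deviation units.

    If between-school variance is a fraction `icc` of total student-level variance,
    the between-school SD is sqrt(icc) times the student-level SD, so the
    school-level effect size is larger by a factor of 1 / sqrt(icc).
    """
    return student_level_es / sqrt(icc)

print(school_level_es(0.25, icc=0.25))  # a .25 student-level ES becomes .50
```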

Similar situations arise in multilevel samples whenever the units on which the outcome is measured are nested within higher level clusters. Each such higher level cluster allows for its own distinctive effect size to be computed. A researcher comparing effect sizes in such situations or, more to the point for present purposes, using an effect size to derive other representations of intervention effects, must know which effect size is being used. An effect size standardized on a between-cluster variance component will nearly always be larger than the more conventional effect size standardized on the total variance across the lower level units on which the outcome was directly measured. That difference in numerical magnitude will then be carried into any alternate representation of the intervention effect based on that effect size, and the results must be interpreted accordingly.

    Descriptive Representations of Intervention Effects

    Representation in Terms of the Original Metric

Before looking at different ways of transforming the difference between the means of the intervention and control samples into a different form, we should first consider those occasional situations in which differences on the original metric are easily understood without such manipulations. This occurs when the


units on the measure are sufficiently familiar and well defined that little further description or interpretation is needed. For example, an outcome measure for a truancy reduction program might be the proportion of days on which attendance was expected for which the student was absent. The outcome score for each student, therefore, is a simple proportion, and the corresponding value for the intervention or control groups is the mean proportion of days absent for the students in that group. Common events of this sort in education that can be represented as counts or proportions include dropping out of school, being expelled or suspended, being retained in grade, being placed in special education status, scoring above a proficiency threshold on an achievement test, completing assignments, and so forth.

Intervention effects on outcome measures that involve well recognized and easily understood events can usually be readily interpreted in their native form by researchers, practitioners, and policymakers. Some caution is warranted, nevertheless, in presenting the differences between intervention and control groups in terms of the proportions of such events. Differences between proportions can have different implications depending on whether those differences are viewed in absolute or relative terms. Consider, for example, a difference of three percentage points between the intervention and control groups in the proportion suspended during the school year. Viewed in absolute terms, this appears to be a small difference. But relative to the suspension rate for the control sample, a three point decrease might be substantial. If the suspension rate for the control sample is only 5%, for instance, a decrease of three percentage points reduces that rate by more than half. On the other hand, if the control sample has a suspension rate of 40%, a reduction of three percentage points might rightly be viewed as rather modest.

In some contexts, the numerical values on an outcome measure that does not represent familiar events may still be sufficiently familiar that differences are well understood despite having little inherent meaning. This might be the case, for instance, with widely used standardized tests. For example, the Peabody Picture Vocabulary Test (PPVT; Dunn and Dunn 2007), one of the most widely used tests in education, is normed so that standard scores have a mean of 100 for the general population of children at any given age. Many researchers and educators have sufficient experience with this test to understand what scores lower or higher than 100 indicate about children's skill level and how much of an increase constitutes a meaningful improvement. Generally speaking, however, such familiarity with the scoring of a particular measure of this sort is not widespread, and most audiences will need more information to be able to interpret intervention effects expressed in terms of the values generated by an outcome measure.

Intervention Effects in Relation to Pre-Post Change

When pretest measures of an outcome variable are available, the pretest means may be used to provide an especially informative representation of intervention effects using the original metric. This follows from the fact that the intent of interventions is to bring about change in the outcome; that is, change between pretest and posttest. The full representation of the intervention effect, therefore, is not simply the difference between the intervention and control samples on the outcome measure at posttest, but the differential change between pretest and posttest on that outcome. By showing effects as differential change, the researcher reveals not only the end result but the patterns of improvement or decline that characterize the intervention and control groups.


Consider, for example, a program that instructs middle school students in conflict resolution techniques with the objective of decreasing interpersonal aggression. Student surveys at the beginning and end of the school year are administered for intervention and control schools that provide composite scores for the amount of physical, verbal, and relational aggression students experience. These surveys show significantly lower levels for the intervention schools than the control schools, indicating a positive effect of the conflict resolution program, say a mean of 23.8 for the overall total score for the students in the intervention schools and 27.4 for the students in the control schools. That 3.6 point favorable outcome difference, however, could have come from any of a number of different patterns of change over the school year for the intervention and control schools. Table 1 below shows some of the possibilities, all of which assume an effective randomization so that the pretest values at the beginning of the school year were virtually identical for the intervention and control schools.

Table 1. Pre-post change differentials that result in the same posttest difference

                  Scenario A           Scenario B           Scenario C
                Pretest  Posttest    Pretest  Posttest    Pretest  Posttest
Intervention     25.5     23.8        17.7     23.8        22.9     23.8
Control          25.6     27.4        17.6     27.4        23.0     27.4

As can be seen even more clearly in Figure 1, for Scenario A the aggression levels decreased somewhat in the intervention schools while increasing in the control schools. In Scenario B, the aggression levels increased quite a bit (at least relative to the intervention effect) in both samples, but the amount of the increase was not as great in the intervention schools as in the control schools. In Scenario C, on the other hand, there was little change in the reported level of aggression over the course of the year in the intervention schools, but things got much worse during that time in the control schools. These different patterns of differential pre-post change depict different trajectories for aggression absent intervention and give different pictures of what it is that the intervention accomplished. In Scenario A it reversed the trend that would have otherwise occurred. In Scenario B, it ameliorated an adverse trend, but did not prevent it from getting worse. In Scenario C, the intervention did not produce appreciable improvement over time, but kept the amount of aggression from getting worse.


Figure 1. Pre-post change for the three scenarios with the same posttest difference

[Three line charts, one per scenario (A, B, C), each plotting the intervention and control group means (vertical axis: aggression score, 15 to 31) at pretest and posttest]

As this example illustrates, a much fuller picture of the intervention effect is provided when the difference between the intervention and control samples on the outcome variable is presented in relation to where those samples started at the pretest baseline. A finer point can be put on the differential change for the intervention and control groups, if desired, by proportioning the intervention effect against the control group pre-post change. In Scenario B above, for instance, the difference between the control group's pretest and posttest composite aggression scores is 9.8 (27.4 - 17.6) while the posttest difference between the intervention and control group is -3.6 (23.8 - 27.4). The intervention, therefore, reduced the pre-post increase in the aggression score by 36.7% (-3.6/9.8).
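The Scenario B arithmetic can be sketched directly:

```python
# Scenario B values from Table 1 (aggression composite scores).
control_pre, control_post = 17.6, 27.4
treat_post = 23.8

control_change = control_post - control_pre      # 9.8-point increase absent intervention
intervention_effect = treat_post - control_post  # -3.6-point posttest difference

reduction = intervention_effect / control_change
print(f"{reduction:.1%}")  # proportion of the adverse pre-post change prevented
```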

    Overlap Between Intervention and Control Distributions

If the distributions of scores on an outcome variable were plotted separately for the intervention and control samples, they might look something like Figure 2 below. The magnitude of the intervention effect is represented directly by the difference between the means of the two distributions. The standardized mean difference effect size, discussed earlier, also represents the difference between the two means, but does so in standard deviation units. Still another way to represent the difference between the outcomes for the intervention and control groups is in terms of the overlap between their respective distributions. When the difference between the means is larger, the overlap is smaller; when the difference between the means is smaller, the overlap is larger. The amount of overlap, in turn, can be described in terms of the proportion of individuals in each distribution who are above or below a specified reference point on one of the distributions. Proportions of this sort are often easy to understand and appraise and, therefore, may help communicate the magnitude of the effect. Various ways to take advantage of this circumstance in the presentation of intervention effects are described below.


    Figure 2. Intervention and control distributions on an outcome variable

Intervention Effects Represented in Percentile Form

For outcomes assessed with standardized measures that have norms, a simple representation of the intervention effect is to characterize the means of the control and intervention samples according to the percentile values they represent in the norming distribution. For a normed, standardized measure, these values are often provided as part of the scoring scheme for that measure. On a standardized achievement test, for instance, suppose that the mean of the control sample fell at the 47th percentile according to the test norms for the respective age group and the mean of the intervention sample fell at the 52nd percentile. This tells us, first, that the mean outcome for the control sample is somewhat below average performance (50th percentile) relative to the norms and, second, that the effect of the intervention was to improve performance to the point where it was slightly above average. In addition, we see that the corresponding increase was 5 percentile points. That 5 percentile point difference indicates that the individuals receiving the intervention, on average, have now caught up with the 5% of the norming population that otherwise scored just above them.

In their study of Teach For America, Decker, Mayer, and Glazerman (2004) used percentiles in this way to characterize the statistically significant effect they found on student math achievement scores. Decker et al. also reported the pretest means as percentiles so that the relative gain of the intervention sample was evident. This representation revealed that the students in the control sample were at the 15th percentile at both the pretest and posttest whereas the intervention sample gained 3 percentiles by moving from the 14th to the 17th percentile.

It should be noted that the percentile differences on the norming distribution that are associated with a given difference in scores will vary according to where the scores fall in the distribution. Table 2 shows the percentile levels for the mean score of the lower scoring experimental group (e.g., control group when its mean score is lower than that of the treatment group) in the first column. The numbers in the body of the table then show the corresponding percentile level of the other group (e.g., treatment) that are associated


with a range of score differences represented in standard deviation units (which therefore means these differences can also be interpreted as standardized mean difference effect sizes). As shown there, if one group scores at the 50th percentile and the other has a mean score that is .50 standard deviations higher, that higher group will be at the 69th percentile, for a difference of 19 in the percentile ranking. A .50 standard deviation difference between a group scoring at the 5th percentile and a higher scoring group will put that other group at the 13th percentile, for a difference of only 8 in the percentile ranking. This same pattern for differences between two groups also applies to pre-post gains for one group. Researchers should therefore be aware that intervention effects represented as percentile differences or gains on a normative distribution can look quite different depending on whether the respective scores fall closer to the middle or the extremes of the distribution.

Table 2. Upper percentiles for selected differences or gains from a lower percentile

Lower          Difference or Gain in Standard Deviations
Percentile      .10    .20    .50    .80   1.00
5th              6      7     13     20     26
10th            12     14     22     32     39
15th            17     20     30     41     48
25th            28     32     43     54     62
50th            54     58     69     79     84
75th            78     81     88     93     95
85th            87     89     94     97     98
90th            92     93     96     98     99
95th            96     97     98     99     99

NOTE: Table adapted from Albanese (2000).

A similar use of percentiles can be applied to the outcome scores in the intervention and control groups when those scores are not referenced to a norming distribution. The distribution of scores for the control group, which represents the situation in the absence of any influence from the intervention under study, can play the role of the norming distribution in this application. The proportion of scores falling below the control group and intervention group means can then be transformed into the corresponding percentile values on the control distribution. These values can be obtained from the cumulative frequency tables that most statistical analysis computer programs readily produce for the values on any variable. For a symmetrical distribution, the mean of the control sample will be at the 50th percentile (the median). The mean score for the intervention sample can then be represented in terms of its percentile value on that same control distribution. Thus we may find that the mean for the intervention group falls at the 77th percentile of the control distribution, indicating that its mean is now higher than 77% of the scores in the control sample. With a control group mean at the 50th percentile, another way of describing the difference is that the intervention has moved 27% of the sample from a score below the control mean to one above that mean.
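A sketch of this computation from raw scores; the score list and intervention mean are invented for illustration:

```python
# Percentile rank of the intervention group's mean on the control distribution.
control_scores = [48, 52, 55, 57, 60, 61, 63, 66, 70, 74]
intervention_mean = 65.0

below = sum(score < intervention_mean for score in control_scores)
percentile = 100 * below / len(control_scores)
print(percentile)  # percent of control scores falling below the intervention mean
```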


This comparison is shown in Figure 3.

Figure 3. Percentile values on the control distribution of the means of the control and intervention groups

Intervention Effects Represented as Proportions Above or Below a Reference Value

The representation of an intervention effect as percentiles on a reference distribution, as described above, is based on the proportions of the respective groups above and below a specific threshold value on that reference distribution. A useful variant of this approach is to select an informative threshold value on the control distribution and depict the intervention effect in terms of the proportion of intervention cases above or below that value in comparison to the corresponding proportion of control cases. The result then indicates how many more of the intervention cases are in the desirable range defined by that threshold than expected without the intervention.

    When available, the most meaningul threshold value or comparing proportions o intervention and control

    cases is one externally dened to have substantive meaning in the intervention context. Such threshold

    values are oten dened or criterion-reerenced tests. For example, thresholds have been set or the National

    Assessment o Educational Progress (NAEP) achievement tests with cuto scores that designate Basic,

    Proicient, andAdvancedlevels o perormance. On the NAEP math achievement test, or instance, scores

    between 299 and 333 are identied as indicating that 8th grade students are procient. I we imagine that

    we might assess a math intervention using the NAEP test, we could compare the proportion o students

    in the intervention versus control conditions who scored 300 or above

    that is, were at least minimallyprocient. Figure 4 shows what the results might look like. In this example, 36% o the control students

    scored above that threshold level whereas 45% o the intervention students did so.

    Figure 4. Proportion of the control and intervention distributions scoring above an externally defined proficiency threshold score
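A comparison like the one in Figure 4 reduces to counting cases at or above the cutoff in each group. The sketch below uses hypothetical score lists; only the 300-point proficiency cutoff comes from the NAEP example above.

```python
def prop_above(scores, threshold):
    """Proportion of scores at or above the threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)

# Hypothetical 8th-grade math scores for each group:
control = [280, 310, 265, 295, 305, 270, 290, 285, 320, 275]
intervention = [305, 315, 285, 300, 310, 280, 295, 290, 325, 270]

p_c = prop_above(control, 300)        # 0.3
p_i = prop_above(intervention, 300)   # 0.5
print(round(p_i - p_c, 2))            # 0.2 more of the intervention cases proficient
```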

    Similar thresholds might be available from the norming data for a standardized measure. For example, the mean standard score for the Peabody Picture Vocabulary Test (PPVT; Dunn and Dunn 2007) is 100, which is, therefore, the mean age-adjusted score in the norming sample. Assuming representative norms, that score represents the population average for children of any given age. For an intervention with the PPVT as an outcome measure, the intervention effect could be described in terms of the proportion of children in the intervention versus control samples scoring 100 or above. If the proportion for either sample is at least .50, it tells us that their performance is average for their age. Suppose that for a control group, 32% scored 100 or above at posttest, identifying them as a low performing sample. If 38% of the intervention group scored 100 or above, we see that the effect of the intervention has been to move 6% of the children from the below average to the above average range. At the same time, we see that this has not been sufficient to close the gap between them and normative performance on this measure.

    With a little effort, researchers may be able to identify meaningful threshold values for measures that do not already have one defined. Consider a multi-item scale on which teachers rate the problem behavior of students in their classrooms. When pretest data are collected on this measure, the researcher might also ask each teacher to nominate several children who are borderline: not presenting significant behavior problems, but close to that point. The scores of those children could then be used to identify the approximate point on the rating scale at which teachers begin to view the classroom behavior of a child as problematic. That score then provides a threshold value that allows the researcher to describe the effects of, say, a classroom behavior management program in terms of how many fewer students in the intervention condition than the control

    condition fall in the problem range. Differences on the means of an arbitrarily scored multi-item rating scale, though critical to the statistical analysis, are not likely to convey the magnitude of the effect as graphically as this translation into proportions of children above a threshold teachers themselves identify.

    Absent a substantively meaningful threshold value, an informative representation of the intervention effect might still be provided with a generic threshold value. Cohen (1988), for instance, used the control group mean as a general threshold value to create an index he called U3, one of several indices he proposed to describe the degree of non-overlap between control and intervention distributions. The example shown in Figure 3, presented earlier to illustrate the use of percentile values, similarly made the control group mean the key reference value.

    With the actual scores in hand for the control and intervention groups, it is straightforward for a researcher to determine the proportion of each above (or below) the control mean. Assuming normal distributions, those proportions and the corresponding percentiles for the control and intervention means can easily be linked to the standardized mean difference effect size through a table of areas under the normal curve. The mean of a normally distributed control sample is at the 50th percentile with a z-score of zero. Adding the standardized mean difference effect size to that z-score then identifies the z-score of the intervention mean on the control distribution. With a table of areas under the normal curve, that z-score, in turn, can be converted to the equivalent percentile and proportions in the control distribution. Table 3 shows the proportion of intervention cases above the control sample mean for different standardized mean difference effect size values, assuming normal distributions (Cohen's (1988) U3 index). In each case, the increase over .50 indicates the additional proportion of the cases that the intervention has pushed above that control condition mean.
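Under the normality assumption, the table lookup can be replaced by the normal cumulative distribution function: U3 is simply the normal CDF evaluated at the effect size. A short sketch:

```python
from math import erf, sqrt

def u3(d):
    """Proportion of intervention cases above the control mean (Cohen's U3),
    assuming normal distributions: the normal CDF at z = d."""
    return 0.5 * (1 + erf(d / sqrt(2)))

for d in (0.10, 0.50, 1.00, 2.00):
    print(d, round(u3(d), 2))   # .54, .69, .84, .98, matching the rows of Table 3
```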

    Rosenthal and Rubin (1982) described yet another generic threshold for comparing the relative proportions of the control and intervention groups attaining it within a framework they called the Binomial Effect Size Display (BESD). In this scheme, the key success threshold value is the grand median of the combined intervention and control distributions. When there is no intervention effect, the means of both the intervention and control distributions fall at that grand median. As the intervention effect gets larger and the intervention and control distributions separate, smaller proportions of the control distribution and larger proportions of the intervention distribution fall above that grand median. Figure 5 depicts this situation.

    Table 3. Proportion of intervention cases above the mean of the control distribution

    Effect Size   Proportion above      Effect Size   Proportion above
                  the Control Mean                    the Control Mean
    .10           .54                   1.30          .90
    .20           .58                   1.40          .92
    .30           .62                   1.50          .93
    .40           .66                   1.60          .95
    .50           .69                   1.70          .96
    .60           .73                   1.80          .96
    .70           .76                   1.90          .97
    .80           .79                   2.00          .98
    .90           .82                   2.10          .98
    1.00          .84                   2.20          .99
    1.10          .86                   2.30          .99
    1.20          .88                   2.40          .99

    Figure 5. Binomial effect size display: Proportion of cases above and below the grand median

    Using the grand median as the threshold value makes the proportion of the intervention sample above the threshold value equal to the proportion of the control sample below that value. The difference between these proportions, which Rosenthal and Rubin called the BESD Index, indicates how many more intervention cases are above the grand median than control cases. Assuming normal distributions, the BESD can also be linked to the standardized mean difference effect size. An additional and occasionally convenient feature of the BESD is that it is equal to the effect size expressed as a correlation; that is, the correlation between the treatment variable (coded as 1 vs. 0) and the outcome variable. Many researchers are more familiar with

    correlations than standardized mean differences, so the magnitude of the effect expressed as a correlation may be somewhat more interpretable for them. Table 4 shows the proportions above and below the grand median and the BESD as the intervention effect sizes get larger. It also shows the corresponding correlational equivalents for each effect size and BESD.

    Table 4. Relationship of the effect size and correlation coefficient to the BESD

    Effect Size    r      Proportion of control/       BESD (difference between
                          intervention cases above     the proportions)
                          the grand median
    .10            .05    .47 / .52                    .05
    .20            .10    .45 / .55                    .10
    .30            .15    .42 / .57                    .15
    .40            .20    .40 / .60                    .20
    .50            .24    .38 / .62                    .24
    .60            .29    .35 / .64                    .29
    .70            .33    .33 / .66                    .33
    .80            .37    .31 / .68                    .37
    .90            .41    .29 / .70                    .41
    1.00           .45    .27 / .72                    .45
    1.10           .48    .26 / .74                    .48
    1.20           .51    .24 / .75                    .51
    1.30           .54    .23 / .77                    .54
    1.40           .57    .21 / .78                    .57
    1.50           .60    .20 / .80                    .60
    1.60           .62    .19 / .81                    .62
    1.70           .65    .17 / .82                    .65
    1.80           .67    .16 / .83                    .67
    1.90           .69    .15 / .84                    .69
    2.00           .71    .14 / .85                    .71
    2.10           .72    .14 / .86                    .72
    2.20           .74    .13 / .87                    .74
    2.30           .75    .12 / .87                    .75
    2.40           .77    .11 / .88                    .77
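Assuming normal distributions and equal group sizes, the entries in Table 4 follow from two small formulas: the correlation equivalent of the standardized mean difference is r = d / sqrt(d^2 + 4), and the BESD success proportions are .50 minus and plus r/2. A sketch:

```python
from math import sqrt

def besd(d):
    """Correlation equivalent of d, and the BESD proportions above the
    grand median for the control and intervention groups (equal n assumed)."""
    r = d / sqrt(d * d + 4)
    return r, 0.5 - r / 2, 0.5 + r / 2

r, p_control, p_intervention = besd(1.00)
print(round(r, 2))                                    # .45, as in Table 4
print(round(p_control, 2), round(p_intervention, 2))  # about .28 / .72
```

Some proportions in Table 4 appear to be truncated rather than rounded, so individual entries may differ from this sketch by .01.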

    All the variations on representing the proportions of the intervention and control group distributions above or below a threshold value require dichotomizing the respective distributions of scores. It should be noted that we are not advocating that the statistical analysis be conducted on any such dichotomized data. It is well known that such crude dichotomizations discard useful data and generally weaken the analysis (Cohen 1983; MacCallum et al. 2002). What is being suggested is that, after the formal statistical analysis

    is done and the results are known, depicting intervention effects in one of the ways described here may communicate their magnitude and practical implications better than means, standard deviations, t-test values, and the other native statistics that result directly from the analysis.

    In applying any of these techniques, some consideration should be given to the shape of the respective distributions. When the outcome scores are normally distributed, the application of these techniques is relatively tidy and straightforward. When the data are not normally distributed, the respective empirical distributions can always be dichotomized to determine what proportions of cases are above or below any given reference value of interest, but the linkage between those proportions and other representations may be problematic or misleading. Percentile values, and differences in those values, for instance, take on quite a different character in skewed distributions with long tails than in normal distributions, as do the standard deviation units in which standardized mean difference effect sizes are represented.

    Standard Scores and Normal Curve Equivalents (NCE)

    Standard scores are a conversion of the raw scores on a norm-referenced test that draws upon the norming sample used by the test developer to characterize the distribution of scores expected from the population for which the test is intended. A linear transform of the raw scores is applied to produce tidier numbers for the mean and standard deviation. For many standardized measures, for instance, the standard score mean may be set at 100 with a standard deviation of 15.

    Presenting intervention effects in terms of standard scores can make those effects easier to understand in some regards. For example, the mean scores for the intervention and control groups can be easily assessed in relation to the mean for the norming sample. Mean scores below the standardized mean score, e.g., 100, indicate that the sample, on average, scores below the mean for the population represented in the norming sample. Similarly, a standard score mean of, say, 95 for the control group and 102 for the intervention group indicates that the effect of the intervention was to improve the scores of an underperforming group to the point where their scores were more typical of the average performance of the norming sample.
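Because standard scores fix the norming-sample mean and standard deviation (here 100 and 15), a group difference in standard-score units converts directly into norming-sample standard deviation units. The 95 and 102 means are the illustrative values from the text; the conversion itself is just:

```python
NORM_MEAN, NORM_SD = 100, 15   # typical standard-score scaling

def effect_in_norm_sd(control_mean, intervention_mean):
    """Group difference expressed in norming-sample SD units."""
    return (intervention_mean - control_mean) / NORM_SD

print(round(effect_in_norm_sd(95, 102), 2))   # 0.47 SDs of the norming sample
```

Note that this uses the norming-sample standard deviation rather than the pooled sample standard deviation of the usual standardized mean difference, so the two effect sizes need not agree.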

    An important characteristic of standard scores for tests and measures used to assess student performance is that those scores are typically adjusted for the age of the respective students. The population represented in the norming sample from which the standard scores are derived is divided into age or school grade groups and the standard scores are determined for each group. Thus the standard scores for, say, the students in the norming sample who are in the fourth grade and average 9 years of age may be scaled to have a mean of 100 and a standard deviation of 15, but so will the standard scores for the students in the sixth grade with an average age of 11 years. Different standardized measures may use different age groups for this purpose, e.g., differing by as little as a month or two or as much as a year or more.

    These age adjustments of standard scores have implications for interpreting changes in those scores over time because those changes are depicted relative to the change for same-aged groups in the norming sample. A control sample with a mean standard score of 87 on the pretest and a mean score of 87 on the posttest a year later has not failed to make gains but, rather, has simply kept abreast of the differences by age in the norming sample. On the other hand, an intervention group with a mean pretest standard score of 87 and mean

    posttest score of 95 has improved at a rate faster than that represented in the comparable age differences in the norming sample. This characteristic allows some interpretation of the extent to which intervention effects accelerate growth, though that depends heavily on the assumption that the sample used in the intervention study is representative of the norming sample used by the test developer.

    Reporting intervention effects in standard score units thus has some modest advantages for interpretability because of the implicit comparison with the performance of the norming sample. Moreover, the means and standard deviations for standard scores are usually assigned simple round numbers that are easy to remember when making such comparisons. In other ways standard scores are not so tidy. Most notably, standard scores typically have a rather odd range. With a normal distribution encompassing more than 99% of the scores within 3 standard deviations, standard scores with a mean of 100 and a standard deviation of 15 will range from about 55 at the lowest to about 145 at the highest. These are not especially intuitive numbers for the bottom and top of a measurement scale. For this reason, researchers may prefer to represent treatment effects in terms of some variant of standard scores. One such variant that is well known in education is the normal curve equivalent.

    Normal curve equivalents. Normal curve equivalents (NCE) are a metric developed in 1976 for the U.S. Department of Education for reporting scores on norm-referenced tests and allowing comparison across tests (Hills 1984; Tallmadge and Wood 1976). NCE scores are standard scores based on an alternative scaling of the z-scores for measured values in a normal distribution derived from the norming sample for the measure. Unlike the typical standard score, as described above, NCE scores are scaled so that they range from a low around 0 to a high of around 100, with a mean of 50. NCE scores, therefore, allow scores, differences in scores, and changes in scores to be appraised on a 100-point scale that starts at zero.

    NCE scores are computed by first transforming the original raw scores into normalized z-scores. The z-score is the original score minus the mean for all the scores divided by the standard deviation; it indicates the number of standard deviations above or below a mean of zero that the score represents. The NCE score is then computed as NCE = 21.06(z-score) + 50; that is, 21.06 times the z-score plus 50. This produces a set of NCE scores with a mean of 50 and a standard deviation of 21.06. Note that the standard deviation for NCE scores is not as tidy as the round number typically used for other standard scores, but it is required to produce the other desirable characteristics of NCE scores.
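The NCE = 21.06(z-score) + 50 computation above is easy to sketch. The example assumes a hypothetical measure normed at mean 100 and SD 15 and, for simplicity, a normal norming distribution so that the linear z-score serves as the normalized z-score.

```python
def to_nce(raw, norm_mean, norm_sd):
    """NCE score from a raw score: 21.06 times the z-score, plus 50."""
    z = (raw - norm_mean) / norm_sd
    return 21.06 * z + 50

print(round(to_nce(100, 100, 15), 1))   # 50.0, the norming-sample mean
print(round(to_nce(115, 100, 15), 1))   # 71.1, one SD above the mean
print(round(to_nce(85, 100, 15), 1))    # 28.9, one SD below the mean
```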

    As a standard score, NCEs are comparable across all the measures that derive and provide NCE scores from their norming samples if those samples represent the same population. Thus while a raw score of 82 on a particular reading test would not be directly comparable to the same numerical score on a different reading test measuring the same construct but scaled in a different way (i.e., a different mean and standard deviation), the corresponding NCE scores could be compared. For example, if the NCE score corresponding to 82 on the first measure was 68 and that corresponding to 82 on the second measure was 56, we could rightly judge that the first student's reading performance was better than that of the second student.

    When NCE scores are available or can be derived from the scoring scheme for a normed measure, using them to report the pretest and posttest means for the intervention and control samples may help readers better understand the nature of the effects. It is easier to judge the difference between the intervention and control means when the scores are on a 0-100 scale than when they are represented in a less intuitive metric. Thus a 5-point difference on a 0-100 scale might be easier to interpret than a 5-point difference on a raw score metric that ranges from, e.g., 143 to 240. NCE scores also preserve the advantage of standard scores described above of allowing implicit comparisons with the performance of the norming sample. Thus mean scores over 50 show better performance than the comparable norming sample and mean scores under 50 show poorer performance.

    Although standard scores, and NCE scores in particular, offer a number of advantages as a metric with which to describe intervention effects, they have several limitations. First, standard scores are all derived from the norming sample obtained by the developer of the measure. Thus these scores assume that sample is representative of the population of interest to the intervention study and that the samples in the study, in turn, are representative of the norming sample. These assumptions could easily be false for intervention studies that focus on populations distinctly appropriate for the intervention of interest. Similar discrepancies could arise for any age-adjusted standard score if the norming measures and the intervention measures were administered at very different times during the school year; differences could then be the result of predictable growth over the course of that year (Hills 1984).

    Grade Equivalent Scores

    A grade equivalent (GE) is a developmental score reported for many norm-referenced tests that characterizes students' achievement in terms of the grade level of the students in the normative sample with similar performance on that test. Grade equivalent scores are based on the nine-month school year and are represented in terms of the grade level and number of full months within a nine-month school year. A GE score thus corresponds to the mean level of performance at a certain point in time in the school year for a given grade. The grade level is represented by the first number in the GE score and the month of the school year follows after a period, with months ranging from a value of 0 (September) to 9 (June). A GE of 6.2, for example, represents the score that would be achieved by an average student in the sixth grade after completion of the second full month of school. The difference between GE scores of 5.2 (November of grade 5) and 6.2 (November of grade 6) represents one calendar year's growth or change in performance.
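The GE arithmetic just described can be made concrete with a small sketch, assuming the grade.month encoding given above (month 0 = September through 9 = June):

```python
def ge_growth_years(start_ge, end_ge):
    """Growth between two GE scores, in decimal school years."""
    g1, m1 = (int(x) for x in start_ge.split("."))
    g2, m2 = (int(x) for x in end_ge.split("."))
    return (g2 - g1) + (m2 - m1) / 10

print(ge_growth_years("5.2", "6.2"))   # 1.0, one calendar year's growth
print(ge_growth_years("4.0", "4.9"))   # 0.9, September to June of grade 4
```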

    The GE score for an individual student in a given grade, or the mean for a sample of students, is inherently comparative. A GE score that differs from the grade level a student is actually in indicates performance better or worse than that of the average students at that same grade level in the norming sample. If the mean GE for a sample of students tested near the end of the fourth grade is 5.3, for instance, these students are performing at the average level of students tested in December of the fifth grade in the norming sample; that is, they are performing better than expected for their actual grade level. Conversely, if their mean GE is 4.1, they are performing below what is expected for their actual grade level. These comparisons, of course, assume that the norming sample is representative of the population from which the research sample is drawn.

    Intervention effects are represented in terms of GE scores simply as the difference between the intervention and control sample means expressed in GE units. For example, in a study of Success for All (SFA), a comprehensive reading program, Slavin and colleagues (1996) compared the mean GE scores for the intervention and control samples on a reading measure used as a key outcome variable. Though the numerical differences in mean GE scores between the samples were not reported, Figure 6 shows the approximate magnitudes of those differences. The fourth grade students in SFA, for instance, scored on average about 1.8 GE ahead of the control sample, indicating that their performance was closer to that of the mean for fourth graders in the norming sample than that of the control group. Note also that the mean GE for each sample taken by itself identifies the group's performance level relative to the normative sample. The fourth grade control sample mean, at 3.0, indicates that, on average, these students were not performing up to the grade level mean in the norming sample whereas the SFA sample, by comparison, was almost to grade level.

    Figure 6. Mean reading grade equivalent (GE) scores of Success for All and control samples [Adapted from Slavin et al. 1996]

    [Figure: bar graph of mean reading GE score (vertical axis, 1 to 5) by grade (horizontal axis) for the SFA and control samples.]

    The GE score is often used to communicate with educators and parents because of its simplicity and inherent meaningfulness. It makes performance relative to the norm and the magnitude of intervention effects easy to understand. Furthermore, when used to index change over time, the GE score is an intuitive way to represent growth in a student's achievement. The simplic