Translating the statistical representation of the effects of education interventions
7/30/2019 Translating the statistical representation of the effects of education interventions
1/54
NCSER 2013-3000
U.S. DEPARTMENT OF EDUCATION
Translating the Statistical Representation of the Effects of Education Interventions Into More Readily Interpretable Forms
NOVEMBER 2012
Mark W. Lipsey, Kelly Puzio, Cathy Yun, Michael A. Hebert, Kasia Steinka-Fry,
Mikel W. Cole, Megan Roberts, Karen S. Anthony, and Matthew D. Busick
Translating the Statistical Representation of the Effects of Education Interventions Into More Readily Interpretable Forms
NOVEMBER 2012
Mark W. Lipsey, Peabody Research Institute, Vanderbilt University
Kelly Puzio, Department of Teaching and Learning, Washington State University
Cathy Yun, Vanderbilt University
Michael A. Hebert, Department of Special Education and Communication Disorders, University of Nebraska-Lincoln
Kasia Steinka-Fry, Peabody Research Institute, Vanderbilt University
Mikel W. Cole, Eugene T. Moore School of Education, Clemson University
Megan Roberts, Hearing & Speech Sciences Department, Vanderbilt University
Karen S. Anthony, Vanderbilt University
and
Matthew D. Busick, Vanderbilt University
NCSER 2013-3000
U.S. DEPARTMENT OF EDUCATION
This report was prepared for the National Center for Special Education Research, Institute of Education Sciences under Contract ED-IES-09-C-0021.

Disclaimer

The Institute of Education Sciences (IES) at the U.S. Department of Education contracted with Command Decisions Systems & Solutions to develop a report that assists with the translation of effect size statistics into more readily interpretable forms for practitioners, policymakers, and researchers. The views expressed in this report are those of the authors and they do not necessarily represent the opinions and positions of the Institute of Education Sciences or the U.S. Department of Education.
U.S. Department of Education
Arne Duncan, Secretary

Institute of Education Sciences
John Q. Easton, Director

National Center for Special Education Research
Deborah Speece, Commissioner
November 2012
This report is in the public domain. While permission to reprint this publication is not necessary, the citation should be: Lipsey, M.W., Puzio, K., Yun, C., Hebert, M.A., Steinka-Fry, K., Cole, M.W., Roberts, M., Anthony, K.S., & Busick, M.D. (2012). Translating the Statistical Representation of the Effects of Education Interventions into More Readily Interpretable Forms (NCSER 2013-3000). Washington, DC: National Center for Special Education Research, Institute of Education Sciences, U.S. Department of Education. This report is available on the IES website at http://ies.ed.gov/ncser/.
Alternate Formats

Upon request, this report is available in alternate formats such as Braille, large print, audiotape, or computer diskette. For more information, please contact the Department's Alternate Format Center at 202-260-9895 or 202-205-8113.
Disclosure of Potential Conflicts of Interest

There are nine authors for this report with whom IES contracted to develop the discussion of the issues presented. Mark W. Lipsey, Cathy Yun, Kasia Steinka-Fry, Megan Roberts, Karen S. Anthony, and Matthew D. Busick are employees or graduate students at Vanderbilt University; Kelly Puzio is an employee at Washington State University; Michael A. Hebert is an employee at the University of Nebraska-Lincoln; and Mikel W. Cole is an employee at Clemson University. The authors do not have financial interests that could be affected by the content in this report.
Contents
List of Tables
List of Figures
Introduction
    Organization and Key Themes
Inappropriate and Misleading Characterizations of the Magnitude of Intervention Effects
Representing Effects Descriptively
    Configuring the Initial Statistics that Describe an Intervention Effect to Support Alternative Descriptive Representations
        Covariate Adjustments to the Means on the Outcome Variable
        Identifying or Obtaining Appropriate Effect Size Statistics
    Descriptive Representations of Intervention Effects
        Representation in Terms of the Original Metric
        Standard Scores and Normal Curve Equivalents (NCE)
        Grade Equivalent Scores
Assessing the Practical Significance of Intervention Effects
    Benchmarking Against Normative Expectations for Academic Growth
    Benchmarking Against Policy-Relevant Performance Gaps
    Benchmarking Against Differences Among Students
    Benchmarking Against Differences Among Schools
    Benchmarking Against the Observed Effect Sizes for Similar Interventions
    Benchmarking Effects Relative to Cost
        Calculating Total Cost
        Cost-effectiveness
        Cost-benefit
References
List of Tables
1. Pre-post change differentials that result in the same posttest difference
2. Upper percentiles for selected differences or gains from a lower percentile
3. Proportion of intervention cases above the mean of the control distribution
4. Relationship of the effect size and correlation coefficient to the BESD
5. Annual achievement gain: Mean effect sizes across seven nationally-normed tests
6. Demographic performance gaps on mean NAEP scores as effect sizes
7. Demographic performance gaps on SAT 9 scores in a large urban school district as effect sizes
8. Performance gaps between average and weak schools as effect sizes
9. Achievement effect sizes from randomized studies broken out by type of test and grade level
10. Achievement effect sizes from randomized studies broken out by type of intervention and target recipients
11. Estimated costs of two fictional high school interventions
12. Cost-effectiveness estimates for two fictional high school interventions
List of Figures
1. Pre-post change for the three scenarios with the same posttest difference
2. Intervention and control distributions on an outcome variable
3. Percentile values on the control distribution of the means of the control and intervention groups
4. Proportion of the control and intervention distributions scoring above an externally defined proficiency threshold score
5. Binomial effect size display: Proportion of cases above and below the grand median
6. Mean reading grade equivalent (GE) scores of Success for All and control samples [Adapted from Slavin et al. 1996]
Introduction
The superintendent of an urban school district reads an evaluation of the effects of a vocabulary building program on the reading ability of fifth graders in which the primary outcome measure was the CAT/5 reading achievement test. The mean posttest score for the intervention sample was 718 compared to 703 for the control sample. The vocabulary building program thus increased reading ability, on average, by 15 points on the CAT/5. According to the report, this difference is statistically significant, but is this a big effect or a trivial one? Do the students who participated in the program read a lot better now, or just a little better? If they were poor readers before, is this a big enough effect to now make them proficient readers? If they were behind their peers, have they now caught up?
Knowing that this intervention produced a statistically significant positive effect is not particularly helpful to the superintendent in our story. Someone intimately familiar with the CAT/5 (California Achievement Test, 5th edition; CTB/McGraw-Hill 1996) and its scoring may be able to look at these means and understand the magnitude of the effect in practical terms but, for most of us, these numbers have little inherent meaning. This situation is not unusual: the native statistical representations of the findings of studies of intervention effects often provide little insight into the practical magnitude and meaning of those effects. To communicate that important information to researchers, practitioners, and policymakers, those statistical representations must be translated into some form that makes their practical significance easier to infer. Even better would be some framework for directly assessing their practical significance.
This paper is directed to researchers who conduct and report education intervention studies. Its purpose is to stimulate and guide them to go a step beyond reporting the statistics that emerge from their analysis of the differences between experimental groups on the respective outcome variables. With what is often very minimal additional effort, those statistical representations can be translated into forms that allow their magnitude and practical significance to be more readily understood by the practitioners, policymakers, and even other researchers who are interested in the intervention that was evaluated.
Organization and Key Themes
The primary purpose of this paper is to provide suggestions to researchers about ways to present statistical findings about the effects of educational interventions that might make the nature and magnitude of those effects easier to understand. These suggestions and the related discussion are framed within the context of studies that use experimental designs to compare measured outcomes for two groups of participants, one in an intervention condition and the other in a control condition. Though this is a common and, in many ways, prototypical form for studies of intervention effects, there are other important forms. Though not addressed directly, much of what is suggested here can be applied with modest adaptation to experimental studies that compare outcomes for more than two groups or compare conditions that do not include a control (e.g., compare different interventions), and to quasi-experiments that compare outcomes for nonrandomized groups. Other kinds of intervention studies that appear in educational research are beyond the scope of this paper. Most notable among those other kinds are observational studies, e.g., multivariate
analysis of the relationship across schools between natural variation in per pupil funding and student achievement, and single case research designs such as those often used in special education contexts to investigate the effects of interventions for children with low-incidence disabilities.
The discussion in the remainder of this paper is divided into three main sections, each addressing a relatively distinct aspect of the issue. The first section examines two common, but inappropriate and misleading, ways to characterize the magnitude of intervention effects. Its purpose is to caution researchers about the problems with these approaches and provide some context for consideration of better alternatives.
The second section reviews a number of ways to represent intervention effects descriptively. The focus there is on how to better communicate the nature and magnitude of the effect represented by the difference on an outcome variable between the intervention and control samples. For example, it may be possible to express that difference in terms of percentiles or the contrasting proportions of intervention and control participants scoring above a meaningful threshold value. Represented in terms such as those, the nature and magnitude of the intervention effect may be more easily understood and appreciated than when presented as means, regression coefficients, p-values, standard errors, and the like.
The point of departure for descriptive representations of an intervention effect is the set of statistics generated by whatever analysis the researcher uses to estimate that effect. Most relevant are the means and, for some purposes, the standard deviations on the outcome variable for the intervention and control groups. Alternatively, the point of departure might be the effect size estimate, which combines information from the group means and standard deviations and is an increasingly common and frequently recommended way to report intervention effects. However, not every analysis routine automatically generates the statistics that are most appropriate for directly deriving alternative descriptive representations or for computing the effect size statistic as an intermediate step in deriving such representations. This second section of the paper, therefore, begins with a subsection that provides advice about obtaining the basic statistics that support the various representations of intervention effects that are described in the subsections that follow it.
The third section of this paper sketches some approaches that might be used to go beyond descriptive representations to more directly reveal the practical significance of an intervention effect. To accomplish that, the observed effect must be assessed in relationship to some externally defined standard, target, or frame of reference that carries information about what constitutes practical significance in the respective intervention domain. Covered in that section are approaches that benchmark effects within such frameworks as normative growth, differences between students and schools with recognized practical significance, the effects found for other similar interventions, and cost.
Inappropriate and Misleading Characterizations of
the Magnitude of Intervention Effects
Some of the most common ways to characterize the effects found in studies of educational interventions are inappropriate or misleading and thus best avoided. The statistical tests routinely applied to the difference between the means on outcome variables for intervention and control samples, for instance, yield a p value: the estimated probability that a difference that large would be found when, in fact, there was no difference in the population from which the samples were drawn. Very significant differences with, say, p < .001 are often trumpeted as if they were indicative of especially large and important effects, ones that are more significant than if p were only marginally significant (e.g., p = .10) or just conventionally significant (e.g., p = .05). Such interpretations are quite inappropriate. The p-values characterize only statistical significance, which bears no necessary relationship to practical significance or even to the statistical magnitude of the effect. Statistical significance is a function of the magnitude of the difference between the means, to be sure, but it is also heavily influenced by the sample size, the within-samples variance on the outcome variable, the covariates included in the analysis, and the type of statistical test applied. None of the latter is related in any way to the magnitude or importance of the effect.
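The influence of sample size alone can be seen in a short sketch. The numbers are hypothetical, and a simple two-sample z-test with a known common standard deviation stands in for whatever test a study would actually use: the identical mean difference is nowhere near significant with small samples yet highly significant with large ones.

```python
import math

def two_sample_z_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test (equal group sizes, known common SD)."""
    z = mean_diff / (sd * math.sqrt(2.0 / n_per_group))
    # standard normal CDF built from the error function
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# The same 10-point difference on a test with SD 40, at two sample sizes:
print(f"n = 30 per group:  p = {two_sample_z_p(10, 40, 30):.3f}")   # not significant
print(f"n = 500 per group: p = {two_sample_z_p(10, 40, 500):.6f}")  # highly significant
```

Nothing about the effect itself changed between the two calls; only the sample size did.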
When researchers go beyond simply presenting the intervention and control group means and the p-value for the significance test of their difference, the most common way to represent the effect is with a standardized effect size statistic. For continuous outcome variables, this is almost always the standardized mean difference effect size: the difference between the means on an outcome variable represented in standard deviation units. For example, a 10-point difference between the intervention and control means on a reading achievement test with a pooled standard deviation of 40 for those two samples is .25 standard deviation units, that is, an effect size of .25.
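The arithmetic in that example is just a division, using the hypothetical numbers from the text:

```python
# Standardized mean difference for the example above: a 10-point difference
# between group means on a test with a pooled standard deviation of 40.
mean_difference = 10.0
pooled_sd = 40.0
effect_size = mean_difference / pooled_sd
print(effect_size)  # -> 0.25
```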
Standardized mean difference effect sizes are a useful way to characterize intervention effects for some purposes. This effect size metric, however, has very little more inherent meaning than the simple difference between means; it simply transforms that difference into standard deviation units. Interpreting the magnitude or practical significance of an effect size requires that it be compared with appropriate criterion values or standards that are relevant and meaningful for the nature of the outcome variable, sample, and intervention condition on which it is based. We will have more to say about effect sizes and their interpretation later. We raise this matter now only to highlight a widely used but, nonetheless, misleading standard for assessing effect sizes and, at least by implication, their practical significance.
In his landmark book on statistical power, Cohen (1977, 1988) drew on his general impression of the range of effect sizes found in social and behavioral research in order to create examples of power analysis for detecting smaller and larger effects. In that context, he dubbed .20 as "small," .50 as "medium," and .80 as "large." Ever since, these values have been widely cited as standards for assessing the magnitude of the effects found in intervention research despite Cohen's own cautions about their inappropriateness for such general use. Cohen was attempting, in an unsystematic way, to describe the distribution of effect sizes one might find if one piled up all the effect sizes on all the different outcome measures for all the different interventions
targeting individual participants that were reported across the social and behavioral sciences. At that level of generality, one could take any given effect size and say it was in the low, middle, or high range of that distribution.
The problem with Cohen's broad normative distribution for assessing effect sizes is not the idea of comparing an effect size with such norms. Later in this paper we will present some norms for effect sizes from educational interventions and suggest doing just that. The problem is that the normative distribution used as a basis for comparison must be appropriate for the outcome variables, interventions, and participant samples on which the effect size at issue is based. Cohen's broad categories of small, medium, and large are clearly not tailored to the effects of intervention studies in education, much less any specific domain of education interventions, outcomes, and samples. Using those categories to characterize effect sizes from education studies, therefore, can be quite misleading. It is rather like characterizing a child's height as small, medium, or large, not by reference to the distribution of values for children of similar age and gender, but by reference to a distribution for all vertebrate mammals.
McCartney and Rosenthal (2000), for example, have shown that in intervention areas that involve hard to change, low base-rate outcomes, such as the incidence of heart attacks, the most impressively large effect sizes found to date fall well below the .20 that Cohen characterized as small. Those small effects correspond to reducing the incidence of heart attacks by about half, an effect of enormous practical significance. Analogous examples are easily found in education. For instance, many education intervention studies investigate effects on academic performance and measure those effects with standardized reading or math achievement tests. As we show later in this paper, the effect sizes on such measures across a wide range of interventions are rarely as large as .30. By appropriate norms, that is, norms based on empirical distributions of effect sizes from comparable studies, an effect size of .25 on such outcome measures is large, and an effect size of .50, which would be only medium on Cohen's all-encompassing distribution, would be more like huge.
In short, comparisons of effect sizes in educational research with normative distributions of effect sizes to assess whether they are small, middling, or large relative to those norms should use appropriate norms. Appropriate norms are those based on distributions of effect sizes for comparable outcome measures from comparable interventions targeted on comparable samples. Characterizing the magnitude of effect sizes relative to some other normative distribution is inappropriate and potentially misleading. The widespread indiscriminate use of Cohen's generic small, medium, and large effect size values to characterize effect sizes in domains to which his normative values do not apply is thus likewise inappropriate and misleading.
Representing Effects Descriptively
The starting point for descriptive representations of the effects of an educational intervention is the set of native statistics generated by whatever analysis scheme has been used to compare outcomes for the participants in the intervention and control conditions. Those statistics may or may not provide a valid estimate of the intervention effect. The quality of that estimate will depend on the research design, sample size, attrition, reliability of the outcome measure, and a host of other such considerations. For purposes of this discussion, we assume that the researcher begins with a credible estimate of the intervention effect and consider only alternate representations or translations of the native statistics that initially describe that effect.
A closely related alternative starting point for a descriptive representation of an intervention effect is the effect size estimate. Although the effect size statistic is not itself much easier to interpret in practical terms than the native statistics on which it is based, it is useful for other purposes. Most notably, its standardized form (i.e., representing effects in standard deviation units) allows comparison of the magnitude of effects on different outcome variables and across different studies. It is thus well worth computing and reporting in intervention studies but, for present purposes, we include it among the initial statistics for which an alternative representation would be more interpretable by most users.
In the following parts of this section of the paper, we first provide advice for configuring the native statistics generated by common analyses in a form appropriate for supporting alternate descriptive representations. We include in that discussion advice for configuring the effect size statistic as well in a few selected situations that often cause confusion.
Configuring the Initial Statistics that Describe an Intervention Effect to Support Alternative Descriptive Representations
Covariate Adjustments to the Means on the Outcome Variable
Several of the descriptive representations of intervention effects described later are derived directly from the means and perhaps the standard deviations on the outcome variable for the intervention and control groups. However, the observed means for the intervention and control groups may not be the best choice for representing an intervention effect. The difference between those means reflects the effect of the intervention, to be sure, but it may also reflect the influence of any initial baseline differences between the intervention and control groups. The value of random assignment to conditions, of course, is that it permits only chance differences at baseline, but this does not mean there will be no differences, especially if the samples are not large. Moreover, attrition from posttest measurement undermines the initial randomization so that estimates of effects may be based on subsets of the intervention and control samples that are not fully equivalent on their respective baseline characteristics even if the original samples were.
Researchers often attempt to adjust for such baseline differences by including the respective baseline values as covariates in the analysis. The most common and useful covariate is the pretest for the outcome measure, along with basic demographic variables such as age, gender, ethnicity, socioeconomic status, and the like.
Indeed, even when there are no baseline differences to account for, the value of such covariates (especially the pretest) for increasing statistical power is so great that it is advisable to routinely include any covariates that have substantial correlations with the posttest (Rausch, Maxwell, and Kelley 2003). With covariates included in the analysis, the estimate of the intervention effect is the difference between the covariate-adjusted means of the intervention and control samples. These adjusted values better estimate the actual intervention effect by reducing any bias from the baseline differences and thus are the best choices for use in any descriptive representation of that effect. When that representation involves the standard deviations, however, their values should not be adjusted for the influence of the covariates. In virtually all such instances, the standard deviations are used as estimates of the corresponding population standard deviations on the outcome variables without consideration for the particular covariates that may have been used in estimating the difference on the means.
When the analysis is conducted in analysis of covariance format (ANCOVA), most statistical software has an option for generating the covariate-adjusted means. When the analysis is conducted in multiple regression format, the unstandardized regression coefficient for the intervention dummy code (intervention = 1, control = 0; or +0.5 vs. -0.5) is the difference between the covariate-adjusted means. In education, analyses of intervention effects are often multilevel when the outcome of interest is for students or teachers who, in turn, are nested within classrooms, schools, or districts. Using multilevel regression analysis, e.g., HLM, does not change the situation with regard to the estimate of the difference between the covariate-adjusted means: it is still the unstandardized regression coefficient on the intervention dummy code. The unadjusted standard deviations for the intervention and control groups, in turn, can be generated directly by most statistical programs, though that option may not be available within the ANCOVA, multiple regression, or HLM routine itself.
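A minimal pure-Python sketch of the regression point above, using hypothetical data and ordinary least squares solved via the normal equations (a real analysis would of course use a statistics package): the unstandardized coefficient on the 0/1 intervention dummy equals the covariate-adjusted mean difference.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y,
    solved with Gauss-Jordan elimination and partial pivoting."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    A = [row[:] + [b] for row, b in zip(XtX, Xty)]  # augmented matrix
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c and A[r][c] != 0:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical data; columns of X are: intercept, intervention dummy, pretest.
pre  = [10, 12, 14, 16, 11, 13, 15, 17]
trt  = [0, 0, 0, 0, 1, 1, 1, 1]
post = [20, 25, 27, 32, 28, 31, 35, 39]
X = [[1.0, t, p] for t, p in zip(trt, pre)]
b0, b_treat, b_pre = ols(X, post)
print(round(b_treat, 3))  # -> 5.375, the covariate-adjusted mean difference
```

Here the raw difference in posttest means is 33.25 - 26.0 = 7.25, but the pretest-adjusted difference estimated by the dummy coefficient is 5.375, reflecting the baseline advantage given to the hypothetical intervention group.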
For binary outcomes, such as whether students are retained in grade, placed in special education status, or pass an exam, the analytic model is most often logistic regression, a specialized variant of multiple regression for binary dependent variables. The regression coefficient (β) in a logistic regression for the dummy-coded variable representing the experimental condition (e.g., 1 = intervention, 0 = control) is a covariate-adjusted log odds ratio representing the intervention effect (Crichton 2001). Unlogging it (exp β) produces the covariate-adjusted odds ratio for the intervention effect, which can then be converted back into the terms of the original metric.
For example, an intervention designed to improve the passing rate on an algebra exam might produce the results shown below. The odds of passing for a given group are defined as the ratio of the number (or proportion) who pass to the number (or proportion) who fail. For the intervention group, therefore, the odds of passing are 45/15 = 3.0 and, for the control group, the odds are 30/30 = 1.0. The odds ratio characterizing the intervention effect is the ratio of these two values, that is, 3/1 = 3, and indicates that the odds of passing are three times greater for a student in the intervention group than for one in the control group.
Passed Failed
Intervention 45 15
Control 30 30
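The odds and odds-ratio arithmetic just described can be sketched directly from the table:

```python
# Odds and odds ratio for the 2x2 table in the text
# (intervention: 45 passed / 15 failed; control: 30 passed / 30 failed).
def odds(passed, failed):
    return passed / failed

odds_intervention = odds(45, 15)  # 3.0
odds_control = odds(30, 30)       # 1.0
odds_ratio = odds_intervention / odds_control
print(odds_ratio)  # -> 3.0
```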
Suppose the researcher analyzes these outcomes in a logistic regression model with race, gender, and prior math achievement scores included as covariates to control for initial differences between the two groups. If the coefficient on the intervention variable in that analysis, converted to a covariate-adjusted odds ratio, turns out to be 2.53, it indicates that the unadjusted odds ratio overestimated the intervention effect because of baseline differences that favored the intervention group. With this information, the researcher can construct a covariate-adjusted version of the original 2x2 table that estimates the proportions of students passing in each condition when the baseline differences are taken into account. To do this, the frequencies for the control sample and the total N for the intervention sample are taken as given. We then want to know what passing frequency, p, for the intervention group allows the odds ratio, (p x 30)/((60 - p) x 30), to equal 2.53. Solving for p reveals that it must be 43. The covariate-adjusted results, therefore, are as shown below. Described as simple percentages, the covariate-adjusted estimate is that the intervention increased the 50% pass rate of the control condition to 72% (43/60) in the intervention condition.
Passed Failed
Intervention 43 17
Control 30 30
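The adjusted passing frequency can be recovered algebraically rather than by trial and error: with the control cell fixed at 30/30, the adjusted odds ratio reduces to p/(60 - p), so p = OR x 60 / (1 + OR). A minimal sketch (the odds ratio 2.53 and group size 60 come from the example above; the function name is illustrative):

```python
def adjusted_pass_count(odds_ratio, n_group, control_odds=1.0):
    """Passing frequency p such that (p/(n_group - p)) / control_odds
    equals the given covariate-adjusted odds ratio."""
    target_odds = odds_ratio * control_odds
    return target_odds * n_group / (1.0 + target_odds)

p = adjusted_pass_count(2.53, 60)  # control odds are 30/30 = 1.0
print(round(p))  # 43
```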
Identifying or Obtaining Appropriate Effect Size Statistics
A number of the ways of representing intervention effects and assessing their practical significance described
later in this paper can be derived directly from the standardized mean difference effect size statistic,
commonly referred to simply as the effect size. This effect size is defined as the difference between the mean
of the intervention group and the mean of the control group on a given outcome measure divided by the
pooled standard deviations for those two groups, as follows:

ES = (X̄_T - X̄_C) / s_p

where X̄_T is the mean of the intervention sample on an outcome variable, X̄_C is the mean of the control
sample on that variable, and s_p is the pooled standard deviation. The pooled standard deviation is obtained as
the square root of the weighted mean of the two variances, defined as:

s_p = √[((n_T - 1)s_T² + (n_C - 1)s_C²) / (n_T + n_C - 2)]

where n_T and n_C are the number of respondents in the intervention and control groups, and s_T and s_C are
the respective standard deviations on the outcome variable for the intervention and control groups.
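The effect size and pooled standard deviation defined above translate directly into code. This is a minimal sketch; the sample statistics plugged in at the end are hypothetical, chosen only to exercise the formulas:

```python
import math

def pooled_sd(s_t, s_c, n_t, n_c):
    """Square root of the weighted mean of the two group variances."""
    return math.sqrt(((n_t - 1) * s_t**2 + (n_c - 1) * s_c**2) / (n_t + n_c - 2))

def effect_size(mean_t, mean_c, s_t, s_c, n_t, n_c):
    """Standardized mean difference: (intervention mean - control mean) / pooled SD."""
    return (mean_t - mean_c) / pooled_sd(s_t, s_c, n_t, n_c)

# Hypothetical example: equal-sized groups with similar spread
es = effect_size(mean_t=105.0, mean_c=100.0, s_t=14.0, s_c=16.0, n_t=50, n_c=50)
print(round(es, 2))  # 0.33
```

With covariates in the analysis, the numerator would instead be the covariate-adjusted mean difference, while the denominator stays the unadjusted pooled standard deviation, as discussed below.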
The effect size is typically reported to two decimal places and, by convention, has a positive value when the
intervention group does better on the outcome measure than the control group and a negative sign when it
does worse. Note that this may not be the same sign that results from subtraction of the control mean from
the intervention mean. For example, if low scores represent better performance, e.g., as with a measure of the
number of errors made, then subtraction will yield a negative value when the intervention group performs
better than the control, but the effect size typically would be given a positive sign to indicate the better
performance of the intervention group.

Effect sizes can be computed or estimated from many different kinds of statistics generated in intervention
studies. Informative sources for such procedures include the What Works Clearinghouse Procedures and
Standards Handbook (2011; Appendix B) and Lipsey and Wilson (2001; Appendix B). Here we will
only highlight a few features that may help researchers identify or configure appropriate effect sizes for
use in deriving alternative representations of intervention effects. Moreover, many of these features have
implications for statistics other than the effect size that are involved in some representations of intervention
effects.
Clear understanding of what the numerator and denominator of the standardized mean difference effect size
represent will allow many common mistakes and confusions in the computation and interpretation of effect
sizes to be avoided. The numerator of the effect size estimates the difference between the experimental groups
on the means of the outcome variable that is attributable to the intervention. That is, the numerator should
be the best estimate available of the mean intervention effect estimated in the units of the original metric.
As described in the previous subsection, when researchers include baseline covariates in the analysis, the best
estimate of the intervention effect is the difference between the covariate-adjusted means on the outcome
variable, not the difference between the unadjusted means.

The purpose of the denominator of the effect size is to standardize the difference between the outcome
means in the numerator into metric-free standard deviation units. The concept of standardization is critical
here. Standardization means that each effect size is represented in the same way, i.e., in a standard way,
irrespective of the outcome construct, the way it is measured, or the way it is analyzed. The sample standard
deviations used for this purpose estimate the corresponding population standard deviations on the outcome
measure. As such, the standard deviations should not be adjusted by any covariates that happened to be used
in the design or analysis of the particular study. Such adjustments would not have general applicability to
other designs and measures and thus would compromise the standardization that is the point of representing
the intervention effect in standard deviation units. This means that the raw standard deviations for the
intervention and control samples should be pooled into the effect size denominator, even when multilevel
analysis models with complex variance structures are used.
Pooling the sample standard deviations for the intervention and control groups is intended to provide the
best possible estimate of the respective population standard deviation by using all the data available. This
procedure assumes that both those standard deviations estimate a common population standard deviation.
This is the homogeneity of variance assumption typically made in the statistical analysis of intervention
effects. If homogeneity of variance cannot be assumed, then consideration has to be given to the reason why
the intervention and control group variances differ. In a randomized experiment, this should not occur on
outcome variables unless the intervention itself affects the variance in the intervention condition. In that
case, the better estimate may be the standard deviation of the control group even though it is estimated on a
smaller sample than the pooled version.
In the multilevel situations common in education research, a related matter has to do with the population
that is relevant for purposes of standardizing the intervention effect. Consider, for example, outcomes on an
achievement test that is, or could be, used nationally. The variance for the national population of students
can be partitioned into between and within components according to the different units represented at
different levels. Because state education systems differ, we might first distinguish between-state and within-
state variance. Within states, there would be variation between districts; within districts, there would
be variation between schools; within schools, there would be variation between classrooms; and within
classrooms, there would be variation between students. The total variance for the national population can
thus be decomposed as follows (Hedges 2007):

σ²_total = σ²_between-states + σ²_between-districts + σ²_between-schools + σ²_between-classrooms + σ²_between-students
In an intervention study using a national sample, the sample estimate of the standard deviation includes all
these components. Any effect size computed with that standard deviation is thus standardizing the effect size
with the national population variance as the reference value. The standard deviation computed in a study
using a sample of students from a single classroom, on the other hand, estimates only the variance of the
population of students who might be in that classroom in that school in that district in that state. In other
words, this standard deviation does not include the between-classroom, between-school, between-district,
and between-state components that would be included in the estimate from a national sample. Similarly,
an intervention study that draws its sample from one school, or one district, will yield a standard deviation
estimate that is implicitly using a narrower population as the basis for standardization than a study with a
broader sample. This will not matter if there are no systematic differences on the respective outcome measure
between students in different states, districts, schools, and classrooms, i.e., if those variance components are
zero. With student achievement measures, we know this is generally not the case (e.g., Hedges and Hedberg
2007). Less evidence is available for other measures used in education intervention studies, but it is likely
that most of them also show nontrivial differences between these different units and levels.
Any researcher computing effect sizes for an intervention study or using them as a basis for alternative
representations of intervention effects should be aware of this issue. Effect sizes based on samples of narrower
populations will be larger than effect sizes based on broader samples even when the actual magnitudes of the
intervention effects are identical. And that difference will be carried through to any other representation of
the intervention effect that is based on the effect size. Compensating for that difference, if appropriate, will
require adding or subtracting estimates of the discrepant variance components, with the possibility that those
components will have to be estimated from sources outside the research sample itself.
The discussion above assumes that the units on which the sample means and standard deviations are
computed for an outcome variable are individuals, e.g., students. The nested data structures common in
education intervention studies, however, provide different units on which means and standard deviations
can be computed, e.g., students, clusters of students in classrooms, and clusters of classrooms in schools.
For instance, in a study of a whole school intervention aimed at improving student achievement, with
some schools assigned to the intervention condition and others to the control, there are two effect sizes the
researcher could estimate. The conventional effect size would standardize the intervention effect estimated
on student scores using the pooled student-level standard deviations. Alternatively, the student-level scores
might be aggregated to the school level and the school-level means could be used to compute an effect size.
That effect size would represent the intervention effect in standard deviation units that reflect the variance
between schools, not that between students. The result is a legitimate effect size, but the school units on
which it is based make this effect size different from the more conventional effect size that is standardized on
variation between individuals.
The numerators of these two effect sizes would not necessarily differ greatly. The respective means of the
student scores in the intervention and control groups would be similar to the means of the school-level
means for those same students unless the number of students in each school differs greatly and is correlated
with the school means. However, the standard deviations will be quite different because the variance
between schools is only one component of the total variance between students. Between-school variance on
achievement test scores is typically around 20-25% of the total variance, the intraclass correlation coefficient
(ICC) for schools (Hedges and Hedberg 2007). The between-schools standard deviation thus will be about
√.25 = .50 or less of the student-level standard deviation and the effect size based on school units will be
about twice as large as the effect size based on students as the units even though both describe the same
intervention effect.
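The relationship just described can be made concrete: if the ICC gives the share of total variance lying between schools, the between-school standard deviation is √ICC times the student-level standard deviation, so an effect size standardized on between-school variation is the student-level effect size divided by √ICC. A minimal sketch under an assumed ICC of .25 (the function name is illustrative):

```python
import math

def school_level_effect_size(student_es, icc):
    """Effect size standardized on the between-school SD, given the
    student-level effect size and the intraclass correlation (the share
    of total variance that lies between schools)."""
    return student_es / math.sqrt(icc)

# With ICC = .25, the between-school SD is sqrt(.25) = .50 of the
# student-level SD, so the school-level effect size is twice as large.
print(school_level_effect_size(0.20, 0.25))  # 0.4
```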
Similar situations arise in multilevel samples whenever the units on which the outcome is measured are
nested within higher level clusters. Each such higher level cluster allows for its own distinctive effect size
to be computed. A researcher comparing effect sizes in such situations or, more to the point for present
purposes, using an effect size to derive other representations of intervention effects, must know which effect
size is being used. An effect size standardized on a between-cluster variance component will nearly always be
larger than the more conventional effect size standardized on the total variance across the lower level units
on which the outcome was directly measured. That difference in numerical magnitude will then be carried
into any alternate representation of the intervention effect based on that effect size and the results must be
interpreted accordingly.
Descriptive Representations of Intervention Effects
Representation in Terms of the Original Metric
Before looking at different ways of transforming the difference between the means of the intervention
and control samples into a different form, we should first consider those occasional situations in which
differences on the original metric are easily understood without such manipulations. This occurs when the
units on the measure are sufficiently familiar and well defined that little further description or interpretation
is needed. For example, an outcome measure for a truancy reduction program might be the proportion
of days on which attendance was expected for which the student was absent. The outcome score for each
student, therefore, is a simple proportion and the corresponding value for the intervention or control
groups is the mean proportion of days absent for the students in that group. Common events of this sort in
education that can be represented as counts or proportions include dropping out of school, being expelled
or suspended, being retained in grade, being placed in special education status, scoring above a proficiency
threshold on an achievement test, completing assignments, and so forth.
Intervention effects on outcome measures that involve well recognized and easily understood events can
usually be readily interpreted in their native form by researchers, practitioners, and policymakers. Some
caution is warranted, nevertheless, in presenting the differences between intervention and control groups
in terms of the proportions of such events. Differences between proportions can have different implications
depending on whether those differences are viewed in absolute or relative terms. Consider, for example,
a difference of three percentage points between the intervention and control groups in the proportion
suspended during the school year. Viewed in absolute terms, this appears to be a small difference. But relative
to the suspension rate for the control sample a three-point decrease might be substantial. If the suspension
rate for the control sample is only 5%, for instance, a decrease of three percentage points reduces that rate by
more than half. On the other hand, if the control sample has a suspension rate of 40%, a reduction of three
percentage points might rightly be viewed as rather modest.
In some contexts, the numerical values on an outcome measure that does not represent familiar events may
still be sufficiently familiar that differences are well understood despite having little inherent meaning.
This might be the case, for instance, with widely used standardized tests. For example, the Peabody Picture
Vocabulary Test (PPVT; Dunn and Dunn 2007), one of the most widely used tests in education, is normed
so that standard scores have a mean of 100 for the general population of children at any given age. Many
researchers and educators have sufficient experience with this test to understand what scores lower or
higher than 100 indicate about children's skill level and how much of an increase constitutes a meaningful
improvement. Generally speaking, however, such familiarity with the scoring of a particular measure of this
sort is not widespread and most audiences will need more information to be able to interpret intervention
effects expressed in terms of the values generated by an outcome measure.
Intervention Effects in Relation to Pre-Post Change
When pretest measures of an outcome variable are available, the pretest means may be used to provide an
especially informative representation of intervention effects using the original metric. This follows from the
fact that the intent of interventions is to bring about change in the outcome; that is, change between pretest
and posttest. The full representation of the intervention effect, therefore, is not simply the difference between
the intervention and control samples on the outcome measure at posttest, but the differential change
between pretest and posttest on that outcome. By showing effects as differential change, the researcher
reveals not only the end result but the patterns of improvement or decline that characterize the intervention
and control groups.
Consider, for example, a program that instructs middle school students in conflict resolution techniques
with the objective of decreasing interpersonal aggression. Student surveys at the beginning and end of the
school year are administered for intervention and control schools that provide composite scores for the
amount of physical, verbal, and relational aggression students experience. These surveys show significantly
lower levels for the intervention schools than the control schools, indicating a positive effect of the conflict
resolution program, say a mean of 23.8 for the overall total score for the students in the intervention
schools and 27.4 for the students in the control schools. That 3.6-point favorable outcome difference,
however, could have come from any of a number of different patterns of change over the school year for
the intervention and control schools. Table 1 below shows some of the possibilities, all of which assume an
effective randomization so that the pretest values at the beginning of the school year were virtually identical
for the intervention and control schools.
Table 1. Pre-post change differentials that result in the same posttest difference
              Scenario A            Scenario B            Scenario C
              Pretest   Posttest    Pretest   Posttest    Pretest   Posttest
Intervention  25.5      23.8        17.7      23.8        22.9      23.8
Control       25.6      27.4        17.6      27.4        23.0      27.4
As can be seen even more clearly in Figure 1, for Scenario A the aggression levels decreased somewhat in the
intervention schools while increasing in the control schools. In Scenario B, the aggression levels increased
quite a bit (at least relative to the intervention effect) in both samples, but the amount of the increase was
not as great in the intervention schools as the control schools. In Scenario C, on the other hand, there was
little change in the reported level of aggression over the course of the year in the intervention schools, but
things got much worse during that time in the control schools. These different patterns of differential pre-
post change depict different trajectories for aggression absent intervention and give different pictures of
what it is that the intervention accomplished. In Scenario A it reversed the trend that would have otherwise
occurred. In Scenario B, it ameliorated an adverse trend, but did not prevent it from getting worse. In
Scenario C, the intervention did not produce appreciable improvement over time, but kept the amount of
aggression from getting worse.
Figure 1. Pre-post change for the three scenarios with the same posttest difference

[Three line charts, one for each of Scenarios A, B, and C, each plotting the intervention and control group
means at pretest and posttest on a vertical axis running from 15 to 31.]
As this example illustrates, a much fuller picture of the intervention effect is provided when the difference
between the intervention and control samples on the outcome variable is presented in relation to where those
samples started at the pretest baseline. A finer point can be put on the differential change for the
intervention and control groups, if desired, by proportioning the intervention effect against the control
group pre-post change. In Scenario B above, for instance, the difference between the control group's pretest
and posttest composite aggression scores is 9.8 (27.4 - 17.6) while the posttest difference between the
intervention and control group is -3.6 (23.8 - 27.4). The intervention, therefore, reduced the pre-post
increase in the aggression score by 36.7% (-3.6/9.8).
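The Scenario B calculation can be expressed as a small helper; a minimal sketch using the means from Table 1 (the function name is illustrative, and equivalent pretest means are assumed, as under effective randomization):

```python
def pct_of_control_change(pre_c, post_c, post_t):
    """Intervention effect expressed as a percentage of the control group's
    pre-post change, assuming equivalent pretest means in the two groups."""
    control_change = post_c - pre_c   # e.g., 27.4 - 17.6 = 9.8
    effect = post_t - post_c          # e.g., 23.8 - 27.4 = -3.6
    return 100.0 * effect / control_change

# Negative value: the intervention reduced the pre-post increase by 36.7%
print(round(pct_of_control_change(17.6, 27.4, 23.8), 1))  # -36.7
```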
Overlap Between Intervention and Control Distributions
If the distributions of scores on an outcome variable were plotted separately for the intervention and
control samples, they might look something like Figure 2 below. The magnitude of the intervention effect
is represented directly by the difference between the means of the two distributions. The standardized
mean difference effect size, discussed earlier, also represents the difference between the two means, but
does so in standard deviation units. Still another way to represent the difference between the outcomes
for the intervention and control groups is in terms of the overlap between their respective distributions.
When the difference between the means is larger, the overlap is smaller; when the difference between the
means is smaller, the overlap is larger. The amount of overlap, in turn, can be described in terms of the
proportion of individuals in each distribution who are above or below a specified reference point on one
of the distributions. Proportions of this sort are often easy to understand and appraise and, therefore, may
help communicate the magnitude of the effect. Various ways to take advantage of this circumstance in the
presentation of intervention effects are described below.
Figure 2. Intervention and control distributions on an outcome variable
Intervention Effects Represented in Percentile Form
For outcomes assessed with standardized measures that have norms, a simple representation of the
intervention effect is to characterize the means of the control and intervention samples according to the
percentile values they represent in the norming distribution. For a normed, standardized measure, these
values are often provided as part of the scoring scheme for that measure. On a standardized math
achievement test, for instance, suppose that the mean of the control sample fell at the 47th percentile
according to the test norms for the respective age group and the mean of the intervention sample fell at
the 52nd percentile. This tells us, first, that the mean outcome for the control sample is somewhat below
average performance (50th percentile) relative to the norms and, second, that the effect of the intervention
was to improve performance to the point where it was slightly above average. In addition, we see that
the corresponding increase was 5 percentile points. That 5-percentile-point difference indicates that the
individuals receiving the intervention, on average, have now caught up with the 5% of the norming
population that otherwise scored just above them.
In their study of Teach For America, Decker, Mayer, and Glazerman (2004) used percentiles in this way to
characterize the statistically significant effect they found on student math achievement scores. Decker et al.
also reported the pretest means as percentiles so that the relative gain of the intervention sample was evident.
This representation revealed that the students in the control sample were at the 15th percentile at both the
pretest and posttest whereas the intervention sample gained 3 percentiles by moving from the 14th to the
17th percentile.
It should be noted that the percentile differences on the norming distribution that are associated with a
given difference in scores will vary according to where the scores fall in the distribution. Table 2 shows the
percentile levels for the mean score of the lower scoring experimental group (e.g., control group when its
mean score is lower than that of the treatment group) in the first column. The numbers in the body of the
table then show the corresponding percentile level of the other group (e.g., treatment) that are associated
with a range of score differences represented in standard deviation units (which therefore means these
differences can also be interpreted as standardized mean difference effect sizes). As shown there, if one group
scores at the 50th percentile and the other has a mean score that is .50 standard deviations higher, that
higher group will be at the 69th percentile for a difference of 19 in the percentile ranking. A .50 standard
deviation difference between a group scoring at the 5th percentile and a higher scoring group will put that
other group at the 13th percentile for a difference of only 8 in the percentile ranking. This same pattern for
differences between two groups also applies to pre-post gains for one group. Researchers should therefore be
aware that intervention effects represented as percentile differences or gains on a normative distribution can
look quite different depending on whether the respective scores fall closer to the middle or the extremes of
the distribution.
Table 2. Upper percentiles for selected differences or gains from a lower percentile

Lower         Difference or Gain in Standard Deviations
Percentile     .10    .20    .50    .80   1.00
5th              6      7     13     20     26
10th            12     14     22     32     39
15th            17     20     30     41     48
25th            28     32     43     54     62
50th            54     58     69     79     84
75th            78     81     88     93     95
85th            87     89     94     97     98
90th            92     93     96     98     99
95th            96     97     98     99     99

NOTE: Table adapted from Albanese (2000).
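Assuming normal distributions, the entries in Table 2 follow directly from the normal CDF: the upper percentile is Φ(Φ⁻¹(lower percentile) + d) for a difference or gain of d standard deviations. A minimal sketch using Python's standard-library NormalDist:

```python
from statistics import NormalDist

def upper_percentile(lower_pct, gain_sd):
    """Percentile reached after a difference or gain of gain_sd standard
    deviations, starting from lower_pct, assuming a normal distribution."""
    z = NormalDist().inv_cdf(lower_pct / 100.0)
    return 100.0 * NormalDist().cdf(z + gain_sd)

# Reproduces Table 2 entries
print(round(upper_percentile(50, 0.50)))  # 69
print(round(upper_percentile(5, 0.50)))   # 13
print(round(upper_percentile(50, 1.00)))  # 84
```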
A similar use of percentiles can be applied to the outcome scores in the intervention and control groups
when those scores are not referenced to a norming distribution. The distribution of scores for the control
group, which represents the situation in the absence of any influence from the intervention under study,
can play the role of the norming distribution in this application. The proportion of scores falling below the
control group and intervention group means can then be transformed into the corresponding percentile
values on the control distribution. These values can be obtained from the cumulative frequency tables that
most statistical analysis computer programs readily produce for the values on any variable. For a symmetrical
distribution, the mean of the control sample will be at the 50th percentile (the median). The mean score
for the intervention sample can then be represented in terms of its percentile value on that same control
distribution. Thus we may find that the mean for the intervention group falls at the 77th percentile of the
control distribution, indicating that its mean is now higher than 77% of the scores in the control sample.
With a control group mean at the 50th percentile, another way of describing the difference is that the
intervention has moved 27% of the sample from a score below the control mean to one above that mean.
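The percentile of the intervention mean on the control distribution can be read off an empirical cumulative frequency without assuming normality. A minimal sketch with made-up score lists (the data and function name are purely illustrative):

```python
def percentile_on_control(control_scores, value):
    """Percentage of control scores falling below the given value,
    i.e., the value's percentile on the control distribution."""
    below = sum(1 for s in control_scores if s < value)
    return 100.0 * below / len(control_scores)

# Hypothetical outcome scores for the control group
control = [10, 12, 13, 15, 16, 18, 20, 21, 23, 25]
intervention_mean = 22.0

print(percentile_on_control(control, intervention_mean))  # 80.0
```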
This comparison is shown in Figure 3.
Figure 3. Percentile values on the control distribution of the means of the control and
intervention groups
Intervention Effects Represented as Proportions Above or Below a Reference Value
The representation of an intervention effect as the percentiles on a reference distribution, as described
above, is based on the proportions of the respective groups above and below a specific threshold value on
that reference distribution. A useful variant of this approach is to select an informative threshold value on
the control distribution and depict the intervention effect in terms of the proportion of intervention cases
above or below that value in comparison to the corresponding proportions of control cases. The result then
indicates how many more of the intervention cases are in the desirable range defined by that threshold than
expected without the intervention.

When available, the most meaningful threshold value for comparing proportions of intervention and control
cases is one externally defined to have substantive meaning in the intervention context. Such threshold
values are often defined for criterion-referenced tests. For example, thresholds have been set for the National
Assessment of Educational Progress (NAEP) achievement tests with cutoff scores that designate Basic,
Proficient, and Advanced levels of performance. On the NAEP math achievement test, for instance, scores
between 299 and 333 are identified as indicating that 8th grade students are proficient. If we imagine that
we might assess a math intervention using the NAEP test, we could compare the proportion of students
in the intervention versus control conditions who scored 300 or above, that is, were at least minimally
proficient. Figure 4 shows what the results might look like. In this example, 36% of the control students
scored above that threshold level whereas 45% of the intervention students did so.
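Comparing proportions above a substantive threshold is a one-line computation per group. A minimal sketch with hypothetical NAEP-style score lists and the 300-point cutoff from the example (the data are invented for illustration):

```python
def proportion_at_or_above(scores, threshold):
    """Share of a group scoring at or above the threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)

# Hypothetical score lists; 300 marks minimal proficiency in the example
control = [280, 290, 295, 298, 305, 310, 285, 275, 302, 299]
intervention = [285, 295, 301, 305, 310, 315, 290, 300, 298, 320]

print(proportion_at_or_above(control, 300))       # 0.3
print(proportion_at_or_above(intervention, 300))  # 0.6
```

The difference between the two proportions is the quantity of interest: how many more intervention cases land in the proficient range than would be expected without the intervention.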
Figure 4. Proportion of the control and intervention distributions scoring above an externally
defined proficiency threshold score
Similar thresholds might be available from the norming data for a standardized measure. For example, the
mean standard score for the Peabody Picture Vocabulary Test (PPVT; Dunn and Dunn 2007) is 100, which
is, therefore, the mean age-adjusted score in the norming sample. Assuming representative norms, that score
represents the population average for children of any given age. For an intervention with the PPVT as an
outcome measure, the intervention effect could be described in terms of the proportion of children in the
intervention versus control samples scoring 100 or above. If the proportion for either sample is at least .50, it
tells us that their performance is average for their age. Suppose that for a control group, 32% scored 100 or
above at posttest, identifying them as a low performing sample. If 38% of the intervention group scored 100
or above, we see that the effect of the intervention has been to move 6% of the children from the below
average to the above average range. At the same time, we see that this has not been sufficient to close the gap
between them and normative performance on this measure.
With a little effort, researchers may be able to identify meaningful threshold values for measures that do
not already have one defined. Consider a multi-item scale on which teachers rate the problem behavior of
students in their classrooms. When pretest data are collected on this measure, the researcher might also ask
each teacher to nominate several children who are borderline: not presenting significant behavior problems,
but close to that point. The scores of those children could then be used to identify the approximate point on
the rating scale at which teachers begin to view the classroom behavior of a child as problematic. That score
then provides a threshold value that allows the researcher to describe the effects of, say, a classroom behavior
management program in terms of how many fewer students in the intervention condition than the control
condition fall in the problem range. Differences on the means of an arbitrarily scored multi-item rating scale, though critical to the statistical analysis, are not likely to convey the magnitude of the effect as graphically as this translation into proportions of children above a threshold teachers themselves identify.
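The borderline-nomination procedure described above can be sketched in a few lines. Everything here is illustrative: the function names and the rating data are made up, and the choice of the median of the nominated students' scores as the threshold is one reasonable option, not a prescribed rule:

```python
from statistics import median

def problem_threshold(borderline_scores):
    """Approximate the rating at which teachers begin to see behavior as
    problematic, using the median score of teacher-nominated 'borderline'
    students (one plausible summary; the text does not prescribe one)."""
    return median(borderline_scores)

def proportion_in_problem_range(scores, threshold):
    """Proportion of a sample rated at or above the threshold
    (higher ratings indicate more problem behavior)."""
    return sum(s >= threshold for s in scores) / len(scores)

# Illustrative (made-up) rating-scale data
borderline = [14, 16, 15, 17, 15]
control = [10, 18, 22, 9, 16, 20, 13, 19]
intervention = [8, 12, 16, 7, 14, 11, 18, 10]

t = problem_threshold(borderline)                    # -> 15
print(proportion_in_problem_range(control, t))       # -> 0.625
print(proportion_in_problem_range(intervention, t))  # -> 0.25
```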
Absent a substantively meaningful threshold value, an informative representation of the intervention effect might still be provided with a generic threshold value. Cohen (1988), for instance, used the control group mean as a general threshold value to create an index he called U3, one of several indices he proposed to describe the degree of non-overlap between control and intervention distributions. The example shown in Figure 3, presented earlier to illustrate the use of percentile values, similarly made the control group mean the key reference value.
With the actual scores in hand for the control and intervention groups, it is straightforward for a researcher to determine the proportion of each above (or below) the control mean. Assuming normal distributions, those proportions and the corresponding percentiles for the control and intervention means can easily be linked to the standardized mean difference effect size through a table of areas under the normal curve. The mean of a normally distributed control sample is at the 50th percentile with a z-score of zero. Adding the standardized mean difference effect size to that z-score then identifies the z-score of the intervention mean on the control distribution. With a table of areas under the normal curve, that z-score, in turn, can be converted to the equivalent percentile and proportions in the control distribution. Table 3 shows the proportion of intervention cases above the control sample mean for different standardized mean difference effect size values, assuming normal distributions; this is Cohen's (1988) U3 index. In each case, the increase over .50 indicates the additional proportion of the cases that the intervention has pushed above that control condition mean.
Rosenthal and Rubin (1982) described yet another generic threshold for comparing the relative proportions of the control and intervention groups attaining it, within a framework they called the Binomial Effect Size Display (BESD). In this scheme, the key success threshold value is the grand median of the combined intervention and control distributions. When there is no intervention effect, the means of both the intervention and control distributions fall at that grand median. As the intervention effect gets larger and the intervention and control distributions separate, smaller proportions of the control distribution and larger proportions of the intervention distribution fall above that grand median. Figure 5 depicts this situation.
Table 3. Proportion of intervention cases above the mean of the control distribution
Effect Size    Proportion above the Control Mean    Effect Size    Proportion above the Control Mean
.10 .54 1.30 .90
.20 .58 1.40 .92
.30 .62 1.50 .93
.40 .66 1.60 .95
.50 .69 1.70 .96
.60 .73 1.80 .96
.70 .76 1.90 .97
.80 .79 2.00 .98
.90 .82 2.10 .98
1.00 .84 2.20 .99
1.10 .86 2.30 .99
1.20 .88 2.40 .99
Figure 5. Binomial effect size display: proportion of cases above and below the grand median
Using the grand median as the threshold value makes the proportion of the intervention sample above the threshold value equal to the proportion of the control sample below that value. The difference between these proportions, which Rosenthal and Rubin called the BESD index, indicates how many more intervention cases are above the grand median than control cases. Assuming normal distributions, the BESD can also be linked to the standardized mean difference effect size. An additional and occasionally convenient feature of the BESD is that it is equal to the effect size expressed as a correlation; that is, the correlation between the treatment variable (coded as 1 vs. 0) and the outcome variable. Many researchers are more familiar with
correlations than standardized mean differences, so the magnitude of the effect expressed as a correlation may be somewhat more interpretable for them. Table 4 shows the proportions above and below the grand median and the BESD as the intervention effect sizes get larger. It also shows the corresponding correlational equivalents for each effect size and BESD.
Table 4. Relationship of the effect size and correlation coefficient to the BESD

Effect Size    r    Proportion of control/intervention cases above the grand median    BESD (difference between the proportions)
.10 .05 .47 / .52 .05
.20 .10 .45 / .55 .10
.30 .15 .42 / .57 .15
.40 .20 .40 / .60 .20
.50 .24 .38 / .62 .24
.60 .29 .35 / .64 .29
.70 .33 .33 / .66 .33
.80 .37 .31 / .68 .37
.90 .41 .29 / .70 .41
1.00 .45 .27 / .72 .45
1.10 .48 .26 / .74 .48
1.20 .51 .24 / .75 .51
1.30 .54 .23 / .77 .54
1.40 .57 .21 / .78 .57
1.50 .60 .20 / .80 .60
1.60 .62 .19 / .81 .62
1.70 .65 .17 / .82 .65
1.80 .67 .16 / .83 .67
1.90 .69 .15 / .84 .69
2.00 .71 .14 / .85 .71
2.10 .72 .14 / .86 .72
2.20 .74 .13 / .87 .74
2.30 .75 .12 / .87 .75
2.40 .77 .11 / .88 .77
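Under the same normality assumption, the quantities in Table 4 can be computed directly from the effect size. The sketch below assumes equal group sizes and uses the standard d-to-r conversion; the function name is our own label for illustration:

```python
import math

def besd(d: float):
    """Binomial Effect Size Display for a standardized mean difference d,
    assuming normal distributions and equal group sizes.  Converts d to
    the correlation r, then splits the groups around the grand median:
    the control proportion above it is .5 - r/2 and the intervention
    proportion is .5 + r/2, so their difference (the BESD index) equals r."""
    r = d / math.sqrt(d ** 2 + 4)   # standard d-to-r conversion for equal n
    return 0.5 - r / 2, 0.5 + r / 2, r

p_control, p_intervention, r = besd(1.00)
print(f"{p_control:.2f} / {p_intervention:.2f}, BESD = r = {r:.2f}")
```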
All the variations on representing the proportions of the intervention and control group distributions above or below a threshold value require dichotomizing the respective distributions of scores. It should be noted that we are not advocating that the statistical analysis be conducted on any such dichotomized data. It is well known that such crude dichotomizations discard useful data and generally weaken the analysis (Cohen 1983; MacCallum et al. 2002). What is being suggested is that, after the formal statistical analysis
is done and the results are known, depicting intervention effects in one of the ways described here may communicate their magnitude and practical implications better than means, standard deviations, t-test values, and the other native statistics that result directly from the analysis.
In applying any of these techniques, some consideration should be given to the shape of the respective distributions. When the outcome scores are normally distributed, the application of these techniques is relatively tidy and straightforward. When the data are not normally distributed, the respective empirical distributions can always be dichotomized to determine what proportions of cases are above or below any given reference value of interest, but the linkage between those proportions and other representations may be problematic or misleading. Percentile values, and differences in those values, for instance, take on quite a different character in skewed distributions with long tails than in normal distributions, as do the standard deviation units in which standardized mean difference effect sizes are represented.
Standard Scores and Normal Curve Equivalents (NCE)
Standard scores are a conversion of the raw scores on a norm-referenced test that draws upon the norming sample used by the test developer to characterize the distribution of scores expected from the population for which the test is intended. A linear transform of the raw scores is applied to produce tidier numbers for the mean and standard deviation. For many standardized measures, for instance, the standard score mean may be set at 100 with a standard deviation of 15.
Presenting intervention effects in terms of standard scores can make those effects easier to understand in some regards. For example, the mean scores for the intervention and control groups can be easily assessed in relation to the mean for the norming sample. Mean scores below the standardized mean score, e.g., 100, indicate that the sample, on average, scores below the mean for the population represented in the norming sample. Similarly, a standard score mean of, say, 95 for the control group and 102 for the intervention group indicates that the effect of the intervention was to improve the scores of an underperforming group to the point where their scores were more typical of the average performance of the norming sample.
An important characteristic of standard scores for tests and measures used to assess student performance is that those scores are typically adjusted for the age of the respective students. The population represented in the norming sample from which the standard scores are derived is divided into age or school grade groups, and the standard scores are determined for each group. Thus the standard scores for, say, the students in the norming sample who are in the fourth grade and average 9 years of age may be scaled to have a mean of 100 and a standard deviation of 15, but so will the standard scores for the students in the sixth grade with an average age of 11 years. Different standardized measures may use different age groups for this purpose, e.g., differing by as little as a month or two or as much as a year or more.
These age adjustments of standard scores have implications for interpreting changes in those scores over time because those changes are depicted relative to the change for same-aged groups in the norming sample. A control sample with a mean standard score of 87 on the pretest and a mean score of 87 on the posttest a year later has not failed to make gains but, rather, has simply kept abreast of the differences by age in the norming sample. On the other hand, an intervention group with a mean pretest standard score of 87 and mean
posttest score of 95 has improved at a rate faster than that represented in the comparable age differences in the norming sample. This characteristic allows some interpretation of the extent to which intervention effects accelerate growth, though that depends heavily on the assumption that the sample used in the intervention study is representative of the norming sample used by the test developer.
Reporting intervention effects in standard score units thus has some modest advantages for interpretability because of the implicit comparison with the performance of the norming sample. Moreover, the means and standard deviations for standard scores are usually assigned simple round numbers that are easy to remember when making such comparisons. In other ways standard scores are not so tidy. Most notably, standard scores typically have a rather odd range. With a normal distribution encompassing more than 99% of the scores within ±3 standard deviations, standard scores with a mean of 100 and a standard deviation of 15 will range from about 55 at the lowest to about 145 at the highest. These are not especially intuitive numbers for the bottom and top of a measurement scale. For this reason, researchers may prefer to represent treatment effects in terms of some variant of standard scores. One such variant that is well known in education is the normal curve equivalent.
Normal curve equivalents. Normal curve equivalents (NCE) are a metric developed in 1976 for the U.S. Department of Education for reporting scores on norm-referenced tests and allowing comparison across tests (Hills 1984; Tallmadge and Wood 1976). NCE scores are standard scores based on an alternative scaling of the z-scores for measured values in a normal distribution derived from the norming sample for the measure. Unlike the typical standard score, as described above, NCE scores are scaled so that they range from a low around 0 to a high of around 100, with a mean of 50. NCE scores, therefore, allow scores, differences in scores, and changes in scores to be appraised on a 100-point scale that starts at zero.
NCE scores are computed by first transforming the original raw scores into normalized z-scores. The z-score is the original score minus the mean for all the scores, divided by the standard deviation; it indicates the number of standard deviations above or below a mean of zero that the score represents. The NCE score is then computed as NCE = 21.06(z-score) + 50; that is, 21.06 times the z-score plus 50. This produces a set of NCE scores with a mean of 50 and a standard deviation of 21.06. Note that the standard deviation for NCE scores is not as tidy as the round number typically used for other standard scores, but it is required to produce the other desirable characteristics of NCE scores.
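The NCE transformation above can be sketched as follows. For simplicity this version computes z-scores linearly from the sample's own mean and standard deviation, which is a reasonable approximation only when the scores are roughly normal; in practice the normalized z-scores would come from the test's norming tables rather than from the sample itself:

```python
from statistics import mean, pstdev

def nce_scores(raw_scores):
    """Apply NCE = 21.06 * z + 50 to a list of raw scores, standardizing
    against the sample mean and (population) SD for illustration."""
    m, sd = mean(raw_scores), pstdev(raw_scores)
    return [21.06 * (x - m) / sd + 50 for x in raw_scores]

print(nce_scores([80, 90, 100, 110, 120]))
```

The resulting scores have a mean of 50 and a standard deviation of 21.06, as described in the text.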
As a standard score, NCEs are comparable across all the measures that derive and provide NCE scores from their norming samples if those samples represent the same population. Thus while a raw score of 82 on a particular reading test would not be directly comparable to the same numerical score on a different reading test measuring the same construct but scaled in a different way (i.e., with a different mean and standard deviation), the corresponding NCE scores could be compared. For example, if the NCE score corresponding to 82 on the first measure was 68 and that corresponding to 82 on the second measure was 56, we could rightly judge that the first student's reading performance was better than that of the second student.
When NCE scores are available or can be derived from the scoring scheme for a normed measure, using them to report the pretest and posttest means for the intervention and control samples may help readers better understand the nature of the effects. It is easier to judge the difference between the intervention and control means when the scores are on a 0-100 scale than when they are represented in a less intuitive metric. Thus a 5-point difference on a 0-100 scale might be easier to interpret than a 5-point difference on a raw score metric that ranges from, e.g., 143 to 240. NCE scores also preserve the advantage of standard scores described above of allowing implicit comparisons with the performance of the norming sample. Thus mean scores over 50 show better performance than the comparable norming sample, and mean scores under 50 show poorer performance.
Although standard scores, and NCE scores in particular, offer a number of advantages as a metric with which to describe intervention effects, they have several limitations. First, standard scores are all derived from the norming sample obtained by the developer of the measure. Thus these scores assume that sample is representative of the population of interest to the intervention study and that the samples in the study, in turn, are representative of the norming sample. These assumptions could easily be false for intervention studies that focus on populations distinctly appropriate for the intervention of interest. Similar discrepancies could arise for any age-adjusted standard score if the norming measures and the intervention measures were administered at very different times during the school year; differences could then be the result of predictable growth over the course of that year (Hills 1984).
Grade Equivalent Scores
A grade equivalent (GE) is a developmental score reported for many norm-referenced tests that characterizes students' achievement in terms of the grade level of the students in the normative sample with similar performance on that test. Grade equivalent scores are based on the nine-month school year and are represented in terms of the grade level and number of full months within a nine-month school year. A GE score thus corresponds to the mean level of performance at a certain point in time in the school year for a given grade. The grade level is represented by the first number in the GE score, and the month of the school year follows after a period, with months ranging from a value of 0 (September) to 9 (June). A GE of 6.2, for example, represents the score that would be achieved by an average student in the sixth grade after completion of the second full month of school. The difference between GE scores of 5.2 (November of grade 5) and 6.2 (November of grade 6) represents one calendar year's growth or change in performance.
The GE score for an individual student in a given grade, or the mean for a sample of students, is inherently comparative. A GE score that differs from the grade level a student is actually in indicates performance better or worse than that of the average students at that same grade level in the norming sample. If the mean GE for a sample of students tested near the end of the fourth grade is 5.3, for instance, these students are performing at the average level of students tested in December of the fifth grade in the norming sample; that is, they are performing better than expected for their actual grade level. Conversely, if their mean GE is 4.1, they are performing below what is expected for their actual grade level. These comparisons, of course, assume that the norming sample is representative of the population from which the research sample is drawn.
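Because GE scores encode grade and month in a single decimal number, gaps between them can be expressed in month units. A minimal sketch, assuming each 0.1 GE unit is treated as one month on the grade-equivalent scale (the function name and the example values are illustrative):

```python
def ge_gap_in_months(ge_a: float, ge_b: float) -> int:
    """Difference between two GE scores in month units, treating the digit
    after the decimal point as the month (0-9) of the school year."""
    to_units = lambda ge: round(ge * 10)
    return to_units(ge_a) - to_units(ge_b)

# A sample with mean GE 5.3 tested at roughly GE 4.8 of the school calendar
print(ge_gap_in_months(5.3, 4.8))  # -> 5 (performing 5 months ahead)
print(ge_gap_in_months(6.2, 5.2))  # -> 10 (one calendar year, as in the text)
```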
Intervention effects are represented in terms of GE scores simply as the difference between the intervention and control sample means expressed in GE units. For example, in a study of Success for All (SFA), a comprehensive reading program, Slavin and colleagues (1996) compared the mean GE scores for the intervention and control samples on a reading measure used as a key outcome variable. Though the numerical differences in mean GE scores between the samples were not reported, Figure 6 shows the approximate magnitudes of those differences. The fourth grade students in SFA, for instance, scored on average about 1.8 GE ahead of the control sample, indicating that their performance was closer to that of the mean for fourth graders in the norming sample than that of the control group. Note also that the mean GE for each sample taken by itself identifies the group's performance level relative to the normative sample. The fourth grade control sample mean, at 3.0, indicates that, on average, these students were not performing up to the grade level mean in the norming sample, whereas the SFA sample, by comparison, was almost to grade level.
Figure 6. Mean reading grade equivalent (GE) scores of Success for All and control samples [adapted from Slavin et al. 1996]
[Bar chart: mean reading GE score (1.0 to 5.0) by grade for the SFA and control samples]
The GE score is often used to communicate with educators and parents because of its simplicity and inherent meaningfulness. It makes performance relative to the norm and the magnitude of intervention effects easy to understand. Furthermore, when used to index change over time, the GE score is an intuitive way to represent growth in a student's achievement. The simplic