Translating the statistical representation of the effects of education interventions
7/30/2019 Translating the statistical representation of the effects of education interventions
1/54
NCSER 2013-3000
U.S. DEPARTMENT OF EDUCATION
Translating the Statistical Representation of the Effects of Education Interventions Into More Readily Interpretable Forms
NOVEMBER 2012
Mark W. Lipsey, Kelly Puzio, Cathy Yun, Michael A. Hebert, Kasia Steinka-Fry,
Mikel W. Cole, Megan Roberts, Karen S. Anthony, and Matthew D. Busick
Translating the Statistical Representation of the Effects of Education Interventions Into More Readily Interpretable Forms
NOVEMBER 2012
Mark W. Lipsey, Peabody Research Institute, Vanderbilt University
Kelly Puzio, Department of Teaching and Learning, Washington State University
Cathy Yun, Vanderbilt University
Michael A. Hebert, Department of Special Education and Communication Disorders, University of Nebraska-Lincoln
Kasia Steinka-Fry, Peabody Research Institute, Vanderbilt University
Mikel W. Cole, Eugene T. Moore School of Education, Clemson University
Megan Roberts, Hearing & Speech Sciences Department, Vanderbilt University
Karen S. Anthony, Vanderbilt University
and
Matthew D. Busick, Vanderbilt University
NCSER 2013-3000
U.S. DEPARTMENT OF EDUCATION
This report was prepared for the National Center for Special Education Research, Institute of Education Sciences under Contract ED-IES-09-C-0021.

Disclaimer

The Institute of Education Sciences (IES) at the U.S. Department of Education contracted with Command Decisions Systems & Solutions to develop a report that assists with the translation of effect size statistics into more readily interpretable forms for practitioners, policymakers, and researchers. The views expressed in this report are those of the authors and they do not necessarily represent the opinions and positions of the Institute of Education Sciences or the U.S. Department of Education.
U.S. Department of Education
Arne Duncan, Secretary

Institute of Education Sciences
John Q. Easton, Director

National Center for Special Education Research
Deborah Speece, Commissioner
November 2012
This report is in the public domain. While permission to reprint this publication is not necessary, the citation should be: Lipsey, M.W., Puzio, K., Yun, C., Hebert, M.A., Steinka-Fry, K., Cole, M.W., Roberts, M., Anthony, K.S., & Busick, M.D. (2012). Translating the Statistical Representation of the Effects of Education Interventions into More Readily Interpretable Forms (NCSER 2013-3000). Washington, DC: National Center for Special Education Research, Institute of Education Sciences, U.S. Department of Education. This report is available on the IES website at http://ies.ed.gov/ncser/.
Alternate Formats

Upon request, this report is available in alternate formats such as Braille, large print, audiotape, or computer diskette. For more information, please contact the Department's Alternate Format Center at 202-260-9895 or 202-205-8113.
Disclosure of Potential Conflicts of Interest

There are nine authors for this report with whom IES contracted to develop the discussion of the issues presented. Mark W. Lipsey, Cathy Yun, Kasia Steinka-Fry, Megan Roberts, Karen S. Anthony, and Matthew D. Busick are employees or graduate students at Vanderbilt University; Kelly Puzio is an employee at Washington State University; Michael A. Hebert is an employee at the University of Nebraska-Lincoln; and Mikel W. Cole is an employee at Clemson University. The authors do not have financial interests that could be affected by the content in this report.
Contents
List of Tables
List of Figures
Introduction
    Organization and Key Themes
Inappropriate and Misleading Characterizations of the Magnitude of Intervention Effects
Representing Effects Descriptively
    Configuring the Initial Statistics that Describe an Intervention Effect to Support Alternative Descriptive Representations
        Covariate Adjustments to the Means on the Outcome Variable
        Identifying or Obtaining Appropriate Effect Size Statistics
    Descriptive Representations of Intervention Effects
        Representation in Terms of the Original Metric
        Standard Scores and Normal Curve Equivalents (NCE)
        Grade Equivalent Scores
Assessing the Practical Significance of Intervention Effects
    Benchmarking Against Normative Expectations for Academic Growth
    Benchmarking Against Policy-Relevant Performance Gaps
    Benchmarking Against Differences Among Students
    Benchmarking Against Differences Among Schools
    Benchmarking Against the Observed Effect Sizes for Similar Interventions
    Benchmarking Effects Relative to Cost
        Calculating Total Cost
        Cost-effectiveness
        Cost-benefit
References
List of Tables
1. Pre-post change differentials that result in the same posttest difference
2. Upper percentiles for selected differences or gains from a lower percentile
3. Proportion of intervention cases above the mean of the control distribution
4. Relationship of the effect size and correlation coefficient to the BESD
5. Annual achievement gain: Mean effect sizes across seven nationally-normed tests
6. Demographic performance gaps on mean NAEP scores as effect sizes
7. Demographic performance gaps on SAT 9 scores in a large urban school district as effect sizes
8. Performance gaps between average and weak schools as effect sizes
9. Achievement effect sizes from randomized studies broken out by type of test and grade level
10. Achievement effect sizes from randomized studies broken out by type of intervention and target recipients
11. Estimated costs of two fictional high school interventions
12. Cost-effectiveness estimates for two fictional high school interventions
List of Figures
1. Pre-post change for the three scenarios with the same posttest difference
2. Intervention and control distributions on an outcome variable
3. Percentile values on the control distribution of the means of the control and intervention groups
4. Proportion of the control and intervention distributions scoring above an externally defined proficiency threshold score
5. Binomial effect size display: Proportion of cases above and below the grand median
6. Mean reading grade equivalent (GE) scores of Success for All and control samples [Adapted from Slavin et al. 1996]
Introduction
The superintendent of an urban school district reads an evaluation of the effects of a vocabulary building program on the reading ability of fifth graders in which the primary outcome measure was the CAT/5 reading achievement test. The mean posttest score for the intervention sample was 718 compared to 703 for the control sample. The vocabulary building program thus increased reading ability, on average, by 15 points on the CAT/5. According to the report, this difference is statistically significant, but is this a big effect or a trivial one? Do the students who participated in the program read a lot better now, or just a little better? If they were poor readers before, is this a big enough effect to now make them proficient readers? If they were behind their peers, have they now caught up?
Knowing that this intervention produced a statistically significant positive effect is not particularly helpful to the superintendent in our story. Someone intimately familiar with the CAT/5 (California Achievement Test, 5th edition; CTB/McGraw-Hill 1996) and its scoring may be able to look at these means and understand the magnitude of the effect in practical terms but, for most of us, these numbers have little inherent meaning. This situation is not unusual: the native statistical representations of the findings of studies of intervention effects often provide little insight into the practical magnitude and meaning of those effects. To communicate that important information to researchers, practitioners, and policymakers, those statistical representations must be translated into some form that makes their practical significance easier to infer. Even better would be some framework for directly assessing their practical significance.
This paper is directed to researchers who conduct and report education intervention studies. Its purpose is to stimulate and guide them to go a step beyond reporting the statistics that emerge from their analysis of the differences between experimental groups on the respective outcome variables. With what is often very minimal additional effort, those statistical representations can be translated into forms that allow their magnitude and practical significance to be more readily understood by the practitioners, policymakers, and even other researchers who are interested in the intervention that was evaluated.
Organization and Key Themes
The primary purpose of this paper is to provide suggestions to researchers about ways to present statistical findings about the effects of educational interventions that might make the nature and magnitude of those effects easier to understand. These suggestions and the related discussion are framed within the context of studies that use experimental designs to compare measured outcomes for two groups of participants, one in an intervention condition and the other in a control condition. Though this is a common and, in many ways, prototypical form for studies of intervention effects, there are other important forms. Though not addressed directly, much of what is suggested here can be applied with modest adaptation to experimental studies that compare outcomes for more than two groups or compare conditions that do not include a control (e.g., compare different interventions), and to quasi-experiments that compare outcomes for nonrandomized groups. Other kinds of intervention studies that appear in educational research are beyond the scope of this paper. Most notable among those other kinds are observational studies, e.g., multivariate
analysis of the relationship across schools between natural variation in per pupil funding and student achievement, and single case research designs such as those often used in special education contexts to investigate the effects of interventions for children with low-incidence disabilities.
The discussion in the remainder of this paper is divided into three main sections, each addressing a relatively distinct aspect of the issue. The first section examines two common, but inappropriate and misleading, ways to characterize the magnitude of intervention effects. Its purpose is to caution researchers about the problems with these approaches and provide some context for consideration of better alternatives.
The second section reviews a number of ways to represent intervention effects descriptively. The focus there is on how to better communicate the nature and magnitude of the effect represented by the difference on an outcome variable between the intervention and control samples. For example, it may be possible to express that difference in terms of percentiles or the contrasting proportions of intervention and control participants scoring above a meaningful threshold value. Represented in terms such as those, the nature and magnitude of the intervention effect may be more easily understood and appreciated than when presented as means, regression coefficients, p-values, standard errors, and the like.
The point of departure for descriptive representations of an intervention effect is the set of statistics generated by whatever analysis the researcher uses to estimate that effect. Most relevant are the means and, for some purposes, the standard deviations on the outcome variable for the intervention and control groups. Alternatively, the point of departure might be the effect size estimate, which combines information from the group means and standard deviations and is an increasingly common and frequently recommended way to report intervention effects. However, not every analysis routine automatically generates the statistics that are most appropriate for directly deriving alternative descriptive representations or for computing the effect size statistic as an intermediate step in deriving such representations. This second section of the paper, therefore, begins with a subsection that provides advice about obtaining the basic statistics that support the various representations of intervention effects that are described in the subsections that follow it.
The third section of this paper sketches some approaches that might be used to go beyond descriptive representations to more directly reveal the practical significance of an intervention effect. To accomplish that, the observed effect must be assessed in relationship to some externally defined standard, target, or frame of reference that carries information about what constitutes practical significance in the respective intervention domain. Covered in that section are approaches that benchmark effects within such frameworks as normative growth, differences between students and schools with recognized practical significance, the effects found for other similar interventions, and cost.
Inappropriate and Misleading Characterizations of
the Magnitude of Intervention Effects
Some of the most common ways to characterize the effects found in studies of educational interventions are inappropriate or misleading and thus best avoided. The statistical tests routinely applied to the difference between the means on outcome variables for intervention and control samples, for instance, yield a p value: the estimated probability that a difference that large would be found when, in fact, there was no difference in the population from which the samples were drawn. Very significant differences with, say, p < .001 are often trumpeted as if they were indicative of especially large and important effects, ones that are more significant than if p were only marginally significant (e.g., p = .10) or just conventionally significant (e.g., p = .05). Such interpretations are quite inappropriate. The p-values characterize only statistical significance, which bears no necessary relationship to practical significance or even to the statistical magnitude of the effect. Statistical significance is a function of the magnitude of the difference between the means, to be sure, but it is also heavily influenced by the sample size, the within-samples variance on the outcome variable, the covariates included in the analysis, and the type of statistical test applied. None of the latter is related in any way to the magnitude or importance of the effect.
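The influence of sample size alone can be seen in a short sketch. The numbers are hypothetical, and a simple two-sample z-test with a known common standard deviation stands in for whatever test a study would actually use: the identical mean difference is nowhere near significant with small samples yet highly significant with large ones.

```python
import math

def two_sample_z_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test (equal group sizes, known common SD)."""
    z = mean_diff / (sd * math.sqrt(2.0 / n_per_group))
    # standard normal CDF built from the error function
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# The same 10-point difference on a test with SD 40, at two sample sizes:
print(f"n = 30 per group:  p = {two_sample_z_p(10, 40, 30):.3f}")   # not significant
print(f"n = 500 per group: p = {two_sample_z_p(10, 40, 500):.6f}")  # highly significant
```

Nothing about the effect itself changed between the two calls; only the sample size did.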
When researchers go beyond simply presenting the intervention and control group means and the p-value for the significance test of their difference, the most common way to represent the effect is with a standardized effect size statistic. For continuous outcome variables, this is almost always the standardized mean difference effect size: the difference between the means on an outcome variable represented in standard deviation units. For example, a 10-point difference between the intervention and control means on a reading achievement test with a pooled standard deviation of 40 for those two samples is .25 standard deviation units, that is, an effect size of .25.
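The arithmetic in that example is just a division, using the hypothetical numbers from the text:

```python
# Standardized mean difference for the example above: a 10-point difference
# between group means on a test with a pooled standard deviation of 40.
mean_difference = 10.0
pooled_sd = 40.0
effect_size = mean_difference / pooled_sd
print(effect_size)  # -> 0.25
```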
Standardized mean difference effect sizes are a useful way to characterize intervention effects for some purposes. This effect size metric, however, has very little more inherent meaning than the simple difference between means; it simply transforms that difference into standard deviation units. Interpreting the magnitude or practical significance of an effect size requires that it be compared with appropriate criterion values or standards that are relevant and meaningful for the nature of the outcome variable, sample, and intervention condition on which it is based. We will have more to say about effect sizes and their interpretation later. We raise this matter now only to highlight a widely used but, nonetheless, misleading standard for assessing effect sizes and, at least by implication, their practical significance.
In his landmark book on statistical power, Cohen (1977, 1988) drew on his general impression of the range of effect sizes found in social and behavioral research in order to create examples of power analysis for detecting smaller and larger effects. In that context, he dubbed .20 as "small," .50 as "medium," and .80 as "large." Ever since, these values have been widely cited as standards for assessing the magnitude of the effects found in intervention research despite Cohen's own cautions about their inappropriateness for such general use. Cohen was attempting, in an unsystematic way, to describe the distribution of effect sizes one might find if one piled up all the effect sizes on all the different outcome measures for all the different interventions
targeting individual participants that were reported across the social and behavioral sciences. At that level of generality, one could take any given effect size and say it was in the low, middle, or high range of that distribution.
The problem with Cohen's broad normative distribution for assessing effect sizes is not the idea of comparing an effect size with such norms. Later in this paper we will present some norms for effect sizes from educational interventions and suggest doing just that. The problem is that the normative distribution used as a basis for comparison must be appropriate for the outcome variables, interventions, and participant samples on which the effect size at issue is based. Cohen's broad categories of small, medium, and large are clearly not tailored to the effects of intervention studies in education, much less any specific domain of education interventions, outcomes, and samples. Using those categories to characterize effect sizes from education studies, therefore, can be quite misleading. It is rather like characterizing a child's height as small, medium, or large, not by reference to the distribution of values for children of similar age and gender, but by reference to a distribution for all vertebrate mammals.
McCartney and Rosenthal (2000), for example, have shown that in intervention areas that involve hard to change, low base-rate outcomes, such as the incidence of heart attacks, the most impressively large effect sizes found to date fall well below the .20 that Cohen characterized as small. Those small effects correspond to reducing the incidence of heart attacks by about half, an effect of enormous practical significance. Analogous examples are easily found in education. For instance, many education intervention studies investigate effects on academic performance and measure those effects with standardized reading or math achievement tests. As we show later in this paper, the effect sizes on such measures across a wide range of interventions are rarely as large as .30. By appropriate norms, that is, norms based on empirical distributions of effect sizes from comparable studies, an effect size of .25 on such outcome measures is large, and an effect size of .50, which would be only medium on Cohen's all-encompassing distribution, would be more like huge.
In short, comparisons of effect sizes in educational research with normative distributions of effect sizes to assess whether they are small, middling, or large relative to those norms should use appropriate norms. Appropriate norms are those based on distributions of effect sizes for comparable outcome measures from comparable interventions targeted on comparable samples. Characterizing the magnitude of effect sizes relative to some other normative distribution is inappropriate and potentially misleading. The widespread indiscriminate use of Cohen's generic small, medium, and large effect size values to characterize effect sizes in domains to which his normative values do not apply is thus likewise inappropriate and misleading.
Representing Effects Descriptively
The starting point for descriptive representations of the effects of an educational intervention is the set of native statistics generated by whatever analysis scheme has been used to compare outcomes for the participants in the intervention and control conditions. Those statistics may or may not provide a valid estimate of the intervention effect. The quality of that estimate will depend on the research design, sample size, attrition, reliability of the outcome measure, and a host of other such considerations. For purposes of this discussion, we assume that the researcher begins with a credible estimate of the intervention effect and consider only alternate representations or translations of the native statistics that initially describe that effect.
A closely related alternative starting point for a descriptive representation of an intervention effect is the effect size estimate. Although the effect size statistic is not itself much easier to interpret in practical terms than the native statistics on which it is based, it is useful for other purposes. Most notably, its standardized form (i.e., representing effects in standard deviation units) allows comparison of the magnitude of effects on different outcome variables and across different studies. It is thus well worth computing and reporting in intervention studies but, for present purposes, we include it among the initial statistics for which an alternative representation would be more interpretable by most users.
In the following parts of this section of the paper, we first provide advice for configuring the native statistics generated by common analyses in a form appropriate for supporting alternate descriptive representations. We include in that discussion advice for configuring the effect size statistic as well in a few selected situations that often cause confusion.
Configuring the Initial Statistics that Describe an Intervention Effect to Support Alternative Descriptive Representations
Covariate Adjustments to the Means on the Outcome Variable
Several of the descriptive representations of intervention effects described later are derived directly from the means and perhaps the standard deviations on the outcome variable for the intervention and control groups. However, the observed means for the intervention and control groups may not be the best choice for representing an intervention effect. The difference between those means reflects the effect of the intervention, to be sure, but it may also reflect the influence of any initial baseline differences between the intervention and control groups. The value of random assignment to conditions, of course, is that it permits only chance differences at baseline, but this does not mean there will be no differences, especially if the samples are not large. Moreover, attrition from posttest measurement undermines the initial randomization so that estimates of effects may be based on subsets of the intervention and control samples that are not fully equivalent on their respective baseline characteristics even if the original samples were.
Researchers often attempt to adjust for such baseline differences by including the respective baseline values as covariates in the analysis. The most common and useful covariate is the pretest for the outcome measure, along with basic demographic variables such as age, gender, ethnicity, socioeconomic status, and the like.
Indeed, even when there are no baseline differences to account for, the value of such covariates (especially the pretest) for increasing statistical power is so great that it is advisable to routinely include any covariates that have substantial correlations with the posttest (Rausch, Maxwell, and Kelley 2003). With covariates included in the analysis, the estimate of the intervention effect is the difference between the covariate-adjusted means of the intervention and control samples. These adjusted values better estimate the actual intervention effect by reducing any bias from the baseline differences and thus are the best choices for use in any descriptive representation of that effect. When that representation involves the standard deviations, however, their values should not be adjusted for the influence of the covariates. In virtually all such instances, the standard deviations are used as estimates of the corresponding population standard deviations on the outcome variables without consideration for the particular covariates that may have been used in estimating the difference on the means.
When the analysis is conducted in analysis of covariance format (ANCOVA), most statistical software has an option for generating the covariate-adjusted means. When the analysis is conducted in multiple regression format, the unstandardized regression coefficient for the intervention dummy code (intervention = 1, control = 0; or +0.5 vs. -0.5) is the difference between the covariate-adjusted means. In education, analyses of intervention effects are often multilevel when the outcome of interest is for students or teachers who, in turn, are nested within classrooms, schools, or districts. Using multilevel regression analysis, e.g., HLM, does not change the situation with regard to the estimate of the difference between the covariate-adjusted means: it is still the unstandardized regression coefficient on the intervention dummy code. The unadjusted standard deviations for the intervention and control groups, in turn, can be generated directly by most statistical programs, though that option may not be available within the ANCOVA, multiple regression, or HLM routine itself.
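A minimal pure-Python sketch of the regression point above, using hypothetical data and ordinary least squares solved via the normal equations (a real analysis would of course use a statistics package): the unstandardized coefficient on the 0/1 intervention dummy equals the covariate-adjusted mean difference.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y,
    solved with Gauss-Jordan elimination and partial pivoting."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    A = [row[:] + [b] for row, b in zip(XtX, Xty)]  # augmented matrix
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c and A[r][c] != 0:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical data; columns of X are: intercept, intervention dummy, pretest.
pre  = [10, 12, 14, 16, 11, 13, 15, 17]
trt  = [0, 0, 0, 0, 1, 1, 1, 1]
post = [20, 25, 27, 32, 28, 31, 35, 39]
X = [[1.0, t, p] for t, p in zip(trt, pre)]
b0, b_treat, b_pre = ols(X, post)
print(round(b_treat, 3))  # -> 5.375, the covariate-adjusted mean difference
```

Here the raw difference in posttest means is 33.25 - 26.0 = 7.25, but the pretest-adjusted difference estimated by the dummy coefficient is 5.375, reflecting the baseline advantage given to the hypothetical intervention group.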
For binary outcomes, such as whether students are retained in grade, placed in special education status, or pass an exam, the analytic model is most often logistic regression, a specialized variant of multiple regression for binary dependent variables. The regression coefficient (β) in a logistic regression for the dummy-coded variable representing the experimental condition (e.g., 1 = intervention, 0 = control) is a covariate-adjusted log odds ratio representing the intervention effect (Crichton 2001). Unlogging it (exp β) produces the covariate-adjusted odds ratio for the intervention effect, which can then be converted back into the terms of the original metric.
For example, an intervention designed to improve the passing rate on an algebra exam might produce the results shown below. The odds of passing for a given group are defined as the ratio of the number (or proportion) who pass to the number (or proportion) who fail. For the intervention group, therefore, the odds of passing are 45/15 = 3.0 and, for the control group, the odds are 30/30 = 1.0. The odds ratio characterizing the intervention effect is the ratio of these two values, that is, 3/1 = 3, and indicates that the odds of passing are three times greater for a student in the intervention group than for one in the control group.
Passed Failed
Intervention 45 15
Control 30 30
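The odds and odds-ratio arithmetic just described can be sketched directly from the table:

```python
# Odds and odds ratio for the 2x2 table in the text
# (intervention: 45 passed / 15 failed; control: 30 passed / 30 failed).
def odds(passed, failed):
    return passed / failed

odds_intervention = odds(45, 15)  # 3.0
odds_control = odds(30, 30)       # 1.0
odds_ratio = odds_intervention / odds_control
print(odds_ratio)  # -> 3.0
```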
Suppose the researcher analyzes these outcomes in a logistic regression model with race, gender, and prior math achievement scores included as covariates to control for initial differences between the two groups. If the coefficient on the intervention variable in that analysis, converted to a covariate-adjusted odds ratio, turns out to be 2.53, it indicates that the unadjusted odds ratio overestimated the intervention effect because of baseline differences that favored the intervention group. With this information, the researcher can construct a covariate-adjusted version of the original 2x2 table that estimates the proportions of students passing in each condition when the baseline differences are taken into account. To do this, the frequencies for the control sample and the total N for the intervention sample are taken as given. We then want to know what passing frequency, p, for the intervention group allows the odds ratio, (p x 30)/((60 - p) x 30), to equal 2.53. Solving for p reveals that it must be 43. The covariate-adjusted results, therefore, are as shown below. Described as simple percentages, the covariate-adjusted estimate is that the intervention increased the 50% pass rate of the control condition to 72% (43/60) in the intervention condition.
Passed Failed
Intervention 43 17
Control 30 30
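The adjusted passing frequency can be recovered algebraically rather than by trial and error: with the control cell fixed at 30/30, the adjusted odds ratio reduces to p/(60 - p), so p = OR x 60 / (1 + OR). A minimal sketch (the odds ratio 2.53 and group size 60 come from the example above; the function name is illustrative):

```python
def adjusted_pass_count(odds_ratio, n_group, control_odds=1.0):
    """Passing frequency p such that (p/(n_group - p)) / control_odds
    equals the given covariate-adjusted odds ratio."""
    target_odds = odds_ratio * control_odds
    return target_odds * n_group / (1.0 + target_odds)

p = adjusted_pass_count(2.53, 60)  # control odds are 30/30 = 1.0
print(round(p))  # 43
```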
Identifying or Obtaining Appropriate Effect Size Statistics
A number of the ways of representing intervention effects and assessing their practical significance described
later in this paper can be derived directly from the standardized mean difference effect size statistic,
commonly referred to simply as the effect size. This effect size is defined as the difference between the mean
of the intervention group and the mean of the control group on a given outcome measure divided by the
pooled standard deviations for those two groups, as follows:

ES = (X̄_T - X̄_C) / s_p

where X̄_T is the mean of the intervention sample on an outcome variable, X̄_C is the mean of the control
sample on that variable, and s_p is the pooled standard deviation. The pooled standard deviation is obtained as
the square root of the weighted mean of the two variances, defined as:

s_p = √[((n_T - 1)s_T² + (n_C - 1)s_C²) / (n_T + n_C - 2)]

where n_T and n_C are the number of respondents in the intervention and control groups, and s_T and s_C are
the respective standard deviations on the outcome variable for the intervention and control groups.
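The effect size and pooled standard deviation defined above translate directly into code. This is a minimal sketch; the sample statistics plugged in at the end are hypothetical, chosen only to exercise the formulas:

```python
import math

def pooled_sd(s_t, s_c, n_t, n_c):
    """Square root of the weighted mean of the two group variances."""
    return math.sqrt(((n_t - 1) * s_t**2 + (n_c - 1) * s_c**2) / (n_t + n_c - 2))

def effect_size(mean_t, mean_c, s_t, s_c, n_t, n_c):
    """Standardized mean difference: (intervention mean - control mean) / pooled SD."""
    return (mean_t - mean_c) / pooled_sd(s_t, s_c, n_t, n_c)

# Hypothetical example: equal-sized groups with similar spread
es = effect_size(mean_t=105.0, mean_c=100.0, s_t=14.0, s_c=16.0, n_t=50, n_c=50)
print(round(es, 2))  # 0.33
```

With covariates in the analysis, the numerator would instead be the covariate-adjusted mean difference, while the denominator stays the unadjusted pooled standard deviation, as discussed below.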
The effect size is typically reported to two decimal places and, by convention, has a positive value when the
intervention group does better on the outcome measure than the control group and a negative sign when it
does worse. Note that this may not be the same sign that results from subtraction of the control mean from
the intervention mean. For example, if low scores represent better performance, e.g., as with a measure of the
number of errors made, then subtraction will yield a negative value when the intervention group performs
better than the control, but the effect size typically would be given a positive sign to indicate the better
performance of the intervention group.

Effect sizes can be computed or estimated from many different kinds of statistics generated in intervention
studies. Informative sources for such procedures include the What Works Clearinghouse Procedures and
Standards Handbook (2011; Appendix B) and Lipsey and Wilson (2001; Appendix B). Here we will
only highlight a few features that may help researchers identify or configure appropriate effect sizes for
use in deriving alternative representations of intervention effects. Moreover, many of these features have
implications for statistics other than the effect size that are involved in some representations of intervention
effects.
Clear understanding of what the numerator and denominator of the standardized mean difference effect size
represent will allow many common mistakes and confusions in the computation and interpretation of effect
sizes to be avoided. The numerator of the effect size estimates the difference between the experimental groups
on the means of the outcome variable that is attributable to the intervention. That is, the numerator should
be the best estimate available of the mean intervention effect estimated in the units of the original metric.
As described in the previous subsection, when researchers include baseline covariates in the analysis, the best
estimate of the intervention effect is the difference between the covariate-adjusted means on the outcome
variable, not the difference between the unadjusted means.

The purpose of the denominator of the effect size is to standardize the difference between the outcome
means in the numerator into metric-free standard deviation units. The concept of standardization is critical
here. Standardization means that each effect size is represented in the same way, i.e., in a standard way,
irrespective of the outcome construct, the way it is measured, or the way it is analyzed. The sample standard
deviations used for this purpose estimate the corresponding population standard deviations on the outcome
measure. As such, the standard deviations should not be adjusted by any covariates that happened to be used
in the design or analysis of the particular study. Such adjustments would not have general applicability to
other designs and measures and thus would compromise the standardization that is the point of representing
the intervention effect in standard deviation units. This means that the raw standard deviations for the
intervention and control samples should be pooled into the effect size denominator, even when multilevel
analysis models with complex variance structures are used.
Pooling the sample standard deviations for the intervention and control groups is intended to provide the
best possible estimate of the respective population standard deviation by using all the data available. This
procedure assumes that both those standard deviations estimate a common population standard deviation.
This is the homogeneity of variance assumption typically made in the statistical analysis of intervention
effects. If homogeneity of variance cannot be assumed, then consideration has to be given to the reason why
the intervention and control group variances differ. In a randomized experiment, this should not occur on
outcome variables unless the intervention itself affects the variance in the intervention condition. In that
case, the better estimate may be the standard deviation of the control group even though it is estimated on a
smaller sample than the pooled version.
In the multilevel situations common in education research, a related matter has to do with the population
that is relevant for purposes of standardizing the intervention effect. Consider, for example, outcomes on an
achievement test that is, or could be, used nationally. The variance for the national population of students
can be partitioned into between and within components according to the different units represented at
different levels. Because state education systems differ, we might first distinguish between-state and within-
state variance. Within states, there would be variation between districts; within districts, there would
be variation between schools; within schools, there would be variation between classrooms; and within
classrooms, there would be variation between students. The total variance for the national population can
thus be decomposed as follows (Hedges 2007):

σ²_total = σ²_between-states + σ²_between-districts + σ²_between-schools + σ²_between-classrooms + σ²_between-students
In an intervention study using a national sample, the sample estimate of the standard deviation includes all
these components. Any effect size computed with that standard deviation is thus standardizing the effect size
with the national population variance as the reference value. The standard deviation computed in a study
using a sample of students from a single classroom, on the other hand, estimates only the variance of the
population of students who might be in that classroom in that school in that district in that state. In other
words, this standard deviation does not include the between-classroom, between-school, between-district,
and between-state components that would be included in the estimate from a national sample. Similarly,
an intervention study that draws its sample from one school, or one district, will yield a standard deviation
estimate that is implicitly using a narrower population as the basis for standardization than a study with a
broader sample. This will not matter if there are no systematic differences on the respective outcome measure
between students in different states, districts, schools, and classrooms, i.e., if those variance components are
zero. With student achievement measures, we know this is generally not the case (e.g., Hedges and Hedberg
2007). Less evidence is available for other measures used in education intervention studies, but it is likely
that most of them also show nontrivial differences between these different units and levels.
Any researcher computing effect sizes for an intervention study or using them as a basis for alternative
representations of intervention effects should be aware of this issue. Effect sizes based on samples of narrower
populations will be larger than effect sizes based on broader samples even when the actual magnitudes of the
intervention effects are identical. And that difference will be carried through to any other representation of
the intervention effect that is based on the effect size. Compensating for that difference, if appropriate, will
require adding or subtracting estimates of the discrepant variance components, with the possibility that those
components will have to be estimated from sources outside the research sample itself.
The discussion above assumes that the units on which the sample means and standard deviations are
computed for an outcome variable are individuals, e.g., students. The nested data structures common in
education intervention studies, however, provide different units on which means and standard deviations
can be computed, e.g., students, clusters of students in classrooms, and clusters of classrooms in schools.
For instance, in a study of a whole school intervention aimed at improving student achievement, with
some schools assigned to the intervention condition and others to the control, there are two effect sizes the
researcher could estimate. The conventional effect size would standardize the intervention effect estimated
on student scores using the pooled student-level standard deviations. Alternatively, the student-level scores
might be aggregated to the school level and the school-level means could be used to compute an effect size.
That effect size would represent the intervention effect in standard deviation units that reflect the variance
between schools, not that between students. The result is a legitimate effect size, but the school units on
which it is based make this effect size different from the more conventional effect size that is standardized on
variation between individuals.
The numerators of these two effect sizes would not necessarily differ greatly. The respective means of the
student scores in the intervention and control groups would be similar to the means of the school-level
means for those same students unless the number of students in each school differs greatly and is correlated
with the school means. However, the standard deviations will be quite different because the variance
between schools is only one component of the total variance between students. Between-school variance on
achievement test scores is typically around 20-25% of the total variance, the intraclass correlation coefficient
(ICC) for schools (Hedges and Hedberg 2007). The between-schools standard deviation thus will be about
√.25 = .50 or less of the student-level standard deviation and the effect size based on school units will be
about twice as large as the effect size based on students as the units even though both describe the same
intervention effect.
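The relationship just described can be made concrete: if the ICC gives the share of total variance lying between schools, the between-school standard deviation is √ICC times the student-level standard deviation, so an effect size standardized on between-school variation is the student-level effect size divided by √ICC. A minimal sketch under an assumed ICC of .25 (the function name is illustrative):

```python
import math

def school_level_effect_size(student_es, icc):
    """Effect size standardized on the between-school SD, given the
    student-level effect size and the intraclass correlation (the share
    of total variance that lies between schools)."""
    return student_es / math.sqrt(icc)

# With ICC = .25, the between-school SD is sqrt(.25) = .50 of the
# student-level SD, so the school-level effect size is twice as large.
print(school_level_effect_size(0.20, 0.25))  # 0.4
```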
Similar situations arise in multilevel samples whenever the units on which the outcome is measured are
nested within higher level clusters. Each such higher level cluster allows for its own distinctive effect size
to be computed. A researcher comparing effect sizes in such situations or, more to the point for present
purposes, using an effect size to derive other representations of intervention effects, must know which effect
size is being used. An effect size standardized on a between-cluster variance component will nearly always be
larger than the more conventional effect size standardized on the total variance across the lower level units
on which the outcome was directly measured. That difference in numerical magnitude will then be carried
into any alternate representation of the intervention effect based on that effect size and the results must be
interpreted accordingly.
Descriptive Representations of Intervention Effects
Representation in Terms of the Original Metric
Before looking at different ways of transforming the difference between the means of the intervention
and control samples into a different form, we should first consider those occasional situations in which
differences on the original metric are easily understood without such manipulations. This occurs when the
units on the measure are sufficiently familiar and well defined that little further description or interpretation
is needed. For example, an outcome measure for a truancy reduction program might be the proportion
of days on which attendance was expected for which the student was absent. The outcome score for each
student, therefore, is a simple proportion and the corresponding value for the intervention or control
groups is the mean proportion of days absent for the students in that group. Common events of this sort in
education that can be represented as counts or proportions include dropping out of school, being expelled
or suspended, being retained in grade, being placed in special education status, scoring above a proficiency
threshold on an achievement test, completing assignments, and so forth.
Intervention effects on outcome measures that involve well recognized and easily understood events can
usually be readily interpreted in their native form by researchers, practitioners, and policymakers. Some
caution is warranted, nevertheless, in presenting the differences between intervention and control groups
in terms of the proportions of such events. Differences between proportions can have different implications
depending on whether those differences are viewed in absolute or relative terms. Consider, for example,
a difference of three percentage points between the intervention and control groups in the proportion
suspended during the school year. Viewed in absolute terms, this appears to be a small difference. But relative
to the suspension rate for the control sample a three-point decrease might be substantial. If the suspension
rate for the control sample is only 5%, for instance, a decrease of three percentage points reduces that rate by
more than half. On the other hand, if the control sample has a suspension rate of 40%, a reduction of three
percentage points might rightly be viewed as rather modest.
In some contexts, the numerical values on an outcome measure that does not represent familiar events may
still be sufficiently familiar that differences are well understood despite having little inherent meaning.
This might be the case, for instance, with widely used standardized tests. For example, the Peabody Picture
Vocabulary Test (PPVT; Dunn and Dunn 2007), one of the most widely used tests in education, is normed
so that standard scores have a mean of 100 for the general population of children at any given age. Many
researchers and educators have sufficient experience with this test to understand what scores lower or
higher than 100 indicate about children's skill level and how much of an increase constitutes a meaningful
improvement. Generally speaking, however, such familiarity with the scoring of a particular measure of this
sort is not widespread and most audiences will need more information to be able to interpret intervention
effects expressed in terms of the values generated by an outcome measure.
Intervention Effects in Relation to Pre-Post Change
When pretest measures of an outcome variable are available, the pretest means may be used to provide an
especially informative representation of intervention effects using the original metric. This follows from the
fact that the intent of interventions is to bring about change in the outcome; that is, change between pretest
and posttest. The full representation of the intervention effect, therefore, is not simply the difference between
the intervention and control samples on the outcome measure at posttest, but the differential change
between pretest and posttest on that outcome. By showing effects as differential change, the researcher
reveals not only the end result but the patterns of improvement or decline that characterize the intervention
and control groups.
Consider, for example, a program that instructs middle school students in conflict resolution techniques
with the objective of decreasing interpersonal aggression. Student surveys at the beginning and end of the
school year are administered for intervention and control schools that provide composite scores for the
amount of physical, verbal, and relational aggression students experience. These surveys show significantly
lower levels for the intervention schools than the control schools, indicating a positive effect of the conflict
resolution program, say a mean of 23.8 for the overall total score for the students in the intervention
schools and 27.4 for the students in the control schools. That 3.6-point favorable outcome difference,
however, could have come from any of a number of different patterns of change over the school year for
the intervention and control schools. Table 1 below shows some of the possibilities, all of which assume an
effective randomization so that the pretest values at the beginning of the school year were virtually identical
for the intervention and control schools.
Table 1. Pre-post change differentials that result in the same posttest difference
              Scenario A            Scenario B            Scenario C
              Pretest   Posttest    Pretest   Posttest    Pretest   Posttest
Intervention  25.5      23.8        17.7      23.8        22.9      23.8
Control       25.6      27.4        17.6      27.4        23.0      27.4
As can be seen even more clearly in Figure 1, for Scenario A the aggression levels decreased somewhat in the
intervention schools while increasing in the control schools. In Scenario B, the aggression levels increased
quite a bit (at least relative to the intervention effect) in both samples, but the amount of the increase was
not as great in the intervention schools as the control schools. In Scenario C, on the other hand, there was
little change in the reported level of aggression over the course of the year in the intervention schools, but
things got much worse during that time in the control schools. These different patterns of differential pre-
post change depict different trajectories for aggression absent intervention and give different pictures of
what it is that the intervention accomplished. In Scenario A it reversed the trend that would have otherwise
occurred. In Scenario B, it ameliorated an adverse trend, but did not prevent it from getting worse. In
Scenario C, the intervention did not produce appreciable improvement over time, but kept the amount of
aggression from getting worse.
Figure 1. Pre-post change for the three scenarios with the same posttest difference

[Three line charts, one for each of Scenarios A, B, and C, each plotting the intervention and control group
means at pretest and posttest on a vertical axis running from 15 to 31.]
As this example illustrates, a much fuller picture of the intervention effect is provided when the difference
between the intervention and control samples on the outcome variable is presented in relation to where those
samples started at the pretest baseline. A finer point can be put on the differential change for the
intervention and control groups, if desired, by proportioning the intervention effect against the control
group pre-post change. In Scenario B above, for instance, the difference between the control group's pretest
and posttest composite aggression scores is 9.8 (27.4 - 17.6) while the posttest difference between the
intervention and control group is -3.6 (23.8 - 27.4). The intervention, therefore, reduced the pre-post
increase in the aggression score by 36.7% (-3.6/9.8).
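The Scenario B calculation can be expressed as a small helper; a minimal sketch using the means from Table 1 (the function name is illustrative, and equivalent pretest means are assumed, as under effective randomization):

```python
def pct_of_control_change(pre_c, post_c, post_t):
    """Intervention effect expressed as a percentage of the control group's
    pre-post change, assuming equivalent pretest means in the two groups."""
    control_change = post_c - pre_c   # e.g., 27.4 - 17.6 = 9.8
    effect = post_t - post_c          # e.g., 23.8 - 27.4 = -3.6
    return 100.0 * effect / control_change

# Negative value: the intervention reduced the pre-post increase by 36.7%
print(round(pct_of_control_change(17.6, 27.4, 23.8), 1))  # -36.7
```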
Overlap Between Intervention and Control Distributions
If the distributions of scores on an outcome variable were plotted separately for the intervention and
control samples, they might look something like Figure 2 below. The magnitude of the intervention effect
is represented directly by the difference between the means of the two distributions. The standardized
mean difference effect size, discussed earlier, also represents the difference between the two means, but
does so in standard deviation units. Still another way to represent the difference between the outcomes
for the intervention and control groups is in terms of the overlap between their respective distributions.
When the difference between the means is larger, the overlap is smaller; when the difference between the
means is smaller, the overlap is larger. The amount of overlap, in turn, can be described in terms of the
proportion of individuals in each distribution who are above or below a specified reference point on one
of the distributions. Proportions of this sort are often easy to understand and appraise and, therefore, may
help communicate the magnitude of the effect. Various ways to take advantage of this circumstance in the
presentation of intervention effects are described below.
Figure 2. Intervention and control distributions on an outcome variable
Intervention Effects Represented in Percentile Form
For outcomes assessed with standardized measures that have norms, a simple representation of the
intervention effect is to characterize the means of the control and intervention samples according to the
percentile values they represent in the norming distribution. For a normed, standardized measure, these
values are often provided as part of the scoring scheme for that measure. On a standardized math
achievement test, for instance, suppose that the mean of the control sample fell at the 47th percentile
according to the test norms for the respective age group and the mean of the intervention sample fell at
the 52nd percentile. This tells us, first, that the mean outcome for the control sample is somewhat below
average performance (50th percentile) relative to the norms and, second, that the effect of the intervention
was to improve performance to the point where it was slightly above average. In addition, we see that
the corresponding increase was 5 percentile points. That 5-percentile-point difference indicates that the
individuals receiving the intervention, on average, have now caught up with the 5% of the norming
population that otherwise scored just above them.
In their study of Teach For America, Decker, Mayer, and Glazerman (2004) used percentiles in this way to
characterize the statistically significant effect they found on student math achievement scores. Decker et al.
also reported the pretest means as percentiles so that the relative gain of the intervention sample was evident.
This representation revealed that the students in the control sample were at the 15th percentile at both the
pretest and posttest whereas the intervention sample gained 3 percentiles by moving from the 14th to the
17th percentile.
It should be noted that the percentile differences on the norming distribution that are associated with a
given difference in scores will vary according to where the scores fall in the distribution. Table 2 shows the
percentile levels for the mean score of the lower scoring experimental group (e.g., control group when its
mean score is lower than that of the treatment group) in the first column. The numbers in the body of the
table then show the corresponding percentile level of the other group (e.g., treatment) that are associated
with a range of score differences represented in standard deviation units (which therefore means these
differences can also be interpreted as standardized mean difference effect sizes). As shown there, if one group
scores at the 50th percentile and the other has a mean score that is .50 standard deviations higher, that
higher group will be at the 69th percentile for a difference of 19 in the percentile ranking. A .50 standard
deviation difference between a group scoring at the 5th percentile and a higher scoring group will put that
other group at the 13th percentile for a difference of only 8 in the percentile ranking. This same pattern for
differences between two groups also applies to pre-post gains for one group. Researchers should therefore be
aware that intervention effects represented as percentile differences or gains on a normative distribution can
look quite different depending on whether the respective scores fall closer to the middle or the extremes of
the distribution.
Table 2. Upper percentiles for selected differences or gains from a lower percentile

Lower         Difference or Gain in Standard Deviations
Percentile     .10    .20    .50    .80   1.00
5th              6      7     13     20     26
10th            12     14     22     32     39
15th            17     20     30     41     48
25th            28     32     43     54     62
50th            54     58     69     79     84
75th            78     81     88     93     95
85th            87     89     94     97     98
90th            92     93     96     98     99
95th            96     97     98     99     99

NOTE: Table adapted from Albanese (2000).
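Assuming normal distributions, the entries in Table 2 follow directly from the normal CDF: the upper percentile is Φ(Φ⁻¹(lower percentile) + d) for a difference or gain of d standard deviations. A minimal sketch using Python's standard-library NormalDist:

```python
from statistics import NormalDist

def upper_percentile(lower_pct, gain_sd):
    """Percentile reached after a difference or gain of gain_sd standard
    deviations, starting from lower_pct, assuming a normal distribution."""
    z = NormalDist().inv_cdf(lower_pct / 100.0)
    return 100.0 * NormalDist().cdf(z + gain_sd)

# Reproduces Table 2 entries
print(round(upper_percentile(50, 0.50)))  # 69
print(round(upper_percentile(5, 0.50)))   # 13
print(round(upper_percentile(50, 1.00)))  # 84
```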
A similar use of percentiles can be applied to the outcome scores in the intervention and control groups
when those scores are not referenced to a norming distribution. The distribution of scores for the control
group, which represents the situation in the absence of any influence from the intervention under study,
can play the role of the norming distribution in this application. The proportion of scores falling below the
control group and intervention group means can then be transformed into the corresponding percentile
values on the control distribution. These values can be obtained from the cumulative frequency tables that
most statistical analysis computer programs readily produce for the values on any variable. For a symmetrical
distribution, the mean of the control sample will be at the 50th percentile (the median). The mean score
for the intervention sample can then be represented in terms of its percentile value on that same control
distribution. Thus we may find that the mean for the intervention group falls at the 77th percentile of the
control distribution, indicating that its mean is now higher than 77% of the scores in the control sample.
With a control group mean at the 50th percentile, another way of describing the difference is that the
intervention has moved 27% of the sample from a score below the control mean to one above that mean.
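The percentile of the intervention mean on the control distribution can be read off an empirical cumulative frequency without assuming normality. A minimal sketch with made-up score lists (the data and function name are purely illustrative):

```python
def percentile_on_control(control_scores, value):
    """Percentage of control scores falling below the given value,
    i.e., the value's percentile on the control distribution."""
    below = sum(1 for s in control_scores if s < value)
    return 100.0 * below / len(control_scores)

# Hypothetical outcome scores for the control group
control = [10, 12, 13, 15, 16, 18, 20, 21, 23, 25]
intervention_mean = 22.0

print(percentile_on_control(control, intervention_mean))  # 80.0
```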
This comparison is shown in Figure 3.
Figure 3. Percentile values on the control distribution of the means of the control and
intervention groups
Intervention Effects Represented as Proportions Above or Below a Reference Value
The representation of an intervention effect as the percentiles on a reference distribution, as described
above, is based on the proportions of the respective groups above and below a specific threshold value on
that reference distribution. A useful variant of this approach is to select an informative threshold value on
the control distribution and depict the intervention effect in terms of the proportion of intervention cases
above or below that value in comparison to the corresponding proportions of control cases. The result then
indicates how many more of the intervention cases are in the desirable range defined by that threshold than
expected without the intervention.

When available, the most meaningful threshold value for comparing proportions of intervention and control
cases is one externally defined to have substantive meaning in the intervention context. Such threshold
values are often defined for criterion-referenced tests. For example, thresholds have been set for the National
Assessment of Educational Progress (NAEP) achievement tests with cutoff scores that designate Basic,
Proficient, and Advanced levels of performance. On the NAEP math achievement test, for instance, scores
between 299 and 333 are identified as indicating that 8th grade students are proficient. If we imagine that
we might assess a math intervention using the NAEP test, we could compare the proportion of students
in the intervention versus control conditions who scored 300 or above, that is, were at least minimally
proficient. Figure 4 shows what the results might look like. In this example, 36% of the control students
scored above that threshold level whereas 45% of the intervention students did so.
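Comparing proportions above a substantive threshold is a one-line computation per group. A minimal sketch with hypothetical NAEP-style score lists and the 300-point cutoff from the example (the data are invented for illustration):

```python
def proportion_at_or_above(scores, threshold):
    """Share of a group scoring at or above the threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)

# Hypothetical score lists; 300 marks minimal proficiency in the example
control = [280, 290, 295, 298, 305, 310, 285, 275, 302, 299]
intervention = [285, 295, 301, 305, 310, 315, 290, 300, 298, 320]

print(proportion_at_or_above(control, 300))       # 0.3
print(proportion_at_or_above(intervention, 300))  # 0.6
```

The difference between the two proportions is the quantity of interest: how many more intervention cases land in the proficient range than would be expected without the intervention.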
Figure 4. Proportion of the control and intervention distributions scoring above an externally
defined proficiency threshold score
Similar thresholds might be available from the norming data for a standardized measure. For example, the
mean standard score for the Peabody Picture Vocabulary Test (PPVT; Dunn and Dunn 2007) is 100, which
is, therefore, the mean age-adjusted score in the norming sample. Assuming representative norms, that score
represents the population average for children of any given age. For an intervention with the PPVT as an
outcome measure, the intervention effect could be described in terms of the proportion of children in the
intervention versus control samples scoring 100 or above. If the proportion for either sample is at least .50, it
tells us that their performance is average for their age. Suppose that for a control group, 32% scored 100 or
above at posttest, identifying them as a low performing sample. If 38% of the intervention group scored 100
or above, we see that the effect of the intervention has been to move 6% of the children from the below
average to the above average range. At the same time, we see that this has not been sufficient to close the gap
between them and normative performance on this measure.
With a little effort, researchers may be able to identify meaningful threshold values for measures that do
not already have one defined. Consider a multi-item scale on which teachers rate the problem behavior of
students in their classrooms. When pretest data are collected on this measure, the researcher might also ask
each teacher to nominate several children who are borderline: not presenting significant behavior problems,
but close to that point. The scores of those children could then be used to identify the approximate point on
the rating scale at which teachers begin to view the classroom behavior of a child as problematic. That score
then provides a threshold value that allows the researcher to describe the effects of, say, a classroom behavior
management program in terms of how many fewer students in the intervention condition than the control
condition fall in the problem range. Differences on the means of an arbitrarily scored multi-item rating scale, though critical to the statistical analysis, are not likely to convey the magnitude of the effect as graphically as this translation into proportions of children above a threshold teachers themselves identify.
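The borderline-nomination procedure described above can be sketched in a few lines. Everything here is illustrative: the function names and the rating data are made up, and the choice of the median of the nominated students' scores as the threshold is one reasonable option, not a prescribed rule:

```python
from statistics import median

def problem_threshold(borderline_scores):
    """Approximate the rating at which teachers begin to see behavior as
    problematic, using the median score of teacher-nominated 'borderline'
    students (one plausible summary; the text does not prescribe one)."""
    return median(borderline_scores)

def proportion_in_problem_range(scores, threshold):
    """Proportion of a sample rated at or above the threshold
    (higher ratings indicate more problem behavior)."""
    return sum(s >= threshold for s in scores) / len(scores)

# Illustrative (made-up) rating-scale data
borderline = [14, 16, 15, 17, 15]
control = [10, 18, 22, 9, 16, 20, 13, 19]
intervention = [8, 12, 16, 7, 14, 11, 18, 10]

t = problem_threshold(borderline)                    # -> 15
print(proportion_in_problem_range(control, t))       # -> 0.625
print(proportion_in_problem_range(intervention, t))  # -> 0.25
```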
Absent a substantively meaningful threshold value, an informative representation of the intervention effect might still be provided with a generic threshold value. Cohen (1988), for instance, used the control group mean as a general threshold value to create an index he called U3, one of several indices he proposed to describe the degree of non-overlap between control and intervention distributions. The example shown in Figure 3, presented earlier to illustrate the use of percentile values, similarly made the control group mean the key reference value.
With the actual scores in hand for the control and intervention groups, it is straightforward for a researcher to determine the proportion of each above (or below) the control mean. Assuming normal distributions, those proportions and the corresponding percentiles for the control and intervention means can easily be linked to the standardized mean difference effect size through a table of areas under the normal curve. The mean of a normally distributed control sample is at the 50th percentile with a z-score of zero. Adding the standardized mean difference effect size to that z-score then identifies the z-score of the intervention mean on the control distribution. With a table of areas under the normal curve, that z-score, in turn, can be converted to the equivalent percentile and proportions in the control distribution. Table 3 shows the proportion of intervention cases above the control sample mean for different standardized mean difference effect size values, assuming normal distributions; this is Cohen's (1988) U3 index. In each case, the increase over .50 indicates the additional proportion of the cases that the intervention has pushed above that control condition mean.
Rosenthal and Rubin (1982) described yet another generic threshold for comparing the relative proportions of the control and intervention groups attaining it, within a framework they called the Binomial Effect Size Display (BESD). In this scheme, the key success threshold value is the grand median of the combined intervention and control distributions. When there is no intervention effect, the means of both the intervention and control distributions fall at that grand median. As the intervention effect gets larger and the intervention and control distributions separate, smaller proportions of the control distribution and larger proportions of the intervention distribution fall above that grand median. Figure 5 depicts this situation.
Table 3. Proportion of intervention cases above the mean of the control distribution
Effect Size    Proportion above the Control Mean    Effect Size    Proportion above the Control Mean
.10 .54 1.30 .90
.20 .58 1.40 .92
.30 .62 1.50 .93
.40 .66 1.60 .95
.50 .69 1.70 .96
.60 .73 1.80 .96
.70 .76 1.90 .97
.80 .79 2.00 .98
.90 .82 2.10 .98
1.00 .84 2.20 .99
1.10 .86 2.30 .99
1.20 .88 2.40 .99
Figure 5. Binomial effect size display: proportion of cases above and below the grand median
Using the grand median as the threshold value makes the proportion of the intervention sample above the threshold value equal to the proportion of the control sample below that value. The difference between these proportions, which Rosenthal and Rubin called the BESD index, indicates how many more intervention cases are above the grand median than control cases. Assuming normal distributions, the BESD can also be linked to the standardized mean difference effect size. An additional and occasionally convenient feature of the BESD is that it is equal to the effect size expressed as a correlation; that is, the correlation between the treatment variable (coded as 1 vs. 0) and the outcome variable. Many researchers are more familiar with
correlations than standardized mean differences, so the magnitude of the effect expressed as a correlation may be somewhat more interpretable for them. Table 4 shows the proportions above and below the grand median and the BESD as the intervention effect sizes get larger. It also shows the corresponding correlational equivalents for each effect size and BESD.
Table 4. Relationship of the effect size and correlation coefficient to the BESD

Effect Size    r    Proportion of control/intervention cases above the grand median    BESD (difference between the proportions)
.10 .05 .47 / .52 .05
.20 .10 .45 / .55 .10
.30 .15 .42 / .57 .15
.40 .20 .40 / .60 .20
.50 .24 .38 / .62 .24
.60 .29 .35 / .64 .29
.70 .33 .33 / .66 .33
.80 .37 .31 / .68 .37
.90 .41 .29 / .70 .41
1.00 .45 .27 / .72 .45
1.10 .48 .26 / .74 .48
1.20 .51 .24 / .75 .51
1.30 .54 .23 / .77 .54
1.40 .57 .21 / .78 .57
1.50 .60 .20 / .80 .60
1.60 .62 .19 / .81 .62
1.70 .65 .17 / .82 .65
1.80 .67 .16 / .83 .67
1.90 .69 .15 / .84 .69
2.00 .71 .14 / .85 .71
2.10 .72 .14 / .86 .72
2.20 .74 .13 / .87 .74
2.30 .75 .12 / .87 .75
2.40 .77 .11 / .88 .77
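Under the same normality assumption, the quantities in Table 4 can be computed directly from the effect size. The sketch below assumes equal group sizes and uses the standard d-to-r conversion; the function name is our own label for illustration:

```python
import math

def besd(d: float):
    """Binomial Effect Size Display for a standardized mean difference d,
    assuming normal distributions and equal group sizes.  Converts d to
    the correlation r, then splits the groups around the grand median:
    the control proportion above it is .5 - r/2 and the intervention
    proportion is .5 + r/2, so their difference (the BESD index) equals r."""
    r = d / math.sqrt(d ** 2 + 4)   # standard d-to-r conversion for equal n
    return 0.5 - r / 2, 0.5 + r / 2, r

p_control, p_intervention, r = besd(1.00)
print(f"{p_control:.2f} / {p_intervention:.2f}, BESD = r = {r:.2f}")
```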
All the variations on representing the proportions of the intervention and control group distributions above or below a threshold value require dichotomizing the respective distributions of scores. It should be noted that we are not advocating that the statistical analysis be conducted on any such dichotomized data. It is well known that such crude dichotomizations discard useful data and generally weaken the analysis (Cohen 1983; MacCallum et al. 2002). What is being suggested is that, after the formal statistical analysis
is done and the results are known, depicting intervention effects in one of the ways described here may communicate their magnitude and practical implications better than means, standard deviations, t-test values, and the other native statistics that result directly from the analysis.
In applying any of these techniques, some consideration should be given to the shape of the respective distributions. When the outcome scores are normally distributed, the application of these techniques is relatively tidy and straightforward. When the data are not normally distributed, the respective empirical distributions can always be dichotomized to determine what proportions of cases are above or below any given reference value of interest, but the linkage between those proportions and other representations may be problematic or misleading. Percentile values, and differences in those values, for instance, take on quite a different character in skewed distributions with long tails than in normal distributions, as do the standard deviation units in which standardized mean difference effect sizes are represented.
Standard Scores and Normal Curve Equivalents (NCE)
Standard scores are a conversion of the raw scores on a norm-referenced test that draws upon the norming sample used by the test developer to characterize the distribution of scores expected from the population for which the test is intended. A linear transform of the raw scores is applied to produce tidier numbers for the mean and standard deviation. For many standardized measures, for instance, the standard score mean may be set at 100 with a standard deviation of 15.
Presenting intervention effects in terms of standard scores can make those effects easier to understand in some regards. For example, the mean scores for the intervention and control groups can be easily assessed in relation to the mean for the norming sample. Mean scores below the standardized mean score, e.g., 100, indicate that the sample, on average, scores below the mean for the population represented in the norming sample. Similarly, a standard score mean of, say, 95 for the control group and 102 for the intervention group indicates that the effect of the intervention was to improve the scores of an underperforming group to the point where their scores were more typical of the average performance of the norming sample.
An important characteristic of standard scores for tests and measures used to assess student performance is that those scores are typically adjusted for the age of the respective students. The population represented in the norming sample from which the standard scores are derived is divided into age or school grade groups, and the standard scores are determined for each group. Thus the standard scores for, say, the students in the norming sample who are in the fourth grade and average 9 years of age may be scaled to have a mean of 100 and a standard deviation of 15, but so will the standard scores for the students in the sixth grade with an average age of 11 years. Different standardized measures may use different age groups for this purpose, e.g., differing by as little as a month or two or as much as a year or more.
These age adjustments of standard scores have implications for interpreting changes in those scores over time because those changes are depicted relative to the change for same-aged groups in the norming sample. A control sample with a mean standard score of 87 on the pretest and a mean score of 87 on the posttest a year later has not failed to make gains but, rather, has simply kept abreast of the differences by age in the norming sample. On the other hand, an intervention group with a mean pretest standard score of 87 and mean
posttest score of 95 has improved at a rate faster than that represented in the comparable age differences in the norming sample. This characteristic allows some interpretation of the extent to which intervention effects accelerate growth, though that depends heavily on the assumption that the sample used in the intervention study is representative of the norming sample used by the test developer.
Reporting intervention effects in standard score units thus has some modest advantages for interpretability because of the implicit comparison with the performance of the norming sample. Moreover, the means and standard deviations for standard scores are usually assigned simple round numbers that are easy to remember when making such comparisons. In other ways standard scores are not so tidy. Most notably, standard scores typically have a rather odd range. With a normal distribution encompassing more than 99% of the scores within ±3 standard deviations, standard scores with a mean of 100 and a standard deviation of 15 will range from about 55 at the lowest to about 145 at the highest. These are not especially intuitive numbers for the bottom and top of a measurement scale. For this reason, researchers may prefer to represent treatment effects in terms of some variant of standard scores. One such variant that is well known in education is the normal curve equivalent.
Normal curve equivalents. Normal curve equivalents (NCE) are a metric developed in 1976 for the U.S. Department of Education for reporting scores on norm-referenced tests and allowing comparison across tests (Hills 1984; Tallmadge and Wood 1976). NCE scores are standard scores based on an alternative scaling of the z-scores for measured values in a normal distribution derived from the norming sample for the measure. Unlike the typical standard score, as described above, NCE scores are scaled so that they range from a low around 0 to a high of around 100, with a mean of 50. NCE scores, therefore, allow scores, differences in scores, and changes in scores to be appraised on a 100-point scale that starts at zero.
NCE scores are computed by first transforming the original raw scores into normalized z-scores. The z-score is the original score minus the mean for all the scores, divided by the standard deviation; it indicates the number of standard deviations above or below a mean of zero that the score represents. The NCE score is then computed as NCE = 21.06(z-score) + 50; that is, 21.06 times the z-score plus 50. This produces a set of NCE scores with a mean of 50 and a standard deviation of 21.06. Note that the standard deviation for NCE scores is not as tidy as the round number typically used for other standard scores, but it is required to produce the other desirable characteristics of NCE scores.
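The NCE transformation above can be sketched as follows. For simplicity this version computes z-scores linearly from the sample's own mean and standard deviation, which is a reasonable approximation only when the scores are roughly normal; in practice the normalized z-scores would come from the test's norming tables rather than from the sample itself:

```python
from statistics import mean, pstdev

def nce_scores(raw_scores):
    """Apply NCE = 21.06 * z + 50 to a list of raw scores, standardizing
    against the sample mean and (population) SD for illustration."""
    m, sd = mean(raw_scores), pstdev(raw_scores)
    return [21.06 * (x - m) / sd + 50 for x in raw_scores]

print(nce_scores([80, 90, 100, 110, 120]))
```

The resulting scores have a mean of 50 and a standard deviation of 21.06, as described in the text.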
As a standard score, NCEs are comparable across all the measures that derive and provide NCE scores from their norming samples if those samples represent the same population. Thus while a raw score of 82 on a particular reading test would not be directly comparable to the same numerical score on a different reading test measuring the same construct but scaled in a different way (i.e., with a different mean and standard deviation), the corresponding NCE scores could be compared. For example, if the NCE score corresponding to 82 on the first measure was 68 and that corresponding to 82 on the second measure was 56, we could rightly judge that the first student's reading performance was better than that of the second student.
When NCE scores are available or can be derived from the scoring scheme for a normed measure, using them to report the pretest and posttest means for the intervention and control samples may help readers better understand the nature of the effects. It is easier to judge the difference between the intervention and control means when the scores are on a 0-100 scale than when they are represented in a less intuitive metric. Thus a 5-point difference on a 0-100 scale might be easier to interpret than a 5-point difference on a raw score metric that ranges from, e.g., 143 to 240. NCE scores also preserve the advantage of standard scores described above of allowing implicit comparisons with the performance of the norming sample. Thus mean scores over 50 show better performance than the comparable norming sample, and mean scores under 50 show poorer performance.
Although standard scores, and NCE scores in particular, offer a number of advantages as a metric with which to describe intervention effects, they have several limitations. First, standard scores are all derived from the norming sample obtained by the developer of the measure. Thus these scores assume that sample is representative of the population of interest to the intervention study and that the samples in the study, in turn, are representative of the norming sample. These assumptions could easily be false for intervention studies that focus on populations distinctly appropriate for the intervention of interest. Similar discrepancies could arise for any age-adjusted standard score if the norming measures and the intervention measures were administered at very different times during the school year; differences could then be the result of predictable growth over the course of that year (Hills 1984).
Grade Equivalent Scores
A grade equivalent (GE) is a developmental score reported for many norm-referenced tests that characterizes students' achievement in terms of the grade level of the students in the normative sample with similar performance on that test. Grade equivalent scores are based on the nine-month school year and are represented in terms of the grade level and number of full months within a nine-month school year. A GE score thus corresponds to the mean level of performance at a certain point in time in the school year for a given grade. The grade level is represented by the first number in the GE score, and the month of the school year follows after a period, with months ranging from a value of 0 (September) to 9 (June). A GE of 6.2, for example, represents the score that would be achieved by an average student in the sixth grade after completion of the second full month of school. The difference between GE scores of 5.2 (November of grade 5) and 6.2 (November of grade 6) represents one calendar year's growth or change in performance.
The GE score for an individual student in a given grade, or the mean for a sample of students, is inherently comparative. A GE score that differs from the grade level a student is actually in indicates performance better or worse than that of the average students at that same grade level in the norming sample. If the mean GE for a sample of students tested near the end of the fourth grade is 5.3, for instance, these students are performing at the average level of students tested in December of the fifth grade in the norming sample; that is, they are performing better than expected for their actual grade level. Conversely, if their mean GE is 4.1, they are performing below what is expected for their actual grade level. These comparisons, of course, assume that the norming sample is representative of the population from which the research sample is drawn.
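Because GE scores encode grade and month in a single decimal number, gaps between them can be expressed in month units. A minimal sketch, assuming each 0.1 GE unit is treated as one month on the grade-equivalent scale (the function name and the example values are illustrative):

```python
def ge_gap_in_months(ge_a: float, ge_b: float) -> int:
    """Difference between two GE scores in month units, treating the digit
    after the decimal point as the month (0-9) of the school year."""
    to_units = lambda ge: round(ge * 10)
    return to_units(ge_a) - to_units(ge_b)

# A sample with mean GE 5.3 tested at roughly GE 4.8 of the school calendar
print(ge_gap_in_months(5.3, 4.8))  # -> 5 (performing 5 months ahead)
print(ge_gap_in_months(6.2, 5.2))  # -> 10 (one calendar year, as in the text)
```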
Intervention effects are represented in terms of GE scores simply as the difference between the intervention and control sample means expressed in GE units. For example, in a study of Success for All (SFA), a comprehensive reading program, Slavin and colleagues (1996) compared the mean GE scores for the intervention and control samples on a reading measure used as a key outcome variable. Though the numerical differences in mean GE scores between the samples were not reported, Figure 6 shows the approximate magnitudes of those differences. The fourth grade students in SFA, for instance, scored on average about 1.8 GE ahead of the control sample, indicating that their performance was closer to that of the mean for fourth graders in the norming sample than that of the control group. Note also that the mean GE for each sample taken by itself identifies the group's performance level relative to the normative sample. The fourth grade control sample mean, at 3.0, indicates that, on average, these students were not performing up to the grade level mean in the norming sample, whereas the SFA sample, by comparison, was almost to grade level.
Figure 6. Mean reading grade equivalent (GE) scores of Success for All and control samples [adapted from Slavin et al. 1996]
[Bar chart: mean reading GE score (1.0 to 5.0) by grade for the SFA and control samples]
The GE score is often used to communicate with educators and parents because of its simplicity and inherent meaningfulness. It makes performance relative to the norm and the magnitude of intervention effects easy to understand. Furthermore, when used to index change over time, the GE score is an intuitive way to represent growth in a student's achievement. The simplic