Empirical Methods to Evaluate the Instructional Sensitivity of Accountability Tests


Empirical Methods to Evaluate the Instructional Sensitivity of Accountability Tests

Stephen C. Court

Presented at the Association of Educational Assessment - Europe
10th Annual Conference: Innovation in Assessment to meet changing needs
5 - 7 November 2009
Valletta, Malta

Basic Assumption of Accountability Systems

Student test scores accurately reflect instructional quality.

Higher scores = greater learning due to higher quality teaching.

Lower scores = less learning due to lower quality teaching.

In short, it is assumed, accountability tests are instructionally sensitive.

Reality

The assumption rarely holds.

Most accountability tests are not sensitive to instruction because they simply were not constructed to be instructionally sensitive.

The tests are built to the same general "Army Alpha" specifications - originally designed during the First World War - used to differentiate between officer candidates and enlisted personnel.

Consequences of Instructional Insensitivity

In principle:
Lack of fairness
Lack of trustworthy evidence to support validity arguments

In practice:
Bad policy
Bad evaluation
Bad things happen in the classroom

The Situation in Kansas - SES

SES disparities between districts

The Situation in Kansas - Test Scores

Disparities in state assessment scores and proficiency rates

The Situation in Kansas

Can the instruction in high-poverty districts be so much worse than the instruction in low-poverty districts?

Or, are construct-irrelevant factors (such as SES) masking the effects of instruction?

The basic question:

What methods can be employed to evaluate the instructional sensitivity of accountability tests?

Definition: Instructional Sensitivity

"the degree to which students' performances on a test… accurately reflect the quality of instruction provided specifically to promote students' mastery of the knowledge and skills being assessed."

(Popham, 2008)

Two-pronged Approach

At last year's AEA conference in Hissar, Popham (2008) advocated a two-pronged approach to evaluating instructional sensitivity:

Judgmental strategies
Empirical studies

Empirical Study

Following the guidance of Popham (2007)…

three Kansas school districts conducted an empirical study of the Kansas assessments.

Description of the Kansas Study

Teachers were invited to complete a brief online rating form. Participation was voluntary.

Each teacher identified the 3-4 indicators (curricular aims) he or she had taught best during the 2008-2009 school year.

Student results were matched to responding teachers.

Study Participants

575 teachers responded:
320 teachers (grades 3-5 reading and math)
129 reading teachers (grades 6-8)
126 math teachers (grades 6-8)

14,000 students

A Gold Standard

Typically, test scores are used to confirm teacher perceptions…as if the test scores are infallible and the teachers are always suspect.

In fact, for the first 40 years of inquiry into instructional sensitivity, teacher perceptions were never even part of the mix. Instructional sensitivity studies always contrasted two sets of scores - e.g. pre-test/post-test, not-taught/taught, etc.

Asking teachers to identify their best-taught indicators has changed the instructional sensitivity issue both conceptually and operationally.

Old and New Model: Instructional Sensitivity

Old model:
A = Non-Learning
B = Learning
C = Slip
D = Maintain

New model:
A = True Fail
B = False Pass
C = False Fail
D = True Pass

Kansas Study: Propensity Score Matching

Propensity scores were generated from logistic regression: several demographic and prior performance characteristics were regressed on overall proficiency rate.

Probabilities were used to match "Not-Best-Taught" with "Best-Taught" students using the "nearest neighbor" method.

Purpose: to form quasi-"random equivalent groups" of similar size for each content area, grade level, and indicator configuration.
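The matching step can be sketched in plain Python. This is a minimal illustration, not the study's actual code: the propensity scores are assumed to have already been estimated by the logistic regression, and the student IDs are invented for the example.

```python
# Minimal nearest-neighbour propensity-score matching (illustrative only).
# Assumes propensity scores were estimated beforehand, e.g. by logistic
# regression, and that there are at least as many controls as treated units.

def nearest_neighbor_match(treated, controls):
    """Match each treated unit to the closest control by propensity score.

    treated, controls: lists of (unit_id, propensity) tuples.
    Each control is used at most once (matching without replacement).
    Returns a list of (treated_id, control_id) pairs.
    """
    available = list(controls)
    pairs = []
    for t_id, t_ps in treated:
        # Pick the still-available control with the smallest score distance.
        best = min(available, key=lambda c: abs(c[1] - t_ps))
        pairs.append((t_id, best[0]))
        available.remove(best)
    return pairs

# "Best-taught" students (treated) and the pool of other students (controls);
# IDs and scores are hypothetical.
best_taught = [("s1", 0.31), ("s2", 0.62)]
others = [("s3", 0.30), ("s4", 0.70), ("s5", 0.58)]

print(nearest_neighbor_match(best_taught, others))  # [('s1', 's3'), ('s2', 's5')]
```

Matching without replacement, as here, keeps the two groups the same size - the "similar size" property the study needed for each content area, grade level, and indicator configuration.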

Basic Contrast

The basic contrast involved "best-taught" versus "not-best-taught."

For example… Grade 3 Reading - Indicator 1:

Given average class size, 160 teachers responded.
30 teachers identified Indicator 1 as one of their best-taught.

From among the pool of other teachers and their students, propensity score matching was used to form an equivalent group of 750 students from 30 teachers.

Initial Analysis Scheme

Conduct independent t-tests with:

mean indicator score as dependent variable

Best-taught versus Other students as independent variable
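The t-test above can be sketched directly. This is a minimal pooled-variance (Student's) t statistic in plain Python; the toy scores are invented, and a real analysis would use a statistics package to obtain the p-value as well.

```python
from statistics import mean, variance

def independent_t(best, other):
    """Pooled-variance two-sample t statistic (equal-variance Student's t).

    best, other: lists of indicator scores for the two student groups.
    A positive t means the best-taught group scored higher on average.
    """
    n1, n2 = len(best), len(other)
    # variance() is the sample variance (divides by n - 1).
    sp2 = ((n1 - 1) * variance(best) + (n2 - 1) * variance(other)) / (n1 + n2 - 2)
    return (mean(best) - mean(other)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

# Hypothetical mean indicator scores for two matched groups
best_taught = [3, 4, 5, 4, 4]
others = [2, 3, 4, 3, 3]
print(round(independent_t(best_taught, others), 3))  # prints 2.236
```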

Initial Analysis Scheme

Initial logic:

If best-taught students outperform other students, the indicator is sensitive to instruction.

If mean differences are small or in the wrong direction, the indicator is insensitive to instruction.

Problem

But significant performance differences between best-taught and other students do not necessarily represent significant differences in instructional sensitivity.

Instead, instructional sensitivity is about whether the indicator accurately distinguishes effective from ineffective instruction - without confounding from any form of construct-irrelevant easiness or difficulty.

Basic Concept

In its simplest form, Popham's definition of instructional sensitivity can be depicted as a 2x2 contingency table.

In Context

Basic Concepts

Mean (Least effective) = B / (A + B)

Mean (Most effective) = D / (C + D)

But

Mean (Least effective) = False Pass / (True Fail + False Pass)

makes no sense at all.

In fact, it returns to treating the outcome as infallible and the teacher perceptions as suspect: if the pass rates for the two groups are statistically similar, then the degree of difference between least and most effective must be questioned.

Conceptually Correct

Rather than comparing means, we instead need to look at the combined proportions of true fail and true pass. That is,

(A + D) / (A + B + C + D)

which can be shortened to

(A + D) / N

Index 1

(A + D) / N ranges from 0 to 1 (Completely Insensitive to Totally Sensitive).

In practice:

Values < .50 are worse than random guessing.

Totally Sensitive

(A + D)/N = (50 + 50)/100 = 1.0

A totally sensitive test would cluster students into A or D.

Totally Insensitive

(A + D)/N = (0 + 0)/100 = 0.0

A totally insensitive test clusters students into B and C.

Useless

(A + D)/N = (25 + 25)/100 = 0.50

0.50 = mere chance.

Values < 0.50 are worse than chance.

Index 1 Equivalents

Index 1 is conceptually equivalent to:

Mann-Whitney U
Wilcoxon statistic
Transposing Cell A and Cell B, then running a t-test
Area Under the Curve (AUC) in Receiver Operating Characteristic (ROC) curve analysis
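The equivalence to the Mann-Whitney U (and hence the AUC) can be checked numerically. This is a sketch under the assumption of equal-sized matched groups, with pass/fail coded 1/0 and the 2x2 cells labeled as in the table above (A = True Fail, B = False Pass, C = False Fail, D = True Pass); the cell counts are invented.

```python
def index1(a, b, c, d):
    """(True Fail + True Pass) / N from the 2x2 table."""
    return (a + d) / (a + b + c + d)

def pairwise_auc(best_scores, other_scores):
    """Mann-Whitney AUC: P(best > other) + 0.5 * P(tie), over all pairs."""
    wins = ties = 0
    for x in best_scores:
        for y in other_scores:
            if x > y:
                wins += 1
            elif x == y:
                ties += 1
    return (wins + 0.5 * ties) / (len(best_scores) * len(other_scores))

# Hypothetical 2x2 counts for 100 matched students (50 per group)
A, B, C, D = 45, 5, 10, 40
best = [1] * D + [0] * C    # best-taught: D pass, C fail
other = [1] * B + [0] * A   # not-best-taught: B pass, A fail

print(index1(A, B, C, D), pairwise_auc(best, other))  # both print 0.85
```

The agreement is exact here because the matched groups have the same size (50 students each); with unbalanced groups, Index 1 (an accuracy) and the AUC can diverge, which is one reason the propensity-matched equal-sized groups matter.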

ROC Curve Analysis

Has rarely been used in the domain of educational research.

More commonly used in:
medicine and radiology
data mining (information retrieval)
artificial intelligence (machine learning)

The use of ROC curves was first introduced during WWII in response to the challenge of how to accurately identify enemy planes on radar screens.

AUC Context

ROC Curve Analysis - especially the AUC - is more useful for several reasons:

Easily computed
Easily interpreted
Decomposable into sensitivity and specificity:

Sensitivity = D / (C + D)    Specificity = A / (A + B)

Easily graphed as (Sensitivity) versus (1 - Specificity)

Readily expandable to polytomous situations:
Multiple test items in a subscale
Multiple subscales in a test
Multiple groups being tested
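The decomposition can be illustrated with the same 2x2 cells. Here sensitivity is taken as the pass rate among best-taught students, D/(C + D), and specificity as the fail rate among not-best-taught students, A/(A + B) - the standard definitions expressed in the slide's cell labels; the counts are invented.

```python
def decompose(a, b, c, d):
    """Sensitivity, specificity, and their average (AUC of a binary rater).

    Cells follow the 2x2 table: A = True Fail, B = False Pass,
    C = False Fail, D = True Pass.
    """
    sensitivity = d / (c + d)   # pass rate among best-taught students
    specificity = a / (a + b)   # fail rate among not-best-taught students
    auc = (sensitivity + specificity) / 2
    return sensitivity, specificity, auc

sens, spec, auc = decompose(45, 5, 10, 40)
# The single ROC point for this binary rater is (1 - spec, sens) = (0.1, 0.8)
print(sens, spec, round(auc, 3))  # prints 0.8 0.9 0.85
```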

Basic Interpretation (Descriptive)

Easy to compute: (A + D)/N
Easy to interpret…

.90-1.0 = excellent (A)
.80-.90 = good (B)
.70-.80 = fair (C)
.60-.70 = poor (D)
.50-.60 = fail (F)

Less than .50 is worse than guessing!

Basic Interpretation

Most statistical software packages - e.g., SAS, SPSS - include a ROC procedure.

The area under the curve table displays estimates of the area, the standard error of the area, confidence limits for the area, and the p-value of a hypothesis test.

ROC Hypothesis Test

The null hypothesis: true AUC = .50.

So, use of ROC Curve Analysis in this context would support rigorous psychometric inquiry into instructional sensitivity.

Yet, the A, B, C, D, F system could be reported in ways that even the least experienced reporters or policy-makers can readily understand.

Area Under Curve (AUC) - Graphed

Curve 1 = .50: pure chance…no better than a random guess.

Curve 4 = 1.0: totally sensitive - completely accurate discrimination between effective and less-effective instruction.

Curve 3 is better than Curve 2.

ROC Curve Interpretation

Greater AUC values indicate greater separation between distributions:

e.g., Most effective versus less effective
Best-taught versus Not-best-taught

1.0 = complete separation - that is, total sensitivity.

ROC Curve Interpretation

AUC values close to .50 indicate no separation between distributions.

AUC = .50 indicates:
complete overlap
no difference
might as well guess

Procedural Review

Step 1: Cross-tabulate pass/fail status with teacher identification of best-taught indicators.

Step 2: (Optional) Use logistic regression and propensity score matching to create randomly-equivalent groups - or, as close as you can get.

Step 3: Use (A + D)/N or formal ROC Curve Analysis to evaluate instructional sensitivity at the smallest grain-size possible - preferably, at the wrong/right level of individual items.
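The steps above can be sketched end to end for a single indicator. A minimal illustration in plain Python: the (best-taught, passed) records are invented, and a real analysis would repeat this per content area, grade level, and indicator, after the optional matching step.

```python
def evaluate_indicator(records):
    """Steps 1 and 3 of the procedure: cross-tabulate, then compute (A+D)/N.

    records: list of (best_taught, passed) booleans, one per matched student.
    Returns the Index 1 sensitivity estimate for one indicator.
    """
    a = sum(1 for bt, p in records if not bt and not p)  # True Fail
    b = sum(1 for bt, p in records if not bt and p)      # False Pass
    c = sum(1 for bt, p in records if bt and not p)      # False Fail
    d = sum(1 for bt, p in records if bt and p)          # True Pass
    return (a + d) / len(records)

# Four hypothetical matched students: half the best-taught pass,
# half the others fail - no better than chance.
students = [(True, True), (True, False), (False, False), (False, True)]
print(evaluate_indicator(students))  # prints 0.5
```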

In Closing

The assumption that accountability tests are sensitive to instruction rarely holds.

Inferences drawn from test scores about school quality and teaching effectiveness must be validated before action is taken.

The empirical approaches presented here should prove helpful in determining if the inference that a test is instructionally sensitive is indeed warranted.

Presenter's email address:

[email protected]

Questions, comments, or suggestions are welcome.