Item Analysis: A Crash Course

Transcript
Page 1: Item Analysis: A Crash Course

Item Analysis: A Crash Course

Lou Ann Cooper, PhD

Master Educator Fellowship Program

January 10, 2008

Page 2: Item Analysis: A Crash Course

Validity

Validity refers to “the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores.”

“Validity is an integrative summary.” (Messick, 1995)

Validation is the process of building an argument supporting interpretation of test scores. (Kane, 1992)

Page 3: Item Analysis: A Crash Course

Reliability

Consistency, reproducibility, generalizability. Very norm-referenced – relative standing in a group. Only scores can be described as reliable, not tests.

Reliability depends on:

Test length – number of items
Sample of test takers – group homogeneity
Score range
Dimensionality – content and skills tested

Page 4: Item Analysis: A Crash Course

Planning the Test

Test blueprint / table of specifications:

Content, skills, domains
Level of cognition
Relative importance of each element

Linked to learning objectives.

Provides evidence for content validity.

Page 5: Item Analysis: A Crash Course

Test Blueprint: Third Year Surgery Clerkship

             Recall   Application   Problem Solving   TOTALS
Gen Surg        9          6               1          16 (27%)
ENT             6          4               0          10 (17%)
Neuro           2          1               0           3 (5%)
Ophthal         1          2               0           3 (5%)
Ortho           6          4               0          10 (17%)
Plastic         1          1               0           2 (3%)
Urol            5          4               1          10 (17%)
CT              1          1               0           2 (3%)
Vascular        2          1               1           4 (7%)
TOTALS      33 (55%)   24 (40%)         3 (5%)        60 (100%)

Rows: content area. Columns: cognitive level.
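A short script can recompute a blueprint's row and column marginals from the cell counts. A minimal Python sketch; the dictionary layout and variable names are illustrative, and the counts are those in the table above:

    # Recompute a blueprint's row/column totals and percentages.
    # Cell counts per content area: (Recall, Application, Problem Solving).
    blueprint = {
        "Gen Surg": (9, 6, 1), "ENT": (6, 4, 0), "Neuro": (2, 1, 0),
        "Ophthal": (1, 2, 0), "Ortho": (6, 4, 0), "Plastic": (1, 1, 0),
        "Urol": (5, 4, 1), "CT": (1, 1, 0), "Vascular": (2, 1, 1),
    }

    total = sum(sum(cells) for cells in blueprint.values())   # 60 items
    for area, cells in blueprint.items():                     # row totals
        print(f"{area:9s} {sum(cells):3d} ({sum(cells) / total:.0%})")
    levels = ("Recall", "Application", "Problem Solving")
    for level, count in zip(levels, map(sum, zip(*blueprint.values()))):
        print(f"{level:16s} {count:3d} ({count / total:.0%})")  # column totals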

Page 6: Item Analysis: A Crash Course

Test Statistics

A basic assumption: items measure a single subject area or underlying ability.

A general indicator of test quality is a reliability estimate.

The measure most commonly used to estimate reliability from a single administration of a test is Cronbach's alpha, a measure of internal consistency.

Page 7: Item Analysis: A Crash Course

Cronbach’s Alpha

Coefficient alpha reflects three characteristics of the test:

The interitem correlations -- the greater the relative number of positive relationships, and the stronger those relationships are, the greater the reliability. Item discrimination indices and the test's reliability coefficient are related in this regard.

The length of the test -- a test with more items will have a higher reliability, all other things being equal.

The content of the test -- generally, the more diverse the subject matter tested and the testing techniques used, the lower the reliability.

α = [n / (n − 1)] × [1 − (Σ σᵢ²) / σ_T²]

where n = number of items, σᵢ² = variance of item i, and σ_T² = total test variance.

Total test variance = the sum of the item variances + twice the unique covariances.
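A minimal Python sketch of this computation, assuming scores arrive as a students-by-items matrix; the function name and the simulated responses are illustrative only:

    import numpy as np

    def cronbach_alpha(scores: np.ndarray) -> float:
        """alpha = n/(n-1) * (1 - sum of item variances / total test variance)."""
        n = scores.shape[1]                          # number of items
        item_vars = scores.var(axis=0, ddof=1)       # one variance per item
        total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
        return (n / (n - 1)) * (1 - item_vars.sum() / total_var)

    # Example with fabricated 0/1 responses (200 examinees x 60 items):
    rng = np.random.default_rng(0)
    responses = (rng.random((200, 60)) < 0.75).astype(float)
    print(round(cronbach_alpha(responses), 3))
    # Near 0 here: independent random items share no common signal.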

Page 8: Item Analysis: A Crash Course

Descriptive Statistics

Total test score distribution:

Central tendency
Score range
Variability

Frequency distributions for individual items allow us to analyze the distractors, as sketched below.
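A minimal sketch of a per-item response frequency distribution; the response vector and key are invented for illustration:

    from collections import Counter

    responses = list("AABACADAEAABBAACADAA")   # one letter per examinee
    key = "A"

    counts = Counter(responses)
    n = len(responses)
    for option in "ABCDE":
        marker = "  <- keyed answer" if option == key else ""
        print(f"{option}: {counts[option] / n:.2f}{marker}")

Options that attract almost no responses are non-functioning distractors (see the distractor question on Page 11).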

Page 9: Item Analysis: A Crash Course

Human Behavior Exam score distribution: Mean = 75.98 (SD = 6.78), Median = 77, Mode = 72.

Page 10: Item Analysis: A Crash Course

Item Statistics

Response frequencies/distribution
Mean
Item variance/standard deviation
Item difficulty
Item discrimination

Page 11: Item Analysis: A Crash Course

Item Analysis

Examines responses to individual test items from a single administration to assess the quality of the items and the test as a whole.

Did the item function as intended?
Were the test items of appropriate difficulty?
Were the test items free from defects – technical flaws, testwiseness cues, irrelevant difficulty?
Was each of the distractors effective?

Page 12: Item Analysis: A Crash Course

Item Difficulty

For items with one correct answer worth a single point, difficulty is the proportion of students who answer the item correctly, i.e., the item mean.

When an alternative is worth other than a single point, or when there is more than one correct alternative per question, the item difficulty is the average score on that item divided by the highest number of points for any one alternative.

Ranges from 0 to 1.00 - the higher the value, the easier the question.
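Both definitions are straightforward to compute. A minimal sketch with invented data:

    # One correct answer worth one point: difficulty = proportion correct.
    responses = ["A", "B", "A", "A", "C", "A", "A", "D"]
    key = "A"
    p = sum(r == key for r in responses) / len(responses)
    print(f"p = {p:.2f}")  # 0.62: five of eight answered correctly

    # Weighted/multiple-credit case: average item score divided by the
    # highest number of points for any one alternative.
    item_scores = [2, 0, 1, 2, 2, 0, 1, 2]
    max_points = 2
    print(sum(item_scores) / len(item_scores) / max_points)  # also 0.62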

Page 13: Item Analysis: A Crash Course

Item Difficulty

Item difficulty is relevant for determining whether students have learned the concept being tested.

Plays an important role in the ability of an item to discriminate between students who know the tested material and those who do not.

To maximize item discrimination, desirable difficulty levels are slightly higher than midway between chance and perfect scores for the item.

Page 14: Item Analysis: A Crash Course

Ideal difficulty levels for MCQ

Lord FM. The relation of the reliability of multiple-choice tests to the distribution of item difficulties. Psychometrika 17:181-194, 1952.

Format                                 Ideal Difficulty
Five-response multiple-choice                70
Four-response multiple-choice                74
Three-response multiple-choice               77
Two-response multiple-choice (T/F)           85

(Ideal difficulty expressed as percent correct.)
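Page 13's rule of thumb – desirable difficulty slightly above the midpoint between chance and a perfect score – is consistent with Lord's figures read as proportions. A quick check; the loop simply restates the table above:

    # Chance score for a k-option item is 1/k; a perfect score is 1.0.
    for k, lord in ((5, .70), (4, .74), (3, .77), (2, .85)):
        midpoint = (1 / k + 1.0) / 2
        print(f"{k} options: midpoint {midpoint:.2f} vs Lord {lord:.2f}")
    # Midpoints are .60, .62, .67, .75 - each Lord value sits slightly above.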

Page 15: Item Analysis: A Crash Course

Item Difficulty

Assuming a 5-option MCQ, rough guidelines for judging difficulty:

≥ .85           Easy
.50 to < .85    Moderate
< .50           Hard

Page 16: Item Analysis: A Crash Course

Item Discrimination

The ability of an item to differentiate among students on the basis of how well they know the material being tested.

Describes how effectively the test item differentiates between high ability and low ability students.

All things being equal, highly discriminating items increase reliability.

Page 17: Item Analysis: A Crash Course

Discrimination Index

D = p_u − p_l

p_u = proportion of students in the upper group who were correct.
p_l = proportion of students in the lower group who were correct.

D ≥ .40            Satisfactory item functioning
.30 ≤ D ≤ .39      Little or no revision required
.20 ≤ D ≤ .29      Marginal – needs revision
D < .20            Eliminate or completely revise
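A minimal sketch of computing D, assuming the upper and lower groups are the top and bottom 27% of total scorers – a common convention, not something the slide specifies:

    import numpy as np

    def discrimination_index(item_correct, total_scores, fraction=0.27):
        """D = p_u - p_l for the top and bottom `fraction` of total scorers."""
        item_correct = np.asarray(item_correct, dtype=float)
        order = np.argsort(total_scores)          # examinees, worst to best
        k = max(1, int(len(order) * fraction))
        p_l = item_correct[order[:k]].mean()      # lower group, proportion correct
        p_u = item_correct[order[-k:]].mean()     # upper group, proportion correct
        return p_u - p_l

An item answered correctly by most of the upper group and few of the lower group lands in the ≥ .40 "satisfactory" range above.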

Page 18: Item Analysis: A Crash Course

Point-Biserial Correlation

Correlation between performance on a single item and performance on the total test.

- High and positive: best students get the answer correct; poorest students get it wrong.
- Low or zero: no relationship between performance on the item and the total test.
- High and negative: poorest students get the item correct; best get it wrong.

Page 19: Item Analysis: A Crash Course

Point-Biserial Correlation

r_pbis tends to be lower for tests measuring a wide range of content areas than for more homogeneous tests.

Items with low discrimination indices are often ambiguously worded.

A negative value may indicate that the item was miskeyed.

Tests with high internal consistency consist of items with mostly positive relationships with total test score.

Page 20: Item Analysis: A Crash Course

Item Discrimination

Rough guidelines for r_pbis:

> .30           Good
.10 to .30      Fair
< .10           Poor

r_pbis = [(X̄₊ − X̄_T) / sd_T] × √(p / (1 − p))

where X̄₊ = mean total score of examinees who answered the item correctly, X̄_T = mean total score of all examinees, sd_T = standard deviation of total scores, and p = item difficulty.
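A minimal Python sketch implementing this formula. Note that many scoring packages instead report a corrected item-total correlation that excludes the item from the total score; this sketch does not:

    import numpy as np

    def point_biserial(item_correct, total_scores):
        """r_pbis = ((mean total of correct group - overall mean) / sd) * sqrt(p/(1-p))."""
        item_correct = np.asarray(item_correct, dtype=bool)
        total_scores = np.asarray(total_scores, dtype=float)
        p = item_correct.mean()                     # item difficulty
        if p == 0.0 or p == 1.0:
            return 0.0                              # no variance on the item
        x_plus = total_scores[item_correct].mean()  # mean total, correct group
        x_bar = total_scores.mean()                 # mean total, all examinees
        sd_t = total_scores.std()                   # SD of total scores
        return (x_plus - x_bar) / sd_t * np.sqrt(p / (1 - p))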

Page 21: Item Analysis: A Crash Course

Item Analysis Matrix

Rows: item discrimination (r_pbis). Columns: item difficulty (P).

                         Hard        Moderate         Easy
                        P < .50   .50 ≤ P ≤ .85     P > .85
r_pbis < .10
.10 ≤ r_pbis ≤ .30
r_pbis > .30
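A small helper can place each item in a cell of this matrix using the cutoffs above; the function name is illustrative:

    def classify_item(p: float, r_pbis: float) -> tuple[str, str]:
        """Cell of the difficulty-by-discrimination matrix for one item."""
        difficulty = "Hard" if p < .50 else "Moderate" if p <= .85 else "Easy"
        discrimination = ("Poor" if r_pbis < .10
                          else "Fair" if r_pbis <= .30 else "Good")
        return difficulty, discrimination

    print(classify_item(.72, .40))   # ('Moderate', 'Good')
    print(classify_item(.98, .00))   # ('Easy', 'Poor')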

Page 22: Item Analysis: A Crash Course

ITEM 1

Response option   Percent responding   Difficulty   Point-biserial correlation
A (key)                  .72               .72                .04
B                        .03
C                        .04
D                        .09
E                        .12

ITEM 2

Response option   Percent responding   Difficulty   Point-biserial correlation
A (key)                  .72               .72                .40
B                        .03
C                        .04
D                        .09
E                        .12

Page 23: Item Analysis: A Crash Course

ITEM 4

Response option   Percent responding   Difficulty   Point-biserial correlation
A                        .01
B                        .00
C                        .01
D (key)                  .98               .98                .00
E                        .00

ITEM 3

Response option   Percent responding   Difficulty   Point-biserial correlation
A                        .02
B                        .15
C                        .05
D (key)                  .70               .70               −.19
E                        .08
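The negative point-biserial (−.19) is the miskeying signature noted on Page 19. A common follow-up check, not shown in the slides, computes a point-biserial for each option treated as if it were the key; if a distractor outcorrelates the keyed answer, the key is suspect. A sketch:

    import numpy as np

    def option_point_biserials(responses, total_scores, options="ABCDE"):
        """Point-biserial of choosing each option vs. total score.
        A distractor that outcorrelates the key suggests a miskey."""
        responses = np.asarray(responses)
        total_scores = np.asarray(total_scores, dtype=float)
        result = {}
        for opt in options:
            chose = responses == opt
            p = chose.mean()
            if p == 0.0 or p == 1.0:
                result[opt] = 0.0
                continue
            diff = total_scores[chose].mean() - total_scores.mean()
            result[opt] = diff / total_scores.std() * np.sqrt(p / (1 - p))
        return result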

Page 24: Item Analysis: A Crash Course

A Sample of MS1 Exams

                                 Difficulty .40-.80    r_pbis > .10
Course   Exam   Mean    Items    Number   Percent     Number   Percent
A          1    97.53     25        0       0.0%         7      28.0%
A          2    97.84     29        0       0.0%         6      20.7%
A          3    94.60     26        1       3.8%         1       3.8%
B          1    86.93     50        7      14.0%        33      66.0%
B          2    87.34     58       16      27.6%        41      70.7%
B          3    88.66     50        6      12.0%        29      58.0%
C          1    86.13     50       11      22.0%        41      82.0%
C          2    90.36     51        4       7.8%        33      64.7%
C          3    87.31     52       11      21.2%        35      67.3%

Criteria: difficulty index between .40 and .80; discrimination index r_pbis > .10.

Page 25: Item Analysis: A Crash Course

Cautions

Item analyses reflect internal consistency of items rather than validity.

The discrimination index is not always a measure of item quality. Extremely difficult or easy items will have low ability to discriminate, but such items are often needed to adequately sample course content and objectives.

An item may show low discrimination if the test measures many different content areas and cognitive skills.

Page 26: Item Analysis: A Crash Course

Cautions

Item analysis data are tentative. Influenced by:

Type and number of students being tested
Instructional procedures employed
Both systematic and random measurement error

If repeated use of items is possible, statistics should be recorded for each administration of each item.

Page 27: Item Analysis: A Crash Course

Recommendations

Item analysis is a valuable tool for improving items to be used in future tests – item banking.

Modify or eliminate ambiguous, misleading, or flawed items.
Helps improve instructors’ skills in test construction.
Identifies specific areas of course content that need greater emphasis or clarity.

Page 28: Item Analysis: A Crash Course

Research

Downing SM. The effects of violating standard item writing principles on tests and students: The consequences of using flawed items on achievement examinations in medical education. Adv Health Sci Educ 10:133-143, 2005.

Jozefowicz RF et al. The quality of in-house medical school examinations. Acad Med 77(2):156-161, 2002.

Muntinga JH, Schull HA. Effects of automatic item eliminations based on item test analysis. Adv Physiol Educ 31: 247-252, 2007.