Item Analysis: A Crash Course


Transcript of Item Analysis: A Crash Course


Item Analysis: A Crash Course
Lou Ann Cooper, PhD
Master Educator Fellowship Program
January 10, 2008

Validity
Validity refers to "the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores."
"Validity is an integrative summary." (Messick, 1995)
Validation is the process of building an argument supporting interpretation of test scores. (Kane, 1992)

The concept of validity is really at the core of testing.

(Click) Validity is not a quality of the test or the test score, but of the meaning and inferences drawn from that test score. Older concepts of validity concentrated on content validity (representation of the subject area), construct validity (is the test appropriate for the skill being measured?), and criterion validity (does the information correlate with other findings?).

(Click) The current way we think about validity is that of a unified hypothesis.

(Click) Validating a test is the process of collecting empirical data and logical arguments to support that our inferences (like grades) are correct.

Reliability
Consistency, reproducibility, generalizability
Very norm-referenced: relative standing in a group
Only scores can be described as reliable, not tests.
Reliability depends on:
  Test length: number of items
  Sample of test takers: group homogeneity, score range
  Dimensionality: content and skills tested

Reliability reflects the extent to which the test would yield the same ranking of examinees if readministered with no effect from the first administration.

Reliability is necessary, but not sufficient for validity. You can measure with good reliability and that measurement can have very little to do with the inferences you are trying to make.

Neither is a property of a test.

Purpose of an assessment dictates what type of reliability is important and the sources of validity evidence necessary to support the desired inferences.

Planning the Test
Test blueprint / table of specifications

Content, skills, domains
Level of cognition
Relative importance of each element

Linked to learning objectives.

Provides evidence for content validity.

It seems obvious, but tests, even classroom tests, should be planned.

One problem especially in multi-instructor lecture-based preclinical courses is that

Test Blueprint: Third Year Surgery Clerkship

Content

Test Statistics
A basic assumption: items measure a single subject area or underlying ability.

A general indicator of test quality is a reliability estimate. The measure most commonly used to estimate reliability in a single administration of a test is Cronbach's alpha, a measure of internal consistency.

A basic assumption is that the test under analysis is composed of items measuring a single subject area or underlying ability. The quality of the test as a whole is assessed by estimating its "internal consistency." The quality of individual items is assessed by comparing students' item responses to their total test scores.

Cronbach's alpha
Coefficient alpha reflects three characteristics of the test:

The interitem correlations -- the greater the relative number of positive relationships, and the stronger those relationships are, the greater the reliability. Item discrimination indices and the test's reliability coefficient are related in this regard.

The length of the test -- a test with more items will have a higher reliability, all other things being equal.

The content of the test -- generally, the more diverse the subject matter tested and the testing techniques used, the lower the reliability.

alpha = [k / (k - 1)] x [1 - (sum of the item variances) / (total test variance)]

where k = the number of items, and total test variance = the sum of the item variances + twice the unique covariances.

Descriptive Statistics
Total test score distribution
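Cronbach's alpha can be computed directly from a score matrix; below is a minimal sketch on made-up 0/1 data (rows are examinees, columns are items — the `scores` matrix is hypothetical).

```python
# Minimal sketch: Cronbach's alpha from a 0/1 score matrix.
# The `scores` data are hypothetical; rows = examinees, columns = items.

def cronbach_alpha(scores):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / total test variance)."""
    k = len(scores[0])                       # number of items

    def pvariance(xs):                       # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    totals = [sum(row) for row in scores]    # total score per examinee
    # Total test variance = sum of item variances + twice the covariances.
    return (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))

scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(cronbach_alpha(scores), 3))   # 0.8
```

Note that when examinees who score well on some items score well on the others (strong positive covariances), the total-score variance grows relative to the sum of item variances and alpha rises, matching the interitem-correlation point above.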

Central tendency
Score range
Variability

Frequency distributions for individual items allow us to analyze the distractors.

All of the incorrect options, or distractors, should actually be distracting.

The existence of five options does not automatically guarantee that the item will operate as a five-choice item. If, in a five-option multiple-choice item, only one distractor is effective, the item is for all practical purposes a two-option item, and the probability of getting the answer correct by guessing alone is .50.

Distractors that are not chosen by any examinees should be rewritten or eliminated. They are not contributing to the test's ability to discriminate good students from poor students.

You shouldn't be concerned if each distractor is not chosen by the same number of examinees.

The fact that a majority of the students miss an item does not imply that the item should be changed, although it certainly should be checked for accuracy.

You should be suspicious of any item in which a single distractor is chosen more often than all other options.
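A distractor analysis of this kind is just a tally of option frequencies. A sketch on hypothetical responses to one five-option item (the response data and the key "B" are made up):

```python
# Hypothetical distractor analysis for one 5-option item (key = "B").
from collections import Counter

responses = ["B", "B", "A", "B", "C", "B", "B", "A",
             "B", "D", "B", "B", "A", "B", "B", "B"]
counts = Counter(responses)

for option in "ABCDE":
    n = counts.get(option, 0)
    note = "  <- never chosen: rewrite or eliminate" if n == 0 else ""
    print(f"{option}: {n}{note}")
```

Here option E draws no responses at all, so the item is effectively operating with fewer than five choices.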

Human Behavior Exam
Mean = 75.98 (SD 6.78)
Median = 77
Mode = 72

I have his permission to use this, so here is the score distribution from one of his Human Anatomy examinations from last year. I'm not picking on Rob; this is actually a pretty good test.

Item Statistics
Response frequencies/distribution
Mean
Item variance/standard deviation
Item difficulty
Item discrimination

Item statistics are used to assess the performance of individual test items on the assumption that the overall quality of a test derives from the quality of its items.

Item Analysis
Examines responses to individual test items from a single administration to assess the quality of the items and the test as a whole.
Did the item function as intended?
Were the test items of appropriate difficulty?
Were the test items free from defects?
  Technical
  Testwiseness
  Irrelevant difficulty
Was each of the distractors effective?

Item analysis is a process by which we examine students' responses to individual test items from a single administration to assess the quality of the items and the test as a whole. The basic idea is that the statistical behavior of bad items is fundamentally different from that of good items.

Some of the questions we are interested in answering are

90% of item analysis is somewhat common sense.

Potential miskey
Ambiguous items
Distractors are not working
Item is not discriminating
Negative discrimination
Too easy

Item Difficulty
For items with one correct answer worth a single point, difficulty is the percentage of students who answer an item correctly, i.e., the item mean.

When an alternative is worth other than a single point, or when there is more than one correct alternative per question, the item difficulty is the average score on that item divided by the highest number of points for any one alternative.

Ranges from 0 to 1.00 - the higher the value, the easier the question.
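The two definitions above (single-point items and multi-point items) collapse into one small function; a sketch on made-up score vectors:

```python
# Sketch of the difficulty definitions above; the score vectors are made up.

def difficulty(item_scores, max_points=1):
    """Item difficulty = average score on the item / highest possible points."""
    return (sum(item_scores) / len(item_scores)) / max_points

# Single-point item: 7 of 10 students answered correctly.
print(difficulty([1, 1, 1, 0, 1, 0, 1, 1, 0, 1]))   # 0.7
# Multi-point item worth up to 3 points.
print(difficulty([2, 3, 1, 3], max_points=3))       # 0.75
```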

Really more of an item "easiness" measure.

Item Difficulty
Item difficulty is relevant for determining whether students have learned the concept being tested. It plays an important role in the ability of an item to discriminate between students who know the tested material and those who do not. To maximize item discrimination, desirable difficulty levels are slightly higher than midway between chance and perfect scores for the item.

To maximize item discrimination, desirable difficulty levels are slightly higher than midway between chance and perfect scores for the item. (The chance score for five-option questions, for example, is 20% because one-fifth of the students responding to the question could be expected to choose the correct option by guessing.)

Ideal difficulty levels for MCQs
Lord, F.M. "The Relationship of the Reliability of Multiple-Choice Test to the Distribution of Item Difficulties," Psychometrika, 1952, 18, 181-194.

What we want to do in this type of testing situation is maximize what psychometricians refer to as "true score" variance: a person's ability free from measurement error.

Item variance is maximized when half the students get the question wrong and the other half get it right.

But then there is the effect of guessing and partial knowledge, so it turns out that the ideal difficulty levels are based on providing maximum item variance while factoring in the probability that one could get the answer by guessing alone.

Item Difficulty
Assuming a 5-option MCQ, rough guidelines for judging difficulty:

>= .85: Easy
> .50 and < .85: Moderate
<= .50: Hard

A somewhat arbitrary classification of item difficulty: "easy" if the index is 85% or above; "moderate" if it is between 51% and 84%; "hard" if it is 50% or below.

Item Discrimination
Ability of an item to differentiate among students on the basis of how well they know the material being tested.
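A sketch of this rough classification, together with the "midway between chance and perfect" point for a 5-option item (chance = .20); the cut points follow the guidelines stated above:

```python
# Sketch of the rough difficulty guidelines above, plus the "midway between
# chance and perfect" point for a 5-option MCQ (chance = 1/5 = .20).

def classify(p):
    if p >= 0.85:
        return "Easy"
    if p > 0.50:
        return "Moderate"
    return "Hard"

chance = 1 / 5
midpoint = (chance + 1.0) / 2   # midway between chance (.20) and perfect (1.00)
print(midpoint)                 # 0.6
print(classify(0.90), classify(0.60), classify(0.40))   # Easy Moderate Hard
```

The "slightly higher than midway" guideline would then put the ideal difficulty for a 5-option item a bit above .60.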

Describes how effectively the test item differentiates between high ability and low ability students.

All things being equal, highly discriminating items increase reliability.

Discrimination Index
D = pu - pl
pu = proportion of students in the upper group who were correct
pl = proportion of students in the lower group who were correct

D >= .40: satisfactory item functioning
.30 <= D <= .39: little or no revision required
.20 <= D <= .29: marginal, needs revision
D < .20: eliminate or completely revise

Positive discrimination indicates that the item is discriminating in the same direction as the total test.

The discriminating power of an achievement test item refers to the degree to which it discriminates between students with high achievement and low achievement.

A computationally simple discrimination index, appropriate for use with classroom tests, can be calculated by hand: rank the test papers in order from highest to lowest. Idenitfy upp
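That hand procedure — rank the papers by total score, split off upper and lower groups, and take D = pu - pl — can be sketched as follows. The groups here are simple halves (27% tails are a common alternative), and the (total score, item correct) pairs are made up:

```python
# Sketch of the hand procedure: rank papers by total test score, split into
# upper and lower groups (halves here; 27% tails are also common), and take
# D = p_upper - p_lower. The (total_score, item_correct) pairs are made up.

def discrimination_index(records):
    ranked = sorted(records, key=lambda r: r[0], reverse=True)  # highest first
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    p_u = sum(item for _, item in upper) / half   # proportion correct, upper group
    p_l = sum(item for _, item in lower) / half   # proportion correct, lower group
    return p_u - p_l

records = [(95, 1), (90, 1), (85, 1), (80, 0),
           (60, 1), (55, 0), (50, 0), (40, 0)]
print(discrimination_index(records))   # 0.75 - 0.25 = 0.5
```

By the guidelines above, D = .50 would indicate satisfactory item functioning; a negative D would flag an item that the weaker students answer correctly more often than the stronger ones.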