
THE RELATIONSHIP BETWEEN ITEM DIFFICULTY AND

DISCRIMINATION INDICES IN MULTIPLE-CHOICE TESTS IN A

PHYSICAL SCIENCE COURSE

by

Angelica Hotiu

A thesis submitted to the faculty of Charles Schmidt College of Science in

partial fulfillment of the requirements for the Degree of

Master of Science

Florida Atlantic University

Boca Raton, Florida

December 2006


ABSTRACT

Author: Angelica Hotiu

Title: The relationship between item difficulty and discrimination

indices in multiple-choice tests in a physical science course

Institution: Florida Atlantic University

Thesis advisor: Dr. Robin Jordan

Degree: Master of Science

Year: 2006

We have developed a method of quantifying multiple-choice test items in an

introductory physical science course in terms of the various tasks required to solve

the problem. We assign a numerical level of difficulty to each task so that any

question can be assigned a degree of difficulty, which is the sum of the individual

levels of difficulty associated with each step. Using the questions and results from the

tests we have investigated the relationship between the degree of difficulty of each

question and the corresponding discrimination index. Our results indicate that as the

degree of difficulty increases so does the capability of the item to discriminate

between students with different abilities. There is a maximum degree of difficulty

beyond which the discrimination starts to decrease. At that point, test items become

too difficult. Thus, it should be possible in the future to design items that will provide

optimum discrimination.


ACKNOWLEDGEMENTS

First of all I would like to express my sincere gratitude and appreciation to Dr. Robin

Jordan for his effort, guidance, devotion and advice during the entire study and the

preparation of the thesis.

Also many thanks are extended to all members of the faculty, staff and graduate

students in the Department of Physics at FAU. I am also very grateful to Dr. Warner

Miller, Chair of the Department of Physics, and the members of my thesis committee

for all their advice and assistance.

Thanks also go to Dr. Fernando Medina for giving me the opportunity to study in

the Physics Department at Florida Atlantic University.

Finally, I want to express my gratitude to my lovely family, to my parents and

especially to my mother, who took care of my son during my studies, and to my husband,

Laurentiu, who supported and encouraged me during these studies. Thank you for all

that you have done for me.


Contents

1. Introduction
2. Theory
   2.1 Anatomy of multiple-choice questions
   2.2 Bloom's taxonomy
   2.3 The cognitive domain
   2.4 The discrimination index
   2.5 The degree of difficulty
3. Results of the Research
   3.1 Description
   3.2 Analysis of the questions for which D > 0.5
4. Concluding remarks
5. References


Chapter 1

Introduction

The classroom test is one of the most important parts of the teaching and learning

process. There are several different types of tests – those with short essay answers,

multiple-choice answers, etc. – and the type used will depend on a number of factors

such as the instructional objectives, the class size, the type of instruction, the type of

subject matter and the type of feedback required by the instructor. However, the two

most important characteristics of any achievement test are its content validity and

reliability. A test's validity is determined by how well it samples the range of

knowledge, skills, and abilities that students were supposed to acquire in the period

covered by the test. The reliability of a test depends upon grading consistency and

discrimination between students of differing performance levels.

There are two major types of multiple-choice tests, criterion-referenced tests (CRTs)

and norm-referenced tests (NRTs). In criterion-referenced testing, the goal is usually

to make a decision about whether or not an individual can demonstrate mastery in an

area of content and competencies; examples include the written part of a driving test,

certification and licensure exams. In norm-referenced testing, the goal is usually to

rank the entire set of individuals in order to make comparisons of their performances

relative to one another. In this study, we will be analyzing students’ performances on

multiple-choice tests administered during a physical science course; such tests are

NRTs.


Although multiple-choice tests are widely used, many instructors do not hold them in

high regard; some believe, for example, that multiple-choice questions are really

“multiple-guess” items, or that multiple-choice questions are only capable of testing

factual information and so are ill suited for testing higher-order cognitive skills.

However, it is now accepted that well-constructed multiple-choice items can test

many of the same cognitive skills that essay tests do. Moreover, they can be used to

diagnose student difficulties if the incorrect options are designed to reveal common

misconceptions, and they can provide a more comprehensive sampling of the subject

material because more questions can be asked. In addition, they are often more valid

and reliable than essay tests because (a) they sample material more broadly; (b)

discrimination between performance levels is easier to determine; and (c) scoring

consistency is virtually guaranteed when carried out by machine.

The validity of multiple-choice tests depends upon a systematic selection of items

with regard to both content and level of learning. Although most teachers try to select

items that sample the range of content covered in class, they often fail to consider the

level or degree of difficulty of the questions they use. Moreover, since it is easy to

develop items that require only recognition or recall of information, instructors tend

to rely heavily on those types of questions. Unfortunately, multiple-choice tests in the

instructor’s manuals that accompany textbooks are often composed exclusively of

recognition or recall items.


Psychologists have elaborate systems for classifying different cognitive levels, but for

most test planning purposes, a simple three-level scheme is sufficient to ensure that

the range of knowledge, skills, and abilities are tested appropriately. The three

categories are recall, application, and evaluation/synthesis, and they are derived from

the six levels of “Bloom’s taxonomy” of cognitive objectives [1.1]. At the lowest

level, recall, students remember specific facts, terminology, principles, or theories,

e.g., stating Newton’s 2nd Law. At the median level, application, students use their

knowledge to solve a problem or analyze a situation, e.g., using Newton’s 2nd Law to

determine the motion of an object. The highest level, evaluation and synthesis,

requires students to derive hypotheses from data, or put the parts of a problem

together, or exercise informed judgment. By analyzing the course material in terms

of these three categories, multiple-choice tests can be constructed that sample both

the range of content and the various cognitive levels at which the students must

operate. Performing this analysis is an essential step in designing multiple-choice

tests that have high validity and reliability.

The purpose of this study is not to provide a comprehensive guide for constructing

multiple-choice items; there are several excellent articles available that provide such

information [1.2, 1.3]. Our main aim is to investigate and quantify two of the most

important factors in creating valid and discriminating multiple-choice tests, namely,

the degree of difficulty and the discrimination index (we define these quantities

below) using the results of actual tests. We have been unable to find any previously

published, quantitative data on such a study, except for a private communication from


Hostetter and Haky who made a similar study of multiple-choice test items in

introductory General Chemistry [1.4]. Accordingly, we have analyzed the results of

six multiple-choice tests (labeled 1A, 1B, 2A, 2B, 3A and 3B) given in a Physical

Science class (PSC2121), at Florida Atlantic University in the Fall 2004 semester.

The number of students who took each test was ~50. The numbers 1, 2, and 3

represent the number of the test during the semester (there were five tests in total),

and A and B represent two different versions given to different groups of students but

covering the same material and designed to be as "similar" as possible. Physical

science is a general science course for non-science majors, covering topics in physics,

chemistry and earth science. However, in this study we restricted ourselves to

questions on topics that were within the physics discipline; the subject material

covered by the tests is shown in Table 1.1 and the number of students taking each test

and the average scores are shown in Table 1.2. The tests were compiled by Dr. Robin

Jordan, Physics Department, Florida Atlantic University.


Physical science and measurement: Why standardization?; The metric system; SI units

Description of the motion: Vector analysis; Resolution of vectors; Speed and velocity; Accelerated motion; A theory of motion; Galileo and the experimental motion

Planetary motion: Ptolemy's system; The Copernican revolution; "Gateway to the skies": Tycho Brahe; How planets move: Johannes Kepler; Galileo's discoveries with the telescope

Laws of motion and gravitation: Isaac Newton's "marvelous year"; The Principia; Newton's first law of motion (inertia); Newton's 2nd law of motion (force); Applications of Newton's 2nd law; Newton's 3rd law of motion (action and reaction); The "center-seeking" force

Heat, a form of energy: Temperature measurement; Temperature scales; The lowest temperature; Kinetic theory and the molecular interpretation of temperature; Temperature and heat; Specific heat; Calorimetry; Change of state; Thermal expansion

Energy conservation: Mechanical equivalent of heat; The 1st law of thermodynamics; The 2nd law of thermodynamics

Wave motion and sound: Transverse waves; Longitudinal waves; Reflection of waves; Refraction of waves; Superposition of waves (interference); Standing waves; Vibrating air columns

Light and other electromagnetic waves: The velocity of light; Electromagnetic waves; Electromagnetic spectrum (radio, TV, microwaves); Simple lenses; The optics of the eye

Electricity and magnetism: Amber phenomenon; Conductors, semiconductors, insulators; Forces between electric charges; Electric current; Electric circuits; Electric power and energy

The quantum theory of radiation and matter: Spectroscopy; The electron; X rays; Radioactivity; Planck's quantum hypothesis; Einstein's photoelectric equation

Table 1.1

The subject material (chapters and topics) covered by the tests


Test   Number of questions   Number of respondents   Scoring range (%)   Average score (%)
1A     30                    52                      26.7 – 86.7         56.9
1B     30                    53                      16.7 – 86.7         53.7
2A     30*                   52                      16.7 – 90.0         55.2
2B     31                    53                      19.4 – 77.4         52.6
3A     30                    48                      30.0 – 86.7         55.6
3B     30                    48                      26.7 – 83.3         57.0

Table 1.2

Details of the tests used in this study. The tests were part of the

PSC2121 course given in the Fall 2004 semester. * One question was

omitted from the analysis due to a technical problem.

Each question on a multiple-choice test has a discrimination index that determines

how well each question discriminates between students in the top 27% of the class on


total test score and those in the lower 27% of the class on total test score. As we

explain in more detail below, the discrimination index can range from +1 to −1; a

value of +1 means that all of the "high scorers" answered the question correctly and

all of the “low scorers” answered the question incorrectly. A value of 0 means that

the same number of high scorers and low scorers obtained the correct answer and so

the question does not discriminate between the two sub-groups of students. In this

study, we analyzed the questions from all tests for which the discrimination index was

>0.5. To determine the degree of difficulty, we identified the various tasks or

operations, such as memorization and identification, application, unit conversion,

algebraic manipulation, use of vectors, etc., required to answer each question [1.5].

We assigned a numerical level of difficulty to each task, based on the range of

knowledge, skill, and ability required, so that any question involving a number of

different steps has an overall degree of difficulty, which is the sum of the individual

levels of difficulty associated with each of the required steps.

The results indicate a definite correlation between the degree of difficulty and the

discrimination index. For example, as the degree of difficulty increases so does the

discrimination index, which is not unexpected. However, there is a maximum degree

of difficulty beyond which the discrimination index starts to fall off. At that point,

the test items become too difficult for both the high scorers and the low scorers to

answer, so that they no longer discriminate effectively. Clearly, there are two

extremes: questions that are too easy, i.e., with a small difficulty value, and those that

are too hard, i.e., with a high difficulty value. Such questions are not effective if the


purpose of a test is to produce a spread of scores, reflecting differences in student

achievement and abilities.

As part of our study, we have been able to identify the common tasks that are

involved in the most discriminating questions. Our results suggest that for optimum

discrimination, i.e., questions resulting in a discrimination index > 0.5, the degree of

difficulty lies within a reasonably well-defined range for all the tests analyzed. So, in

principle, by adopting our assigned levels of difficulty for each task or operation, one

can actually design questions with the required level of difficulty and range of

cognitive levels that will result in multiple-choice tests that truly discriminate

between students of different abilities.

Our study is very similar to the analysis of multiple-choice test items in a General

Chemistry I course, carried out by Hostetter and Haky [1.4]. Indeed, it was their

study that prompted ours. Altogether, they used the results from approximately 300

students; a somewhat larger sampling group compared with our study. Our results –

based on an analysis of physics topics - indicated a similar correlation between the

degree of difficulty and discrimination; namely, as the difficulty increased the

average discrimination increased, but there was a critical level of difficulty beyond

which the discrimination decreased.


Chapter 2

Theory

2.1. Anatomy of multiple-choice questions

A standard multiple-choice item consists of two basic parts:

• A problem (the stem)

• A list of suggested solutions (alternatives)

Typically, multiple-choice items present the stem as a complete question or as an

incomplete statement, and the list of alternatives contains one correct or best

alternative (answer) and a number of incorrect or inferior alternatives (distractors).

For example:

Stem in complete question form:

What is the weight of an object?
• The force with which it is attracted to the earth
• The amount of matter that it contains
• A measure of its inertia
• The same quantity as its mass but expressed in different units

Stem as an incomplete statement:

The weight of an object is:
• The force with which it is attracted to the earth
• The amount of matter that it contains
• A measure of its inertia
• The same quantity as its mass but expressed in different units


Students are directed to select either the correct answer or the best answer from the

list of options provided. In the correct answer form, the answer is correct beyond

question while the distractors are definitely incorrect. In the best answer version,

more than one option may be appropriate in varying degrees. The purpose of the

distractors is to appear as plausible solutions to the problem for those students who

have not achieved the required learning examined by the question. On the other hand,

the distractors will appear as implausible solutions for those students who have

achieved the required learning; only the correct answer is plausible for those

students. As we mentioned in the Introduction, multiple-choice items can be

designed to test not only the lower levels of the learning process, i.e., recall, but also

the higher-level skills of comprehension, application, and analysis, all of which may

be part of the required educational objectives of the class.

2.2 Bloom’s taxonomy

Starting in 1948, a committee of colleges, led by Benjamin Bloom, began the task of

classifying education goals and objectives. The intent was to develop a classification

system for three domains: the cognitive, the affective, and the psychomotor:

• Cognitive: mental skills (Knowledge)

• Affective: growth in feelings or emotional areas (Attitude)

• Psychomotor: manual or physical skills (Skills)


They completed their study on the cognitive domain in 1956 and the resulting

classification system is now commonly referred to as Bloom's Taxonomy of the

Cognitive Domain [2.1]. Work on the affective and psychomotor domains was

completed in 1972-3 [2.2, 2.3]. The divisions between different classes of skills or

behavior are not absolute and other systems or hierarchies have been devised in the

educational and training world. However, Bloom's taxonomy is the most easily

understood and is arguably the one most used today.

The major idea of the taxonomy of the cognitive domain is that what educators want

students to “know”, i.e., the educational objectives, can be arranged in a hierarchy,

starting from the simplest behavior or skill to the most complex. As a result, it can also

provide a useful structure within which to categorize and analyze test items.

Instructors characteristically ask questions within particular skill levels; for example,

Bloom found that over 95% of the test questions students encounter require them to

think only at the lowest possible level, i.e., the recall of information. However,

education research shows that students remember more, and can apply their

knowledge more effectively, when they have learned to handle the topic at the higher

levels of the taxonomy, where more complex skills are required [2.4, 2.5]. Clearly,


students can "know" about a topic or subject at different levels. So, it is plain there

must be a close link between the taxonomy and test questions, if the latter are

constructed with the aim of checking the skill level of students, and discriminating

between students of different abilities.

2.3 The cognitive domain

The cognitive domain involves knowledge and the development of intellectual skills.

This includes the recall or recognition of specific facts, procedural patterns, and

concepts that serve in the development of intellectual abilities and skills. There are six

major categories, which are shown in Tables 2.1 to 2.3, starting from the simplest

behavior to the most complex. The categories can be thought of as degrees or

hierarchies of difficulty.


Competence: 1. Knowledge

Skills demonstrated:
• observation and recall of information
• knowledge of dates, events, places
• knowledge of major ideas
• mastery of subject matter

Keywords: list, define, tell, describe, identify, show, label, collect, examine, tabulate, quote, name, who, when, where, etc.

Competence: 2. Comprehension

Skills demonstrated:
• understanding information
• grasp meaning
• translate knowledge into new context
• interpret facts, compare, contrast
• order, group, infer causes
• predict consequences

Keywords: summarize, describe, interpret, contrast, predict, associate, distinguish, estimate, differentiate, discuss, extend

Table 2.1

The lowest levels of intellectual behaviors within the cognitive domain identified by Bloom.


Competence: 3. Application

Skills demonstrated:
• use information
• use methods, concepts, theories in new situations
• solve problems using required skills or knowledge

Keywords: apply, demonstrate, calculate, complete, illustrate, show, solve, examine, modify, relate, change, classify, experiment, discover

Competence: 4. Analysis

Skills demonstrated:
• seeing patterns
• organization of parts
• recognition of hidden meanings
• identification of components

Keywords: analyze, separate, order, explain, connect, classify, arrange, divide, compare, select, infer

Table 2.2

The median levels of intellectual behaviors within the cognitive domain identified by Bloom.


Competence: 5. Synthesis

Skills demonstrated:
• use old ideas to create new ones
• generalize from given facts
• relate knowledge from several areas
• predict, draw conclusions

Keywords: combine, integrate, modify, rearrange, substitute, plan, create, design, invent, what if?, compose, formulate, prepare, generalize, rewrite

Competence: 6. Evaluation

Skills demonstrated:
• compare and discriminate between ideas
• assess value of theories, presentations
• make choices based on reasoned argument
• verify value of evidence
• recognize subjectivity

Keywords: assess, decide, rank, grade, test, measure, recommend, convince, select, judge, explain, discriminate, support, conclude, compare, summarize

Table 2.3

The highest levels of intellectual behaviors within the cognitive domain identified by Bloom.


2.4 The Discrimination Index

The discrimination index is a useful measure of item quality whenever the purpose of

a test is to produce a spread of scores, reflecting differences in student achievement,

so that distinctions may be made among the performances of respondents. It

measures the extent to which item responses discriminate between individuals who

have a higher overall score on a test and those who get a lower overall score. The

discrimination index is determined automatically by the FAU computer-based test

scoring and analysis system [2.6], in the following way. The distribution of

students is treated as normal and so the students' scores are arranged into two sub-

groups [2.7]:
• the top 27%: the upper group (U), and
• the bottom 27%: the lower group (L).

The discrimination index for a particular question is defined in terms of the proportion

of the students in the top group who got it correct, pU, and the proportion of the

students in the bottom group who got it correct, pL. The discrimination index is

D = pU − pL.

Note that −1 ≤ D ≤ +1. When D = 0, i.e., pU = pL, there is no discrimination; when

D = +1, i.e., pU = 1 and pL = 0, there is perfect discrimination; and when D = −1,

there is inverse discrimination, which is most likely caused by a mis-keyed item.

Thus, discrimination indices ≈ 0 are found on items so difficult that almost

everyone gets them wrong and on items so easy that almost everyone gets them right.

For instructional purposes it is important to know the content areas and type of items

that most students get right or wrong. As mentioned earlier, when multiple-choice

tests are graded using the FAU computer-based test scoring and analysis system,

values of the discrimination indices are obtained automatically [2.6].
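
To make the computation concrete, here is a minimal Python sketch of the procedure described above: rank students by total score, take the top and bottom 27%, and take the difference of the two proportions of correct answers. The data layout and function name are our own illustrative assumptions; this is not the code of the FAU scoring system [2.6].

    # Minimal sketch: discrimination index from raw results (illustrative layout,
    # not the FAU scoring system's code).
    def discrimination_index(total_scores, item_correct, fraction=0.27):
        """Compute D = pU - pL for one item.

        total_scores : each student's total test score
        item_correct : 1 if that student answered this item correctly, else 0
        fraction     : size of the upper/lower groups (27% is the classical choice [2.7])
        """
        n = len(total_scores)
        k = max(1, round(fraction * n))          # students per sub-group
        order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
        upper, lower = order[:k], order[-k:]     # top 27% and bottom 27%
        p_upper = sum(item_correct[i] for i in upper) / k
        p_lower = sum(item_correct[i] for i in lower) / k
        return p_upper - p_lower

    # Ten students; the item is answered correctly only by the highest scorers.
    scores  = [95, 90, 88, 75, 70, 65, 60, 50, 40, 30]
    correct = [ 1,  1,  1,  1,  0,  1,  0,  0,  0,  0]
    print(discrimination_index(scores, correct))   # 1.0: perfect discrimination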

2.5 The degree of difficulty

In order to carry out this study, we need a quantitative measure of the “difficulty” of a

question. The difficulty of a question is normally determined from the proportion of

the total group selecting the correct answer to that question. The following formula

may be used to calculate the difficulty factor (sometimes called the p-value):

p = (c / n) × 100

where c is the number of students who selected the correct answer and n is the total

number of respondents. A value of p = 100% indicates that all the students selected

the correct answer and so that item is very "easy". A value of 0 indicates that none of

the students selected the correct answer and so that item is very "difficult". So, this

ratio is one measure of how difficult the question was to answer.
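
As a concrete illustration, the difficulty factor defined above is a one-line computation; the response data below are invented for the example.

    # Minimal sketch: the difficulty factor p = (c / n) x 100 defined above.
    def difficulty_factor(item_correct):
        """Percentage of respondents who selected the correct answer (the p-value)."""
        return 100.0 * sum(item_correct) / len(item_correct)

    responses = [1, 0, 1, 1, 0, 1, 0, 1]    # 1 = correct, 0 = incorrect (invented data)
    print(difficulty_factor(responses))      # 62.5, a fairly easy item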


The implication is that if the purpose of the test is to test an individual’s mastery of

the material, i.e., as in a criterion-referenced test (CRT), p values of ~90% may be

expected. However, if the emphasis is to obtain a spread of scores between

individuals, as is the case in a norm-referenced test (NRT), then p values over a

broad range can be expected, with the greatest spread if all test items have a difficulty

of 50%. If we plot the difficulty, p, against the corresponding discrimination index

for each question of a test, we observe a definite correlation between the two

quantities, as shown in Figures 2.1 to 2.6 for tests 1, 2 and 3; similar behavior is

observed for all tests. First, as p increases, the discrimination index also increases,

but at a p value between ~40% and ~60%, the discrimination reaches a maximum.

When p > ~60%, the discrimination index decreases. It is generally claimed that

items for which 40% to 60% of the group passes are preferred to those that are easier

(p > 60%) or more difficult (p < 40%) [2.6]. In these particular cases, the number of

items falling into the range 40% < p < 60% in tests 1A, 1B, 2A, 2B, 3A and 3B are

8/26, 10/26, 8/31, 10/31, 12/30 and 9/30, respectively.

Figure 2.1. The discrimination index versus the difficulty factor, p, for Test 1A.

Figure 2.2. The discrimination index versus the difficulty factor, p, for Test 1B.

Figure 2.3. The discrimination index versus the difficulty factor, p, for Test 2A.

Figure 2.4. The discrimination index versus the difficulty factor, p, for Test 2B.

Figure 2.5. The discrimination index versus the difficulty factor, p, for Test 3A.

Figure 2.6. The discrimination index versus the difficulty factor, p, for Test 3B.


Note that over the range 40% < p < 60%, the discrimination index is > ~0.5.

Therefore, we will take D = 0.5 as the desirable minimum value for a "discriminating

item".

The difficulty factor, as defined above, is a property of the obtained measurements.

However, we require a definition that depends on the content of the question and

reflects the difficulty and complexity of the tasks required to find a solution. Thus, we

seek a quantitative and independent measurement of difficulty.

We have found that it is possible to assign a degree of difficulty to items on a

multiple-choice test based on the knowledge and tasks required to solve the problem.

Basically, all questions can be analyzed in terms of a combination of letters and

numbers. The letters represent the tasks or actions that students must perform in

order to obtain a complete solution to the problem; the numbers indicate the number

of times each task or action is performed. In general terms, Bloom’s taxonomy,

described above, classifies the various tasks and actions, e.g., simple memorization

(recall), unit conversion, solving a system of equations, etc., into a hierarchy. Using

the classification system as a guide, we are able to assign a numerical level of

difficulty to each of these tasks, as shown in Table 2.4.


Task                                                  Level of difficulty
Knowledge and recall (K); Identification (I)          1
Application (A)                                       2
Unit conversion, simple (C3); Simple equation (E)     3
Unit conversion (C4); Vector analysis (V)             4
Solving an equation (S5); Derivation (D)              5
Solving a system of equations (S6)                    6

Table 2.4

Numerical level of difficulty associated with each task


In this way we are able to assign an overall degree of difficulty to each question on a

test as the sum of the individual levels of difficulty encountered to obtain the answer.

In more detail, the tasks in Table 2.4 are:

Knowledge (K) or recall: a task that simply requires memorization of a

definition or a quantity that must be known in order to answer the question.

Identification (I): a task that requires identification of the process, laws or the

equation that must be used in order to solve the problem.

Application (A): a task in which knowledge is applied to a problem.

Unit conversions (C3 and C4): tasks in which a unit conversion is done in

completing the problem.

Simple equation (E): a task that involves simply inserting numbers

into an equation to obtain a solution.

Vector analysis (V): a task in which vector addition or manipulation of vectors is

required in order to solve the problem.

Derivation (D): a task that requires the derivation or proof of an algebraic

expression.

Equations (S5 and S6): tasks that involve the manipulation of one or more

equations before numbers can be input in order to obtain a result.

We provide three examples below.


1. Example of 1K question (Test 1A, Q9):

Velocity is a rate of change of

a) Speed

b) Energy

c) Distance

d) Displacement

In order to answer this question correctly, the student should know the definition of

velocity. The level of difficulty of this question is 1.

2. Example of KI question (Test 1A, Q19):

A skydiver jumps from an airplane. As her velocity of fall increases, neglecting

air resistance, her acceleration

a) Increases

b) Is constant

c) Decreases

In order to answer this question correctly, the student needs to
• Identify the type of motion for a skydiver (uniformly accelerated motion)
• Know that acceleration is constant during the motion

The difficulty level of this question is 2.


3. Example of 2IAE question (Test 1B, Q18):

What is the speed of an object after 4 s if it falls from rest with an acceleration

of 32 ft/s²?

a) 32 ft/s
b) 128 ft/s
c) 256 ft/s
d) 384 ft/s

In order to answer this question correctly, the student has to
• Identify the type of motion (uniformly accelerated motion: free fall)
• Apply the formula for the velocity in uniformly accelerated motion, v = v0 + at
• Identify that the initial speed is v0 = 0
• Solve the equation for v

The difficulty level of this question is 2 × 1 + 2 + 3 = 7.

The main aim of this study is to investigate any relationship between the level of

difficulty of a particular question and the corresponding discrimination index, using

the results of a total of six multiple-choice tests in a physical science course.


Chapter 3

Results of the Research

3.1 Description

As mentioned previously, the main aim of this study is to investigate the relationship

between the degree of difficulty of a particular question and the corresponding

discrimination index. The degree of difficulty is defined in Chapter 2 and can be

described as a numerical quantity that depends on the content of the question and

reflects the difficulty and complexity of the tasks and operations required to find a

solution. In this study, we use a combination of letters and numbers to quantify a

complete solution to a question; the letters represent the task(s) or action(s) that must

be performed and the numbers represent the number of times each task or action is

performed. As we described above, we have classified the tasks and actions into a

hierarchy, using Bloom’s taxonomy as a guide, and assigned a numerical degree of

difficulty to each of the tasks. For example, the following question:

The speed limit in a school zone is 20 mi/h and it is strictly enforced. If

you are driving at 30 km/h, are you likely to get a ticket?

(a) Yes

(b) No


can be analyzed in the following way. In order to answer this question the student

should

• convert km/h to mi/h using the relationship 1 mi = 1.61 km, i.e.,
  1 km = (1/1.61) mi = 0.621 mi. This task is C3, a simple unit
  conversion with a level of difficulty of 3.
• solve the equation for v: v = 30 km/h = 30 × 0.621 mi/h = 18.6 mi/h. This
  task is E, a simple equation with a level of difficulty of 3.
• identify that v < 20 mi/h. This task is I, identification with a level of
  difficulty of 1.

Thus, the level of difficulty of this question is 3 + 3 + 1 = 7.

The discrimination index measures the extent to which the question discriminates

between individuals who fall into the top 27% of scorers on a test and those who fall

into the bottom 27%. The index, as defined in Chapter 2, takes a value

−1 ≤ D ≤ +1 and is determined automatically for each question on a test by the FAU

computer-based test scoring service. For the purposes of this study, we claim that

questions with values of D > 0.5 qualify as "reasonable"

discriminators; hence, we only concentrated on such test items in our study.

Altogether, we analyzed the results of six multiple-choice tests (labeled 1A, 1B, 2A,

2B, 3A and 3B) given in a Physical Science class (PSC2121) at Florida Atlantic

University in the Fall 2004 semester and selected only those items for which

D > 0.5.


3.2 Analysis of the questions for which D > 0.5

In Tables 3.1 to 3.6, we list the results of our analysis of the six tests, for the

questions with D > 0.5.

In Figures 3.1 to 3.9, we show plots of the degree of difficulty and the discrimination

index for the individual tests. We have included a second order polynomial fit to the

data simply to act as a guide to the eye.
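
For readers who wish to reproduce the fits, a sketch using NumPy and the Test 1A values from Table 3.1 is given below; NumPy is our illustrative choice here, not necessarily the tool originally used to produce the figures.

    import numpy as np

    # Second-order polynomial "guide to the eye", here fitted to the Test 1A
    # values from Table 3.1.
    difficulty     = np.array([8, 9, 13, 14, 8, 6, 12])
    discrimination = np.array([0.61, 0.67, 0.58, 0.61, 0.67, 0.58, 0.81])

    coefficients = np.polyfit(difficulty, discrimination, deg=2)   # D ~ a*x^2 + b*x + c
    curve = np.poly1d(coefficients)
    for x in (6, 8, 10, 12, 14):
        print(x, round(float(curve(x)), 3))                        # fitted values along the curve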

Despite the limited statistics, due to a relatively small number of respondents (~50)

on each test, a trend does appear to emerge. The data for each test suggests that there

is a correlation between the degree of difficulty and the discrimination index.

Specifically, initially, as the degree of difficulty increases the discrimination index

also increases. However, there is an optimum degree of difficulty beyond which the

discrimination begins to fall. (Such behavior was noted previously, in Chapter 2, using

the difficulty factor, defined as p = (c / n) × 100,

where c is the number of students who selected the correct answer and n is the total

number of respondents. But, as we argued in Chapter 2, the difficulty factor is a

property of the obtained measurements and is not appropriate in our analysis, which is

why we found it necessary to introduce a quantitative and independent degree of

difficulty for each question, based on the content of a question and the difficulty and

complexity of the tasks required to find a solution.)


We can understand such behavior by identifying the two extremes, namely, (a)

questions that have a "low" degree of difficulty (< ~8), i.e., questions that are too

easy, and (b) questions with a "high" degree of difficulty (> ~14), i.e., questions that

are too hard. Questions in these regimes are less effective in discriminating between

students of different abilities because:

in case (a) more of the lower scoring students are likely to answer the

question correctly, so the test item is too easy for both the lower and

higher scorers, resulting in less discrimination, and

in case (b) fewer of the higher scoring students are likely to answer the

question correctly, so the test item is too difficult for both the high

scorers and the low scorers to answer and so it no longer discriminates

effectively.

In spite of the limited size of the data sets, we suggest that, for the tests that we have

analyzed, the optimum discrimination likely occurs when the degree of difficulty lies

in the range from ~9 to ~14. It might be tempting to compare the degrees of

difficulty for optimum discrimination from one test to the next; clearly, if students are

“learning” then we might expect the degree of difficulty for optimum discrimination

to increase! However, the sample set is simply not adequate for reliable comparisons.

These results are very similar to those obtained by Hostetter and Haky who analyzed


the results of a number of multiple-choice tests given in an introductory General

Chemistry course [1.4].

A further outcome of this study is that, in principle, it is now possible to design

multiple-choice items with a known degree of difficulty and, hence, discrimination.

Finally, in Figure 3.10 we show the correlation between the measured difficulty

factors (p), as defined in Chapter 2, and our calculated degrees of difficulty for Test

1A. In (a) we have used the complete set of values; in (b), where there is more than one

measured difficulty factor for a particular degree of difficulty, we have plotted the

averaged value. The plots indicate a close relationship between the measured and

calculated values. When the calculated degree of difficulty is very small, most of the

students get the correct answer, so p → 100%; when the calculated degree of difficulty

is very large, most students fail to get the correct answer, so p → 0.


Question number   Question type   Degree of difficulty   Discrimination index
2                 C3S5            8                      0.61
5                 2C3E            9                      0.67
12                K3AS6           13                     0.58
13                K2VS5           14                     0.61
17                2IKAE           8                      0.67
18                IAE             6                      0.58
25                KAVS5           12                     0.81

Table 3.1.

The results for Test 1A.


Question number   Question type   Degree of difficulty   Discrimination index
7                 2C3E            9                      0.58
9                 KAVI            8                      0.52
15                KAS5E           11                     0.65
17                IKAS5           9                      0.66
21                3KI3A           10                     0.66
22                AS5             7                      0.51
23                5KAI            8                      0.52
24                K2I2AS5         12                     0.64
25                KAIVS5          13                     0.58

Table 3.2.

The results for Test 1B.


Question number   Question type   Degree of difficulty   Discrimination index
2                 IKA             4                      0.55
7                 2IKAS5          10                     0.70
22                5K3AI           12                     0.69
25                KAC3E           9                      0.60
27                2I2K2AE         11                     0.84
28                2I2K2AE         11                     0.70

Table 3.3.

The results for Test 2A.


Question number   Question type   Degree of difficulty   Discrimination index
5                 2IAS5           9                      0.80
16                2KE             5                      0.66
22                KS52AI          11                     0.53
23                2K              2                      0.55
24                2KS53AI         14                     0.50

Table 3.4.

The results for Test 2B.


Question number   Question type   Degree of difficulty   Discrimination index
1                 IAE             6                      0.52
5                 2IKA            5                      0.50
8                 2AS5K           10                     0.61
9                 KAE             6                      0.57
12                KAS5            8                      0.60
14                2I2AS5E         14                     0.86
18                3I3AES5S6       23                     0.60
20                K2A2S5          15                     0.84
22                IDC3E           12                     0.77
24                K2AES5C3        16                     0.59
25                2K2AS5          11                     0.70
26                6K3A            12                     0.75
27                4KE             7                      0.66
28                4KA             6                      0.50
30                KAE             6                      0.57

Table 3.5.

The results for Test 3A.


Question number   Question type   Degree of difficulty   Discrimination index
6                 4KA             6                      0.57
7                 2AS5K           10                     0.65
14                2I2AS5E         14                     0.75
20                K2A2S5          15                     0.57
21                KAS5I           9                      0.65
22                IDC3E           12                     0.65
24                2K2AS5          11                     0.66
27                4KE             7                      0.66
28                4KA             6                      0.65
29                AE              5                      0.50

Table 3.6.

The results for Test 3B.


Figure 3.1. The discrimination index versus the degree of difficulty for Test 1A.

Figure 3.2. The discrimination index versus the degree of difficulty for Test 1B.


Figure 3.3. The discrimination index versus the degree of difficulty for Tests 1A and

1B.

Figure 3.4. The discrimination index versus the degree of difficulty for Test 2A.


Figure 3.5. The discrimination index versus the degree of difficulty for Test 2B.

Figure 3.6. The discrimination index versus the degree of difficulty for Tests 2A and

2B.


Figure 3.7. The discrimination index versus the degree of difficulty for Test 3A.

Figure 3.8. The discrimination index versus the degree of difficulty for Test 3B.


Figure 3.9. The discrimination index versus the degree of difficulty for Tests 3A and

3B.


Figure 3.10. (a) The difficulty factor (p) and (b) the averaged

difficulty factor (p_av) versus the calculated degree of difficulty for

Test 1A. A linear trend line has been fitted to the data; in (a)

R² = 0.73 and in (b) R² = 0.85.


Chapter 4

Concluding remarks

In this study we analyzed the questions and results of a total of six multiple-choice

tests in a physical science course at Florida Atlantic University in the Fall 2004 semester. Our

main aim was to quantify two of the most important factors in creating valid and

discriminating test items, namely, the degree of difficulty of each item and the

corresponding discrimination index, based on the results of actual tests, and to investigate

the relationship between them. Following the analysis of the results of a test, each

item can be assigned a "discrimination index", which determines how well it

discriminates between the top-scoring students on the test and the bottom group of

students. In this study we confined our analysis to the questions from all tests for

which the discrimination index is > 0.5.

In order to associate a degree of difficulty with each item, we identified the various

tasks or operations, such as memorization and identification, application, unit

conversion, algebraic manipulation, use of vectors, etc., required to answer each

question. We assigned a numeric level of difficulty to each task, based on the range

of knowledge, skill, and ability required, so that any question involving a number of

different steps has an overall degree of difficulty, which is the sum of the individual

levels of difficulty associated with each of the required steps.


Our results indicate a definite correlation between the degree of difficulty and the

discrimination index. For example, as the degree of difficulty increases so does the

discrimination index. However, there is an optimum degree of difficulty, in the range

~9 to ~12, beyond which the discrimination index starts to fall. At that point, the test

items become too difficult for both the high scorers and the low scorers to answer, so

the items no longer discriminate effectively. Clearly, there are two extremes:

questions that are too easy, i.e., with a low degree of difficulty, and those that are too

hard, i.e., with a high degree of difficulty. Such questions are not effective in

discriminating between students of different abilities.

By adopting our assigned levels of difficulty for each task or operation, one can

actually design questions with the required level of difficulty and range of cognitive

levels that will result in multiple-choice tests that truly discriminate between students

of different abilities. For example, the results of our study indicate that the most

discriminating questions, i.e., those with D > 0.6, have a degree of difficulty in the

interval 9–14. Using this result, we can set up the inequality

9 ≤ aK + bA + cE + dV + eS5 + fS6 ≤ 14,

where K, A, E, V, S5 and S6, etc., are the various tasks and operations required to

solve a problem, as defined in Chapter 2, and a, b, c, d, e, f represent the number of

times each action is performed. We found that it was possible to assign a numerical

level of difficulty to each of these tasks, e.g., K = 1, A = 2, E = 3, V = 4, S5 = 5,

S6 = 6, based on a hierarchy of the skills required. Therefore, the inequality

becomes

9 ≤ a + 2b + 3c + 4d + 5e + 6f ≤ 14.

Although there are many possible solutions to this inequality, there are, however,

limits. So, in principle, we can use this inequality to develop items for multiple-

choice tests in a physical science course where the requirement is to obtain optimum

discrimination between students who have mastered the course material and those

who have not. However, the design of items with optimum discrimination and the

verification under test conditions is beyond the scope of this study; we suggest it

might form the basis of further research.


References

[1.1] B.S. Bloom (1956). Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc.

[1.2] Victoria Clegg and William Cashin (1986). "Improving Multiple-Choice Tests", IDEA Paper No. 16, available from http://www.idea.ksu.edu/papers/

[1.3] "Improving Multiple Choice Questions" (1990), available from http://ctl.unc.edu/fyc8.html

[1.4] Laura Hostetter and J.E. Haky, private communication. Also, "A classification scheme for preparing effective multiple-choice questions based on item response theory", L. Hostetter and J.E. Haky, Florida Academy of Sciences, Annual Meeting, University of South Florida, March 2005.

[1.5] Note that our definition of the degree of difficulty is different from that used by the FAU Testing and Evaluation Center.

[2.1] B.S. Bloom (1956). Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc. There is a considerable amount of information available about Bloom's taxonomy on the internet; see, for example: http://www.nwlink.com/~donclark/hrd/bloom.html, http://www.coun.uvic.ca/learn/program/hndouts/bloom.html, http://www.valdosta.edu/~whuitt/psy702/cogsys/bloom.html

[2.2] D.R. Krathwohl, B.S. Bloom, and B.M. Bertram (1973). Taxonomy of Educational Objectives, the Classification of Educational Goals. Handbook II: Affective Domain. New York: David McKay Co., Inc.

[2.3] E.J. Simpson (1972). The Classification of Educational Objectives in the Psychomotor Domain. Washington, DC: Gryphon House.

[2.4] J.D. Bransford, A.L. Brown and R.R. Cocking (eds) (2000). How People Learn: Expanded Edition. Washington, D.C.: National Academy Press.

[2.5] M. Suzanne Donovan and John D. Bransford (eds) (2005). How Students Learn. Washington, D.C.: The National Academies Press.

[2.6] Handout entitled "Computer based test scoring and analysis", available from the Florida Atlantic University Testing and Evaluation Center.

[2.7] T.L. Kelley, "The selection of upper and lower groups for the validation of test items", J. Ed. Psych. 30, 17-24 (1939).