
DOCUMENT RESUME

ED 458 245 TM 033 406

AUTHOR       Luecht, Richard M.
TITLE        Capturing, Codifying and Scoring Complex Data for Innovative,
             Computer-Based Items.
PUB DATE     2001-04-00
NOTE         30p.; Paper presented at the Annual Meeting of the National
             Council on Measurement in Education (Seattle, WA, April 11-13,
             2001).
PUB TYPE     Reports - Descriptive (141) -- Speeches/Meeting Papers (150)
EDRS PRICE   MF01/PC02 Plus Postage.
DESCRIPTORS  *Certification; *Coding; *Computer Assisted Testing; Data
             Collection; Scaling; *Scoring; *Test Items

ABSTRACT
The Microsoft Certification Program (MCP) includes many new computer-based item types, based on complex cases involving the Windows 2000 (registered) operating system. This Innovative Item Technology (IIT) has presented challenges beyond traditional psychometric considerations, such as capturing and storing the relevant response data from examinees, codifying it, and scoring it. This paper presents an integrated, empirically based data-systems approach to processing complex scoring rules and examinee response data for IIT cases and items, highlighting data management considerations, use of various psychometric analyses to improve the coding and scoring rules, and some of the challenges of scaling the data using a partial credit item response theory model. Empirical examples from the Microsoft Innovative Item Types project are used to illustrate practical problems and solutions. (Author/SLD)


Capturing, Codifying and Scoring Complex Data for Innovative, Computer-Based Items[1]

Richard M. Luecht
University of North Carolina at Greensboro

April 2001


[1] Paper presented at the Annual Meeting of the National Council on Measurement in Education, Seattle, WA, April 2001. Symposium: Implementing Innovative Measurement Technology in a Computer-Based Certification Testing Program.


Abstract

The Microsoft Certification Program (MCP) includes many new computer-based

item types, based on complex cases involving the Windows 2000® operating

system. This Innovative Item Technology (IIT) has presented challenges that go

beyond traditional psychometric considerations, i.e., capturing and storing the

relevant response data from examinees, codifying it, and scoring it. This paper

presents an integrated, empirically based, data-systems approach to processing

complex scoring rules and examinee response data for IIT cases and items,

highlighting data management considerations, use of various psychometric

analyses to improve the coding and scoring rules, and some of the challenges of

scaling the data, using a partial credit IRT model. Empirical examples from the

Microsoft Innovative Item Types project are used to illustrate practical problems

and solutions.


Introduction

The Microsoft Certification Program (MCP) includes many new computer-

based item types, implemented using testing software components termed

Innovative Item Technology (IIT). These new item types are aptly named for

their interactive response capabilities, e.g., "create-a-tree", "drag-and-connect",

and "build-list-and-reorder" (Fitzgerald, 2001). A more mundane, academic

description would be that these IIT items require responses that involve having

the examinee select multiple entities, compound sets of entities, and, in some

cases, specify relations among those entities or sets. In any case, these new item

types tend to be based on complex, problem-solving cases involving the

Windows 2000® operating system. The MCP examinations also include

traditional multiple-choice and extended matching questions.

This paper summarizes some of the technical aspects of item-level scoring

and associated item analysis procedures involving the IIT exercises used on the

MCP examinations. For the purposes of this paper, item-level scoring is defined as

the process of codifying the response(s) and other relevant information that the

examinee provides into a numerical quantity that reliably and validly represents

the examinee's performance. For example, a traditional multiple-choice item

codifies everything the examinee does into a simple item score of one if correct or

zero if incorrect. The challenge with more complex item types lies in codifying

the relevant data and information that the examinee provides into a statistically

valid and reliable score.


The challenges of scoring complex, computer-based performance exercises

are certainly not new. Many existing computerized performance exercises

involve complex responses and/or performance actions that need to be captured

and codified, with the most salient information extracted for scoring purposes

(Bejar, 1991; Clauser et al., 1997; Bennett & Bejar, 1999). Luecht and Clauser (1998)

evaluated response formats and scoring for ten operational computerized tests

covering skills such as writing, computer programming, architectural design,

and medical patient management. In many cases, the supposedly "complex"

computerized performance tests require responses that are simply discrete

entries from a list of possible entries. In other cases, the source inputs really are

more complex, perhaps constructed responses, essays, and other extended text

entries that need to be processed and scored using fairly complicated

algorithms. As we move further along the continuum of possibility in the realm

of computerized testing, we will see even more complex performance and

response data such as speech recognition requiring neural nets for processing

(e.g., Bernstein, DeJong, Pisoni, and Townshend, 2000). On a complexity

continuum, the IIT exercises lie somewhere between discrete items and some of the

more variable free-form entries alluded to above.

This paper is divided into two sections. In the first section, I present a

simple, but generalizable framework for understanding scoring systems and the

underlying data structures for those systems. The scoring system for the MCP

examinations is an instantiation of that framework. In the second section, I


present some operational details about how that scoring framework was

implemented at Microsoft, including examples to illustrate various aspects of the

framework. Finally, in the discussion section, I will lay out some basic design

principles that may help generalize this framework to other complex

computerized exercises.

A Framework for Scoring Complex, CBT Items

There are two basic components in most scoring frameworks: (1)

responses provided by the examinee and (2) answer expressions. The actual

scoring rubrics used by a particular testing program may also involve special

transformations of the raw response data (e.g., parsing, recoding, or even

modeling via a neural net to produce a set of quantitative values) and compound

logic rules; however, these two basic components are reasonably broad in scope

and can be shown to extend to many different types of test items and

performance exercises.

Examinee Responses

Responses are generated by actions or inactions taken by the examinee

and can vary across a wide spectrum of complexity (Luecht & Clauser, 1998). At

one end of the spectrum are simple text-character responses (e.g., "A"), short

answers, or special component-event actions (e.g., RadioButton1.Value=True).

Most modern computerized tests provide controls that can capture these simple


response actions using a state indicator and/or a result value that the control

passes back as the response. For example, to record whether or not an examinee

uses a mouse to "click" a particular control object such as a check box displayed

on the computer, we could record as the response variable of interest the state of

the control object (e.g., given some generic control, we could record

generic_control.state=True or generic_control.state=False, corresponding to the

"on" or "off" state). We could likewise store the result of a keystroke or edited

text entry (e.g., generic_control.value="My response"). At the other end of the

spectrum, there are more complex data structures captured as examinee inputs

(e.g., formatted essay text, computer programs, mathematical proofs, speech

samples) that usually require substantial transformations to simplify the data.

From an abstract perspective, we can think of such custom controls as

transformation functions of the examinees' responses, i.e.,

custom_control.value = f(response set).

Inactions or non-response can also constitute a response of sorts. In most

cases, we can represent the inaction as a null state or value for the response

control.

Some of the customized response coding procedures we will discuss here

involve determining the nature of the connection between two objects or "nodes"

("drag-and-connect" items), determining the serial position of various response

options relative to other response options (i.e., for "build-list-and-reorder"

items). The actual response is stored as a function of the item for a particular


examinee on a particular test. One way of hierarchically expressing the storage

structure is as follows:

examinee.test.item.response.state=generic_control.state

and/or

examinee.test.item.response.value=generic_control.value.

Once the response.state and/or response.value are recorded, either or both can be

scored. Items may have a single response or multiple responses. Multiple

responses can be stored as multiple, simple transactions between the examinee

and the item, or as complex transactions comprising compound response

interactions between the examinee and the item.
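To make the storage structure above concrete, the following is a minimal sketch (in Python; not the MCP system's actual implementation) of a hierarchical examinee-test-item-response record in which each response carries the control's state and/or value. All names are illustrative.

from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class Response:
    # State and/or value passed back by the response control;
    # a non-response (inaction) is simply left as the null default.
    state: Optional[bool] = None
    value: Any = None

@dataclass
class ItemRecord:
    item_id: str
    # One item may hold several simple transactions (responses),
    # keyed here by the name of the capturing control.
    responses: Dict[str, Response] = field(default_factory=dict)

@dataclass
class TestRecord:
    examinee_id: str
    test_id: str
    items: Dict[str, ItemRecord] = field(default_factory=dict)

# examinee.test.item.response.state / .value, recorded from a control:
record = TestRecord(examinee_id="000123", test_id="demo")
record.items["99999"] = ItemRecord(item_id="99999")
record.items["99999"].responses["generic_control"] = Response(state=True,
                                                              value="My response")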

Answer Expressions

Answer keys and other rubrics are idealized responses or patterns of

responses that can be linked, in whole or part, to an examinee's response in the

form of a transaction. Often, the transaction resolves to "true" or "false" (i.e., a

Boolean comparison operation) and we then assign point values to either or both

alternatives. For example, dichotomous scoring of multiple-choice items

typically uses the logic of comparing a response value (character or state) to a

stored answer for each item. If the response matches the answer, a stored value

(usually one) is assigned; otherwise the examinee receives no points (a value of

zero).


At a slightly more technical level, there are really three parts to what we

can call an answer expression: (a) the response or response function being

evaluated; (b) the answer key, answer key sets, or rubrics used to compare to the

responses that we have decided to score; and (c) a logical grammar (e.g.,

Boolean) that carries out the comparison and evaluates to a particular value. For

this paper, I will limit the comparative grammar discussion to standard Boolean

algebra, which covers all of the Microsoft IIT exercises.
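As a minimal sketch of the three parts just listed, the fragment below (illustrative Python, not the IIT engine's grammar) treats an answer expression as a response function, a stored key, and a Boolean comparison that resolves to a point value.

def evaluate_expression(response_value, answer_key, points=1, default=0):
    # (a) the response (or a function of it), (b) the stored answer key,
    # and (c) a Boolean grammar -- here a simple equality -- that resolves
    # the comparison and returns the value assigned to the expression.
    return points if response_value == answer_key else default

# Dichotomous multiple-choice scoring as the simplest case:
item_score = evaluate_expression("A", "A")   # resolves true -> 1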

One important point to make is that there may be many responses, actions, or

inactions that we choose to ignore.[2] Furthermore, answers are not always

"correct". Sometimes answers denote proper responses the examinee should

make (e.g., the usual "correct" answer or a beneficial problem-solving action);

other times answers denote responses the examinee should not make (i.e., so-

called "dangerous" options). The most obvious type of answer key is a multiple-

choice answer key stored in an item database, which may be a single

character or serial position index referring to one element in a vector of

responses. Answers can also refer to compound response objects, actions or

inactions, patterns of responses, or combinations of actions and responses. For

example,

item.answer.value = "A" + "1" + "C"

[2] A multiple-choice answer is a prime example of ignoring certain responses. We score the correct answer and essentially ignore all incorrect answers.


is a compound answer.[3] Obviously, more elaborate storage schemes can be

devised using array structures, etc.

Answer keys may also be expressed as functions of other responses. This

approach, while not specifically used for the MCP examinations, allows some

answers to be dependent on other responses or functions of responses that the

examinee provides. For example, a calculated response in one part of a

performance exercise could be used in future calculations to measure whether an

examinee followed proper procedures and decision-making, regardless of

whether or not the original calculated response was correct or incorrect. If we

denote the other data used in the answer as "other_parameters.values" (for

scoring), then the answer can be expressed as

item.answer.value = f(examinee.item.response.variable, other_parameter.values).

Obviously, from a certain psychometric viewpoint,[4] specifically allowing

dependencies among responses may be undesirable, but the system ought still to be

capable of dealing with them, regardless.
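A hypothetical illustration of such a dependent answer key, with invented names and numbers (this mechanism, as noted above, is not used for the MCP examinations), would give credit for a correct later-step procedure based on whatever value the examinee produced earlier:

def dependent_answer(earlier_response: float, other_parameter: float = 1.07) -> float:
    # The "correct" later answer is a function of the examinee's own earlier
    # response plus other stored parameters, regardless of whether that
    # earlier response was itself correct.
    return round(earlier_response * other_parameter, 2)

def score_later_step(later_response: float, earlier_response: float) -> int:
    return 1 if abs(later_response - dependent_answer(earlier_response)) < 0.01 else 0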

In the remainder of this paper I will use the more generic notation,

"item.expression" in referring to "item answers". This avoids the typical

connotations of answers as always referring to correct responses.

[3] This type of concatenation usually assumes some similar transformation function applied to the raw response data to make it comparable, e.g., item.response.value = f(examinee.response).
[4] It is fairly well established that dependencies among scored responses can lead to attenuated reliability in the classical testing tradition and to violations of the local independence assumption required for certain item response theory models.


As noted above, comparing a discrete response to a single answer key

using a Boolean "IF f(x,y) THEN assign(value)..." is the typical method of scoring

traditional multiple-choice questions, where assign(value) is usually an equality

such as item.score=item.expression.value. Of course, I am assuming that an

item.expression.value of one or some other value or weight has been stored in the

item database. For example, suppose that an examinee answers "A" to a test

question (item). The scoring system records the response and then matches that

input response to the answer key stored in a database. If the answer key is

likewise "A", the response-to-answer comparison resolves as "true" and a

specified point value is assigned. Assuming dichotomous scoring, a score of u_i =

1 is assigned to the examinee's response if the response matches the answer key;

otherwise a [default] null score of u_i = 0 is assigned.

We can write the implicit scoring rule as a generalized Boolean statement,

IF item.response = item.expression-x THEN item.score = item.expression-x.value

where item.expression-x.value allows for multiple expressions for a given item

and corresponding ".value" statements to be stored as part of the

"item.expression-x" database entity (x=1,...,n expressions for a given item). It is

interesting to note that the incorrect multiple-choice "distractors" can be

assigned individual expression scores as well by simply: (i) assigning additional

expression identifiers for each item distractor; (ii) writing for each expression a

Boolean logic statement with an assigned "answer" (in this case, an incorrect


response state or value representing each distractor); and (iii) setting the value of

the expression, if it resolves to "true".
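The rule generalizes naturally to a small table of expressions per item, as in the sketch below (illustrative Python; the IIT engine stores the Boolean statements in a database rather than as code). Each expression carries its own stored value, and the item score accumulates the values of whichever expressions resolve to true, so distractor expressions can be listed with zero or nonzero weights as desired.

expressions_99999 = [
    {"id": "99999.001", "test": lambda r: r == "A", "value": 1},
    {"id": "99999.002", "test": lambda r: r == "B", "value": 0},  # distractor
    {"id": "99999.003", "test": lambda r: r == "C", "value": 0},  # distractor
    {"id": "99999.004", "test": lambda r: r == "D", "value": 1},
]

def score_item(response, expressions):
    # Sum the stored values of all expressions that resolve to "true".
    return sum(e["value"] for e in expressions if e["test"](response))

score_item("D", expressions_99999)   # -> 1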

An important IIT achievement was designing a flexible, hierarchical

database representation of the item expressions. That is, the simple "item" +

"expression" construct can be attached to either an examinee test record

(e.g., examinee.test.item.expression) or stand on its own as an item database entity

(itembank.item.expression). The "expressions" were also formalized under IIT as

Boolean logic statements, stored with the individual expression for each item.[5]

For the visually minded, Figure 1 shows a simple database "table" to

illustrate storage of the item expressions, assuming partial-credit scoring, for a

four-option, multiple-choice item with two "correct" answers, "A" and "D". By

extension, if we wanted to only assign a point for getting both answers "A" and

"D" and not_gtin for just "A" or "D", we could replace expressions 99999.001 and

99999.004 with the expression, 99999.response=("A" AND "D"), and assign a

score value of one or two.

Insert Figure 1 Here

Of course, expressions "99999.002" and "99999.003" (the distractor scoring

rules with weights of zero) add nothing to the test scores. We could develop a

[5] In principle, this mechanism can be generalized to any grammar that can be "scripted" or even delegated to customized "scoring agents", called by name and passed response- and answer-based "messages" (e.g., a neural net). Each script or agent must simply return a value (logical or real-valued). This capability, of course, will probably be active in Version 2.0 of IIT.


bit more elegant logic by simply replacing the distractor scoring rules with a

negation of the compound rule:

IF NOT(item.response = "A" OR item.response = "D") THEN item.score = item.expression.value.

Implementation of Scoring for IIT

The scoring engine used by Microsoft is proprietary and secure, for

obvious reasons. However, the principles implemented by that scoring engine

are a direct instantiation of the scoring framework described in the previous

section.

The database structure for the items and answer expressions (see Figure 1,

above) was an important development for IIT. However, there were actually

three processes that allowed Microsoft to implement scoring for IIT. First, there

needed to be a process for identifying the critical response objects or scorable

functions of those response objects (e.g., selecting the proper "nodes" and

relationships among those "nodes"). Second, there needed to be a set of

procedures for efficiently creating the item answer expressions to represent all

potentially scorable incidents. Third, there had to be a way to evaluate and

choose the more effective item answer expressions from multiple possibilities.


Identifying the Critical Response Functions

In the case of a multiple-choice item, identifying the "response objects" is

a rather simple process: ask the item writer to identify the correct answer.

However, for many of the Microsoft IIT exercises, the response options tend to be

multi-faceted, involving relations among various objects or nodes.

Table 1 contains a breakdown of the primary response components for the

current crop of IIT exercises. This table lists the five primary IIT exercise types

and the response objects for each: (1) one-best answer multiple-choice items

(MC[1]); (2) extended matching and multiple-best answer multiple-choice items

(MC[k]); (3) create-a-tree (CT); (4) build-list-and-reorder (BLR); and (5) drag-and-

connect (DC). The MC[1] and MC[k] items are very similar to standard multiple-

choice items, where the examinee chooses one or multiple responses.[6]

Insert Table 1 Here

[6] One note is important regarding the number of response choices that an examinee is allowed for multiple-pick items (MC[k]). Although it is tempting to allow the examinee to choose "all that apply", such an unconstrained multiple-choice response format has three consequences. First, it exponentially increases the viable number of combinations of scorable responses. This is easy to show by an example. Given five answer choices and the mandate to choose two, there are exactly five-take-two (i.e., C(5,2) = 10) answer combinations to consider. Allowing the examinee to choose "all that apply" requires evaluating C(5,0) + C(5,1) + C(5,2) + C(5,3) + C(5,4) + C(5,5) = 32 combinations. There are obvious storage and processing implications. Second, the "choose-all-that-apply" response format makes identifying the most critical scorable incidents per item intractable for item writers or scoring committees. Imagine having item writers evaluate the full combination of responses for an item with 12 options; there happen to be 4,096 viable combinations. Finally, "choose-all-that-apply" items require a scoring solution to check that examinees simply don't choose all the answers (that is, m take m, possibly checking for m take m-1 solutions, as well). In any case, this seemingly simple capability of allowing the examinee to choose "all that apply" usually necessitates messy solutions to a problem that can be avoided by simply having the examinees choose a particular number of options and having the computerized response handler constrain the number of selections to a particular number (k < m).
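The counts in footnote 6 can be verified directly; a short check (Python, for illustration only):

from math import comb

comb(5, 2)                              # choose exactly two of five -> 10
sum(comb(5, k) for k in range(0, 6))    # "all that apply" (every subset) -> 32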


The CT items require the examinee to choose from a list of possible

choices and align those choices under a tree-like set of nodes (Fitzgerald, 2001).

The scoring model is extensible to allow multiple layers of the tree, similar to an

outline (e.g., 1.1.1, 1.1.2, etc.). When limited to two layers (e.g., 1.1, 1.2, 2.1,

2.2,...), CT items are essentially extended matching items. The BLR items are

primarily intended to measure an examinee's ability to select and prioritize a set

of options. The relationship between any pair of objects can be defined as an

ordered list[7] (e.g., "AB" is correct, "BC" is not). It is possible to represent

"triples" and higher-order combinations in the final response list by identifying

the objects and their serial position or their position relative to one another.

The DC items consist of two objects, joined pair-wise by a particular

relationship. The objects can be network components or any relevant objects.

The relationships can be classified as directional or bi-directional and are

typically represented as line or arc "connectors" between the objects (Fitzgerald,

2001). Other relations can include types of real connectors (e.g., communication

or transmission protocols between networks). Higher order joins among the

objects are also possible (e.g., triple clusters comprised of three objects and up to

three connectors). By identifying the objects and connections (relations) in

separate lists, standard graph theory can be applied to identify the "nodes" and

"edges" that form a scorable incident.

[7] The actual scoring mechanism used by Microsoft is a bit more sophisticated and efficient than using ordered lists; however, the result is about the same.


Creating the Item Expressions

Given the enormous potential for large numbers of answer expressions for

the MC[k], CT, BLR, and DC items, a process had to be developed to efficiently

produce, encode and try out viable expressions by scoring real examinee data.

The solution was actually quite simple and has been effectively used with a

number of MCP examinations over the past year.

Initially, the item writers generate lists of viable expressions to be scored.

Then, a committee of subject-matter experts (SMEs), including the item authors

if possible, is convened and given two simple tasks. First, they are instructed to

identify the response choices or actions that an examinee MUST do, i.e., "good"

things. These are deemed positive actions and are assigned "+" on a coding

sheet.[8] The SMEs are trained to use Boolean statements (or other custom

operator functions understood by the Microsoft scoring engine) to express

positive response choices or actions. Second, the SMEs identify the response

choices or actions that an examinee MUST NOT select or perform. These

negative actions are critically "bad" things and are assigned "-" on the coding

sheet, once the expression is created. Item expressions in the initial lists that are

not chosen as "good" or "bad" are dropped. The SMEs are then given the

opportunity to add additional plus or minus expressions to the list. A final

screening is done to attempt to reduce the list of expressions to a fixed number of


n or fewer expressions, where n is fixed as a "strongly recommended policy" for

the upper bound.

Examples of MC[1] and MC[k] item expressions have already been

provided. An example of two CT item expressions might be: 88888.001: "B2"

and 88888.002: "B6" which identifies nodes 2 and 6 as child nodes of a parent

node, B. Other nodes can be similarly expressed as relate to other "parents"

(e.g., A or C parent node objects). Unions of such relations can express the need

to have both assigned to the same parent; for example, 88888.003: "B2 AND B6".

Other graphs of the relation are possible (e.g., expanding the implicit binary tree

to include "cousins" 2 and 6 and coding the expressions accordingly).

An example of a BLR item expression is 77777.001: "B" BEFORE "A", where

the "BEFORE" operator evaluates the relative sequencing of selected objects in the

response list and resolves to a Boolean result. There is no requirement that the

response objects be adjacent in the list, although that level of specificity can

certainly be enforced by using compound statements. Other mechanisms for

evaluating absolute and relative serial position of items in a list are, of course,

possible. Finally, we come to the drag-and-connect item expressions. An

example of a DC item expression for a connected pair of objects would be

66666.001: "B & 2 & D", which shows response objects "B" and "D" connected

[8] Initial work with the SME committees indicated that they could not efficiently come to consensus about logically derived scoring weights for the expressions and spent enormous amounts of time in discussion, nonetheless.


(joined) by relation "2". Other higher-order object connections could be

represented by multiple pair-wise connections, triples, etc.
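The following is a hedged sketch of the two custom comparisons just described; the operator names and data shapes are illustrative and are not the Microsoft scoring engine's actual grammar, and the DC check is written here as if the connection were bi-directional.

def before(a, b, response_list):
    # BLR: does object a appear anywhere before object b in the ordered
    # response list (adjacency not required)?
    return (a in response_list and b in response_list
            and response_list.index(a) < response_list.index(b))

def connected(obj1, relation, obj2, edges):
    # DC: are the two objects joined pair-wise by the given relation?
    return (obj1, relation, obj2) in edges or (obj2, relation, obj1) in edges

before("B", "A", ["E", "B", "A"])            # -> True
connected("B", "2", "D", {("B", "2", "D")})  # -> True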

The weight values for the item expressions were assigned using a scoring

protocol termed Dichotomous Expression Weighted Scoring (DEWS; Luecht, 2000).

Under the DEWS protocol, examinees only acquire positive points for doing

positive things OR for not doing negative things. In addition, many feasible

incidents can be treated as "neutral" and not scored at all (i.e., implicitly assigned

a weight of zero).

Specifically, positive scorable incidents (expressions) are assigned weights

of one. Negative incidents are recoded as "not doing a bad thing" and likewise

assigned weights of one. Recall that negative incidents involve choosing

incorrect answers or selecting actions that are critically wrong. By inverting the

logic of negative incidents and assigning a NOT operator to them, it is feasible to

give credit to everyone for not doing something bad, rather than penalizing the

examinees for the negative action or answer choice. The logical inversion

effectively eliminates negative weighting. The basic Boolean logic is

IF NOT(item.response = item.expression-x) THEN item.score = item.expression.value

where item.expression.value = 1 for both positive and the complementary NOT() of

negative expressions. Therefore, examinees get points for doing good things and

for not doing bad things.
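A minimal sketch of the DEWS logic as described (illustrative Python; the actual engine stores the expressions and weights in the item database): positive expressions earn a point when they resolve true, negative expressions are wrapped in NOT so that a point is earned for not committing the incident, and the expression scores sum to the item's polytomous score.

def dews_item_score(responses, positive_expressions, negative_expressions):
    score = 0
    for expr in positive_expressions:            # "MUST do" incidents
        score += 1 if expr(responses) else 0
    for expr in negative_expressions:            # "MUST NOT do" incidents
        score += 1 if not expr(responses) else 0     # the NOT(...) inversion
    return score                                 # additive polytomous score

# e.g., one required choice ("A") and one "dangerous" option ("E"):
dews_item_score({"A"},
                positive_expressions=[lambda r: "A" in r],
                negative_expressions=[lambda r: "E" in r])   # -> 2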

There are two practical advantages of the DEWS protocol. First, DEWS

maps directly into most partial-credit scoring models using number-correct

scoring or item response theory, where the item expression scores are additive

over expressions, summing to a polytomous score for each item. Second, where

necessary, it is possible to reconfigure the expression lists for a given item and

rescore the examination response data, without major complications. Luecht

(2000) discussed several other technical advantages of DEWS over using other

weighting schemes, including some reasons to avoid negative weights typically

associated with "dangerous options".

Evaluating the Effectiveness of Item Expressions

The item expression scores (weight values) are additive in the items and,

by extension, in the test scores. As a result, it is possible to use standard item

analysis procedures to evaluate the contribution of individual item expressions

(e.g., expression percent-correct scores, high-low group percentages, and

expression-test correlations).

In keeping with usual practices, item expressions that are uninformative

or behaving poorly are typically discarded. This serves two purposes: (1) it

streamlines the number of expressions, leaving only the most statistically useful

expressions; and (2) over time, the process of identifying and coding expressions

can ideally be improved by considering what works with the current items.
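A sketch of these expression-level statistics, assuming a 0/1 matrix of expression credits (rows are examinees, columns are expressions) and a corrected expression-test correlation in which the expression's own credit is removed from the total score; this is an illustration, not the IIT analysis software.

import numpy as np

def expression_stats(credits: np.ndarray):
    """credits: array of shape (n_examinees, n_expressions) holding 0/1 scores."""
    totals = credits.sum(axis=1)
    stats = []
    for j in range(credits.shape[1]):
        rest = totals - credits[:, j]            # remove the autocorrelation
        if credits[:, j].std() == 0 or rest.std() == 0:
            r = float("nan")                     # undefined for constant columns
        else:
            r = float(np.corrcoef(credits[:, j], rest)[0, 1])
        stats.append({"p": float(credits[:, j].mean()), "r_corrected": r})
    return stats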

Figure 2 shows a sample item-expression analysis report for an MC[k] item.

The top section of the report aggregates the expression-level statistics and reports

the item mean, standard deviation and item-test correlation. Note that the item-

test correlation is a product moment correlation between the expected response

function and the total score, where the latter has been properly adjusted to

remove any autocorrelation with the current item score.[9] The graphic displays

the mean item scores for "quintiles" (0 to 20th percentile, 21st to 40th percentile,

etc.). The lower section of the report displays the expression-level summary.

The values in the columns labeled "Q1", "Q2", etc., correspond to the

proportions of examinees receiving credit for each expression within each of the

quintiles. Most of the expressions seem to be performing well insofar as

showing a monotonic increase in the expression scores as we move from Q1 to

Q5. The expression for answer "C" (see boxed section) is somewhat less

informative than might be desired, since the scores actually decrease slightly at

Q3 and Q4. By reviewing information like this, informed decisions can be made

about which expressions to retain and which should be discarded.

Insert Figure 2 Here

Figure 3 shows a sample item-expression analysis report for a build-list-

and-reorder (BL) item. The complexity of the expression logic (see callout in the

[9] A more recent version of the IIT item analysis software also reports expression-test product-moment correlations to more quickly detect near-zero or negatively discriminating expressions.


lower section of Figure 3) includes both relative sequencing logic and some

negative options that the examinee should not include in the list ("C" and "D").

Considering the proportions across the quintiles, the first [boxed] expression,

"xxxxxx.01" is less informative than the second expression. Nonetheless, both

expressions are contributing positively to the total score.

Insert Figure 3 Here

Figure 4 shows a sample drag-and-connect (DC) item-expression analysis

report. The second expression, "xxxxxx.02", illustrates how the scoring engine

will give credit for not choosing a particularly complex negative set of nodes and

relations. Both expressions seem to work reasonably well, although the item is

somewhat difficult (i.e., the item mean score of 0.703, divided by its maximum

of 2 points, is equivalent to a p-value of 0.35).

Insert Figure 4 Here

Discussion

The IIT scoring process established for the MCP examinations provides

some important examples of procedural and system-level components. The

process may be specific to the MCP item types; however, there are general principles

that seem necessary when developing an integrated system for scoring

computer-based performance items.


First, the formal design of the supporting database structures for the

scoring system needs to be done early on in the process. This cannot wait until

the test is nearly operational and the data are rolling in. For the IIT system, the

table-based structure (see Figure 1) proved to be a simple, yet elegant way of

representing the answer expressions and associated scoring weights. That same

structure also seems to generalize well for almost any item that uses transactional

scoring protocols. It can further be shown to extend to more complex scoring

models using functions of the responses and answer values or other data stored

in an item-level database.

Second, the scoring system needs to flexibly represent complex response

relationships and compound logic denoting performance expectations. The

discussion of answer expressions and some of the specific examples shown

rather clearly demonstrate that inherent capability of the IIT system.

Third, the process of developing answer keys and answer expressions

needs to be efficient, especially if new items are produced en masse to support on-

going computerized testing. This is one area that can be incredibly costly in

terms of human capital (item writers, SME committees, test editors, etc.). By

moving to a simple decision model for identifying critical positive and negative

actions and responses, combined with empirical summary data to evaluate the

individual expressions (i.e., the item-expression analyses), the MCP/IIT process

has become highly efficient. It currently takes only a day or two to complete for

an entire item pool.


Finally, the scoring process needs to consider additional scaling or

calibration needs. In the present context, the expressions sum at the item level to

produce polytomous scores. Partial-credit IRT calibrations can naturally proceed,

using those scores. Because of the pre-screening of the expressions (i.e., using the

item-expression analyses to sanitize the final set of expressions per item), the

final scores will tend to be well-behaved. This becomes an additional bonus

when attempting IRT calibrations. Although not reported in this paper, a

number of calibrations performed by the author on the "cleaned" item-

expression data have consistently shown exceptional fit of all of the items, using

the Rasch-based partial credit model.
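For reference, the Rasch-based partial credit model mentioned above gives, in its standard form (the exact operational parameterization is not specified in this paper), the probability of item score x as

P(U_i = x \mid \theta) =
  \frac{\exp\left[\sum_{k=0}^{x}(\theta - \delta_{ik})\right]}
       {\sum_{h=0}^{m_i} \exp\left[\sum_{k=0}^{h}(\theta - \delta_{ik})\right]},
  \qquad x = 0, 1, \ldots, m_i,

where m_i is the maximum item score (the sum of the expression weights), the \delta_{ik} are the item step parameters, and the k = 0 term in each sum is defined to be zero.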

References

Bejar, I. I. (1991). A methodology for scoring open-ended architectural design

problems. Journal of Applied Psychology, 76, 522-532.

Bennett, R. E. & Bejar, I. I. (1999). Validity and automated scoring: It's not only

the scoring. Educational Measurement: Issues and Practice, 17, 9-17.

Bernstein, J., DeJong, J., Pisoni, D., & Townshend, B. (2000). Two experiments on

automatic scoring of spoken language proficiency. In P. Delcloque (Ed.),

Proceedings of Integrated Speech Technology in Learning 2000. Dundee,

Scotland: University of Abertay, August 2000, pp. 57-61.

Clauser, B. E., Margolis, M. J., Clyman, S. G., & Ross, L. P. (1997). Development

of automated scoring algorithms for complex performance assessments:

A comparison of two approaches. Journal of Educational Measurement, 34,

141-161.

Fitzgerald, C. (2001, April). Rewards and challenges of implementing an innovative

CBT certification examination program. Paper presented at the Annual

Meeting of the National Council on Measurement in Education, Seattle,

WA.

Luecht, R. M. (2000, April). Microsoft® Windows® 2000 Design Examinations:

A Study of Scoring Protocols. Unpublished Microsoft whitepaper.

Luecht, R. M. & Clauser, B. E. (1998, September). Test models for complex computer-

based tests. Paper presented at the ETS colloquium Computer-based

Testing: Building the Foundation of Future Assessments, Philadelphia,

PA.

Contact Information

Professor Richard M. Luecht
University of North Carolina at Greensboro
Dept. of Educational Research Methodology
Curry Building 209
P.O. Box 26171
Greensboro, NC 27402-6171

Telephone: 336.334.3473
Email: [email protected]


Table 1. Scorable Response Components for IIT Items

Item Type   Description of Response Options
MC[1]       Choose one of m response controls (m = max. distractors)
MC[k]       Choose k of m response controls (m = max. distractors, k ≤ m)
CT          Move k of m "node" objects to slots below r of p parents to form a
            hierarchical tree (i.e., parent and child relations) (k ≤ m; r ≤ p)
BLR         Select and serialize k of m objects in an ordered list (k ≤ m)
DC          Connect k of m objects (k ≤ m) via c of d relations (c ≤ d)

Notes: [1] One-best answer multiple-choice items.
       [k] Multiple-best answer, multiple-choice items.


Item ID   Expression ID   Expression               Value
99999     001             99999.response = "A"     1
99999     002             99999.response = "B"     0
99999     003             99999.response = "C"     0
99999     004             99999.response = "D"     1

Figure 1. A Sample Database Table for a Four-Option MC Item with Two Correct Answers

[Figure 2 is an item-expression analysis report for an MC[k] item (item mean = 2.859, SD = 1.073, maximum 4 points, item-test correlation = 0.320). The upper panel plots the mean empirical response function across score quintiles; the lower panel lists, for the expressions "A", "B", "C", and "NOT(D OR E)", the mean, SD, N, and the proportion of examinees receiving credit within each quintile Q1-Q5.]

Figure 2. A Sample Item and Expression Analysis Report for an MC[k] Item


[Figure 3 is an item-expression analysis report for a BL item (item mean = 1.158, SD = 0.749, maximum 2 points, item-test correlation = 0.332). The two expressions are "(((E BEFORE B) AND (NOT C)) AND (NOT D))" and "(((B BEFORE A) AND (NOT C)) AND (NOT D))", with credit proportions reported by quintile; a callout highlights the expression logic.]

Figure 3. A Sample Item and Expression Analysis Report for a BL Item


[Figure 4 is an item-expression analysis report for a DC item (item mean = 0.703, SD = 0.799, maximum 2 points, item-test correlation = 0.370). The two expressions are "(A1C AND B1C)" and "NOT(((((A2B OR A2C) OR B2C) OR B2A) OR C2A) OR C2B)", with credit proportions reported by quintile.]

Figure 4. A Sample Item and Expression Analysis Report for a DC Item

