THE GENERALIZATION OF THE LOGISTIC DISCRIMINANT

FUNCTION ANALYSIS AND MANTEL SCORE TEST

PROCEDURES TO DETECTION OF

DIFFERENTIAL TESTLET

FUNCTIONING

DISSERTATION

Presented to the Graduate Council of the

University of North Texas in Partial

Fulfillment of the Requirements

For the degree of

DOCTOR OF PHILOSOPHY

by

Mary E. Kinard, B.A., M.Ed.

Denton, Texas

August, 1994

Kinard, Mary E., The Generalization of the Logistic

Discriminant Function Analysis and Mantel Score Test

Procedures to Detection of Differential Testlet Functioning.

Doctor of Philosophy (Educational Research), August, 1994,

117 pp., 13 tables, bibliography, 80 titles.

Two procedures for detection of differential item

functioning (DIF) for polytomous items were generalized to

detection of differential testlet functioning (DTLF). The

methods compared were the logistic discriminant function

analysis procedure for uniform and non-uniform DTLF (LDFA-U

and LDFA-N), and the Mantel score test procedure. Further

analysis included comparison of results of DTLF analysis

using the Mantel procedure with DIF analysis of individual

testlet items using the Mantel-Haenszel (MH) procedure.

Over 600 chi-squares were analyzed and compared for

rejection of null hypotheses.

Samples of 500, 1,000, and 2,000 were drawn by gender

subgroups from the NELS:88 data set, which contains

demographic and test data from over 25,000 eighth graders.

Three types of testlets (totalling 29) from the NELS:88 test

were analyzed for DTLF. The first type, the common passage

testlet, followed the conventional testlet definition:

items grouped together by a common reading passage, figure,

or graph. The other two types were based upon common

content and common process, as outlined in the NELS test

specification.

Comparison of the LDFA-U and Mantel methods for null

hypothesis rejection yielded similar results, except in the

common content testlets. As expected, no pattern was

evident in comparisons of the LDFA-N to either of the

uniform detection methods. The number of testlets flagged

for DTLF increased as the sample size increased, and most

DTLF was indicated in the common content testlets.

In comparing item significance to corresponding testlet

significance, the situation was considered inconsistent if

the testlet and at least half of the items did not match in

rejection of the null hypothesis. Most inconsistencies

occurred in the common content type of testlets.

Further research was suggested in the following areas:

DIF and DTLF comparison, implicit testlets, polytomous

models, testlet design and scoring, thick and thin matching,

and DTLF post hoc procedures.

TABLE OF CONTENTS

LIST OF TABLES

Chapter

I. INTRODUCTION

Problem Addressed in the Study
Significance of the Study
Limitations
Research Questions

II. REVIEW OF THE LITERATURE

Testlets in Computerized Adaptive Testing
Testlet Structure
Advantages of Testlets
Other Testlet Issues
Differential Item Functioning
Differential Item Functioning Methods
Uniform and Non-Uniform DIF
Differential Testlet Functioning
Polytomous Response DIF
Polytomous Observed-Score Procedures
Thick and Thin Matching

III. METHOD OF RESEARCH

Mantel Score Test Procedure
Logistic Discriminant Function Analysis Procedure
NELS:88 Data Set

IV. PRESENTATION AND ANALYSIS OF DATA

Data Analysis
Results
Research Questions 1 through 4

V. FINDINGS, CONCLUSIONS, AND RECOMMENDATIONS

Findings
Conclusions
Recommendations

APPENDIX

A. Testlet Naming Convention
B. Testlet Items
C. Chi-Square and Probability Values for Testlets and Items
D. SPSS Sample Program for Logistic Discriminant Function Analysis
E. SAS Sample Program for Mantel Score Test Procedure
F. Pascal Sample Program for Mantel-Haenszel Procedure

BIBLIOGRAPHY

LIST OF TABLES

Table

1. Factors Varied in the Study
2. Frequencies: kth Level of Ability Variable
3. NELS:88 Testlet Types in Each Section
4. NELS:88 Number of Items in Testlets
5. Non-Matches in H0 Rejection for LDFA-U vs. Mantel-U (p < .001)
6. Chi-Squares and Probabilities for Non-Matches
7. Non-Matches in H0 Rejection for LDFA-N vs. LDFA-U (p < .001)
8. Non-Matches in H0 Rejection for LDFA-N vs. Mantel-U (p < .001)
9. Summary of Significant Chi-Squares (p < .001)
10. Occurrences of Significant Chi-Squares (p < .001)
11. Items Showing Significance at p < .001
12. Inconsistencies Between Testlets and Associated Items
13. Ratio of Significant Items to Testlet Items in Flagged Cases

CHAPTER 1

INTRODUCTION

The fundamental unit in test construction can be

thought of as a group of related items rather than a single

item. Items which are grouped together by a reading

passage, graph, or other common feature can be referred to

as a testlet. Current literature suggests a trend toward

tests which use the testlet rather than the item as the unit

of analysis (Dillon, Henzel, Klass, LaDuca, & Peskin, 1993;

Haladyna, 1992; Sireci, Thissen, & Wainer, 1991; Wainer &

Kiely, 1987).

Adaptive testing is a specific area where testlets are

beneficial. Many of the psychometric problems associated

with computerized adaptive testing (CAT), such as context

effects and item ordering, can be alleviated by using the

testlet as the interchangeable unit (Wainer & Kiely, 1987;

Wainer & Lewis, 1990). As a result, the screening of

testlets to be included in a testlet pool requires

generalization of item screening techniques to testlet

screening techniques.

In particular, testing specialists must assure that

units (items or testlets) within the pool for adaptive

testing do not function differently for subgroups of

examinees. In the case of items, each item must be

scrutinized for differential item functioning (DIF)

(Steinberg, Thissen, & Wainer, 1990). For example, when

males and females have been matched on the

ability being measured by the test, any item which shows

differential impact between genders is eliminated from the

item pool. Correspondingly, testlets must be examined for

differential testlet functioning (DTLF).

In this study some appropriate statistical methods

currently used for DIF are generalized to DTLF. The hope is

that test developers will use methods explored in this study

to ensure test fairness as more tests use the testlet as the

unit of analysis. Testlets may be screened for differential

functioning using data from fixed testlet-based tests.

Those testlets which do not show DTLF may be included in

future testlet pools for selection during testlet-based

adaptive testing.

Problem Addressed in the Study

Simply using DIF methodology for the detection of DTLF

is inappropriate. The items which are grouped because of

common subject matter, such as a reading passage, are not

independent of one another. There is an item dependency

within testlets which must be considered in psychometric

analyses. In other areas of psychometrics, such as

reliability studies, it has been shown that treatment of

dependent items as independent ignores an important

statistical component and results in erroneous conclusions

(Sireci et al., 1991).

In the past year, research regarding new methodologies

for detection of DIF in polytomously scored items has been

reported (Miller & Spray, 1993; Welch & Hoover, 1993; Zwick,

Donoghue, & Grima, 1993). A polytomous score has nominal or

ordinal level response capability, as opposed to dichotomous

scoring where the score has only two possibilities, correct

or incorrect. Most of the previous DIF research has been

focused upon dichotomous scoring only (Angoff, 1993; Berk,

1982; Osterlind, 1983). Generalization of polytomous DIF

methodologies to DTLF detection is the next logical step.

Only one research study (Wainer, Sireci, & Thissen,

1991) has been reported in the area of differential testlet

functioning (DTLF). The researchers adapted a method (Bock,

1972) which is potentially less powerful than some of the

other polytomous DIF procedures. The response level was

analyzed as nominal, thereby wasting information available

in the testlet scores. The testlet score used in analysis

may be considered ordinal or interval level.

There is a need for studies which analyze DTLF by

generalization of polytomous DIF methods which use more of

the information available in testlet scores. In this study

an attempt is made to compare two of the potentially most

powerful polytomous DIF procedures as generalized to

exploration of DTLF. The methods compared are the logistic

discriminant function analysis and the Mantel score test

procedures.

Significance of the Study

Test fairness has become an important issue in

psychometrics in the past decade, both for test makers and

for test takers. Cole (1993) reported that this concern

began rapid growth in the 1960s, when opportunity for

equality in systems such as education and employment was

visibly questioned. The role of the testing community in

the equality effort is to be a neutral reporter of what it

finds.

Wainer summarized three major components of analysis to

assure test fairness: (a) Reviews of each item by subject

matter experts and demographic subgroup representatives; (b)

comparisons of validity characteristics for each major

subgroup; and (c) for each item, extensive statistical

analysis of relative performance of major demographic

subgroups (Wainer, 1993). This study concerns extensions of

the third component.

By exploring techniques to analyze relative performance

using two of the current major psychometric trends,

computerized adaptive testing and testlets, this study

contributes to the leading edge of both theory and

application of psychometrics in the attempt to make the

testing process fair for every individual.

Limitations

This study was limited in complexity by the following

factors. Two polytomous observed score methods for DTLF

detection were used: the Mantel score test procedure and the

logistic discriminant function analysis procedure. Three

stratified random samples of 500, 1000, and 2000 were

compared, and were drawn by demographic subgroups from the

NELS:88 data set (National Center for Education Statistics

[NCES], 1991). Linear explicit and implicit testlets from

the NELS:88 test were analyzed for differential functioning,

by whole testlet and by single items within each testlet.

Uniform and non-uniform DTLF were sought in the testlets in

the NELS:88 data set.

Research Questions

To fulfill the purpose of this study, the following

research questions were answered.

1. Do the Mantel score test of conditional

independence procedure and the logistic discriminant

function analysis procedure detect the same differential

testlet functioning in the same testlets?

2. Do the Mantel score test of conditional

independence procedure and the logistic discriminant

function analysis procedure detect both uniform and non-

uniform differential testlet functioning to the same extent?

3. To what extent does variation in sample size

influence detection of differential testlet functioning?

4. How do the results of differential item

functioning differ from differential testlet functioning

when the Mantel score test of conditional independence

procedure is used for both analyses?

CHAPTER 2

REVIEW OF THE LITERATURE

The fundamental unit in test construction can be

thought of as a group of related items rather than a single

item. Items which are grouped together by a reading

passage, graph, or other common feature can be referred to

as a testlet. Current literature suggests a trend toward

tests which use the testlet as the unit of analysis.

Wainer and Kiely (1987), who coined the term testlet,

suggested substituting the testlet for the single item as

the unit of analysis in test development. They defined a

testlet as "a group of items related to a single content

area that is developed as a unit and contains a fixed number

of predetermined paths that an examinee may follow" (p.

190).

Traditionally, the testlet format has been used for

reading tests where many items share a common reading

passage and for other tests such as mathematics or science

tests where a few items refer to the same diagram. However,

psychometric analyses have been focused upon only the item

as the unit of analysis until recently.

Indications are that future tests will be based upon

larger tasks in order to mirror real-world tasks more

closely. Testlets provide a more realistic testing

situation for a world where tasks are interrelated rather

than separated. The testlet-based test is a better

representation of the performance being assessed (Sireci,

Thissen, & Wainer, 1991).

Recent school reform literature has been critical of

multiple choice items which tend to measure trivial recall

cognition levels. The testlet format is generally believed

to measure higher level thinking than a separate item format

(Haladyna, 1992).

Testlets in Computerized Adaptive Testing

One current use of testlet-based testing is in the area

of adaptive testing, particularly in increasing use of

computerized adaptive testing (CAT). Adaptive testing is

not new. In the early days of testing, a subjective and

expensive method of individualized testing was used. The

examiner made knowledgeable judgements about a test-taker's

proficiency level according to the person's responses to

items. With a skillful examiner, the focus of questions was

at or near the person's ability level. This testing method

was one of the first types of adaptive tests (Wainer &

Kiely, 1987).

As computers have moved out of the basement and on to

the desk top, it is not surprising to find the testing

community taking advantage of the new capabilities

available, particularly for adaptive testing. The basic

idea behind computerized adaptive testing is that each

examinee receives a customized set of items, geared directly

and accurately to the individual's proficiency level on the

trait that is being measured.

In the most general CAT method, four steps are

followed: (a) An item of medium difficulty chosen from a

large pool of items is presented first. (b) Depending on

whether the response is correct or incorrect, an initial

estimate of the ability and accuracy is calculated,

according to the algorithm that has been programmed.

(c) The next item is closer to the ability level of the

examinee. In general, more-difficult items are given after

correct responses and easier items are given after incorrect

responses. At each step, the item in the pool which gives

the most information about the person's ability is chosen

next. (d) The process continues until some stopping point

is reached. The stopping rule may be based upon a specified

level of accuracy (standard error), a maximum number of

items, or a maximum amount of time.
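
The four steps above can be sketched as a short loop. The following is a minimal illustration under stated assumptions, not the algorithm of any operational CAT: it assumes a Rasch (one-parameter) response model, a grid-based posterior mean with a standard normal prior as the ability estimate, and the posterior standard deviation as the accuracy measure. The names `run_cat`, `item_bank`, and `answer` are hypothetical.

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate(responses, grid):
    """Posterior mean and SD of ability on a grid, with a standard normal prior."""
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)            # prior weight
        for b, u in responses:
            p = p_correct(theta, b)
            w *= p if u else (1.0 - p)                # likelihood of each response
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

def run_cat(item_bank, answer, max_items=10, target_se=0.4):
    """Administer items adaptively; answer(b) returns True for a correct response."""
    grid = [g / 10.0 for g in range(-40, 41)]
    remaining = list(item_bank)
    responses = []
    theta, se = 0.0, float("inf")                     # (a) begin at medium difficulty
    while remaining and len(responses) < max_items and se > target_se:
        # (c) Under the Rasch model, item information p(1 - p) is largest
        # for the item whose difficulty is closest to the current estimate.
        b = min(remaining, key=lambda d: abs(d - theta))
        remaining.remove(b)
        responses.append((b, answer(b)))
        theta, se = estimate(responses, grid)         # (b) update ability and accuracy
    # (d) The loop ends on the accuracy rule, the item limit, or an empty pool.
    return theta, se, [b for b, _ in responses]
```

With a pool of difficulties from -2 to 2 and a simulated examinee who answers correctly whenever the difficulty is below 1, `run_cat` starts at difficulty 0 and moves to a harder item after the first correct response. A time-based stopping rule is omitted from the sketch.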

Item pools are put together by skilled test developers

who must satisfy such requirements as spanning a certain

difficulty range and making certain items are free of

differential functioning. The items are then calibrated by

using an item response model, and the item's estimated

parameters are tabled for use during the CAT (Wainer &

Kiely, 1987). It follows that testlets must be screened for

inclusion in testlet pools for an adaptive test which is

testlet based rather than item based.

Adaptive tests typically require less time and fewer

items than do traditional tests. CAT can be a more accurate

assessment of proficiency level, particularly at the upper

and lower extremes of the ability scale (Hambleton, Zaal, &

Pieters, 1991).

Testlet Structure

The structure of testlets in a testlet based adaptive

test may be hierarchical, linear, or a mixture of

hierarchical and linear. A hierarchical branching structure

"routes examinees to successive items of greater or lesser

difficulty depending on their previous responses and

culminates in a series of ordered score categories" (Wainer

& Kiely, 1987, p. 190). In a linear structured testlet, all

examinees respond to all items, from the first to the last.

Depending upon the purpose, a test may be mixed by combining

hierarchical and linear testlets. For example, two

hierarchical tests being joined linearly are useful when

different content areas are included in the same test, and

each person begins at the same starting point within each

hierarchical testlet. Examples of mixed testlet designs are

described in Wainer and Lewis (1990).

Two-Stage Testing

A routing testlet is a linear testlet, ordered by

difficulty level, which provides an initial estimate of an

individual's ability level. The examinee is then routed to

one of several second-stage tests, chosen as a function of

the estimated ability from the routing test. The second-

stage test may be either a linearly or a hierarchically

structured testlet. This design, called two-stage testing,

is a popular example of the mixed structure.

Advantages of Testlets

The current algorithmic methods used in CAT are not

without problems. Wainer and Kiely (1987) identified three

difficulties associated with CAT that are alleviated with

the use of testlet-based tests: context effects, item

ordering, and content balancing. Testlets add a

manageability factor in each of these areas.

Manageability

Because a limited number of paths exist, test

developers can more carefully scrutinize individual tests or

paths. Then problems which become evident can be more

easily corrected.

Context Effects

When one item affects a person's response on a

subsequent item, the items are not independent. For

example, the earlier item may give a clue or answer to a

subsequent item. This is not a problem when all examinees

receive the same test. However, with adaptive testing, some

examinees could receive the second item without having the

advantage of the first item which contains the hint.

Certainly those who receive the first item have an unfair

advantage over those who do not. This is sometimes referred

to as a problem of cross-information.

In changing the unit of analysis from the item to the

testlet, the boundary effects are reduced. If a linear

testlet is used, only the first item in the testlet has an

unknown predecessor. Within the testlet, the developers

make certain that the problem of cross-information is

unlikely to occur.

Item Ordering

Traditionally, test items are ordered from easier to

harder items (referred to as power tests). In the power

test sequence, persons with lower ability levels are

encouraged by initial success and work harder on subsequent

items. However, in adaptive testing the item ordering

concept is different. Maximum efficiency in CAT is obtained

when the initial item is of medium difficulty and the

following items are chosen as a function of the person's

responses. Persons below the middle proficiency level do

not have the encouragement of early success.

Starting points are controlled when testlet methodology

is used. When the item ordering within a linear testlet is

determined by a human developer, the ordering effect

problems are lessened. Because examinees with similar

ability levels take almost identical tests with a testlet

model, item order effects are relatively constant within

levels, which localizes the effects. The scores of examinees

who are further apart on the ability continuum show less

confusion of relative ability estimates. (When the

hierarchical structure is used, however, the problem of

ordering still exists.)

Content Balancing

Test developers are traditionally careful to follow

formal content specifications in balancing content areas.

In addition, tests are scrutinized for informal content

imbalance; for example, word problems in a section could

refer to too many sports-related topics. In adaptive

testing, where all examinees do not have the same set of

items, it is more difficult to be certain that the content

is balanced, both formally and informally.

By using the testlet model, test developers can be more

certain that both formal and informal content specifications

have been followed. If a test violates the assumption of

unidimensionality because of the inclusion of separate

content areas, then a series of hierarchical testlets solves

the problem of multi-dimensionality.

Testlet models have other advantages over the variable

branching models more commonly used in adaptive testing.

Some of the advantages are discussed in the following

section.

Independence Assumption

Testlet models allow the conditional independence

assumption to remain unviolated. There is independence

between testlets, but not within testlets (Rosenbaum, 1988).

In a recent study, Dillon, Henzel, Klass, LaDuca, &

Peskin (1993) used patient case clusters in a medical

licensure program to determine whether there was higher

intercorrelation between case-related items than random sets

of items. The correlations between items in a case cluster

were significantly higher than correlations between random

sets of items or sets matched on perceived content but from

different cases. The researchers concluded that there was a

high level of dependence within case clusters.

Review of Items

In an item-based CAT, the efficiency of the adaptive

algorithm is compromised if examinees are allowed to return

to items previously answered and revise their responses.

However, if linear structured testlets are used as the basis

of the CAT, examinees can be allowed to review items within

the current testlet without compromising adaptive

efficiency. After the review, the next testlet is

adaptively chosen. This corresponds to a paper-and-pencil

test in which an examinee can review within one section, but

not after the next section is started (Wainer, 1993).

In short, testlets provide a middle ground between

traditional test theory and current adaptive testing

methodology. In traditional test theory, the unit of

analysis is the entire test (too big). In variable

branching adaptive testing, the unit of analysis is the

single item (too little). In testlet models, the unit of

analysis is a bundle of items (just right).

Other Testlet Issues

Testlet Reliability

Reliability of testlet-based tests is overestimated

when item-based methods are used to compute reliability.

Thissen, Steinberg and Mooney (1989) analyzed the responses

of 3,866 examinees on a reading comprehension test. The

test was made up of 22 items, divided unequally among 4

passages. Clearly, the item-based reliability estimates

(0.86 to 0.88) are much higher than the testlet-based

estimates (0.76 to 0.80). Each testlet-based reliability is

0.08-0.12 lower than the corresponding item-based estimate.

For the same example, the Spearman-Brown formula was

used to estimate how many testlets would need to be added to

make up the difference in reliability. It was found that

the test length had to be doubled to increase reliability to

a comparable testlet-based reliability of 0.87. Of course,

the testlet-based reliability is more appropriate and far

more accurate than item-based reliability. As more testlet-

based tests are used it will be important to use appropriate

statistical methods for estimating reliability (Sireci et

al., 1991).
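
The Spearman-Brown calculation referred to above can be illustrated with a brief sketch. The function names are hypothetical; the reliabilities are the testlet-based range of 0.76 to 0.80 and the target of 0.87 reported in the example.

```python
def spearman_brown(reliability, factor):
    """Predicted reliability when test length is multiplied by factor."""
    return factor * reliability / (1.0 + (factor - 1.0) * reliability)

def lengthening_factor(reliability, target):
    """Length multiplier needed to raise reliability to target."""
    return target * (1.0 - reliability) / (reliability * (1.0 - target))

for r in (0.76, 0.80):
    # Print the starting reliability, the factor needed to reach 0.87,
    # and the reliability predicted if the test length is doubled.
    print(r, round(lengthening_factor(r, 0.87), 2), round(spearman_brown(r, 2), 3))
```

For these starting values the required lengthening factor is roughly 2.1 and 1.7 respectively, which is consistent with the statement that the test length had to be approximately doubled.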

Polytomous Models

Previous examples have used dichotomous models, where

each test item is scored either right or wrong. But in

current item-analysis research, several polytomous models

are being tested.

A comparison of some dichotomous models and a

polytomous testlet model was made in the context of

investigating testlet reliabilities (Crehan, Sireci,

Haladyna & Henderson, 1993). One finding of the study

reinforced previous research in that testlet reliabilities

were lower than single-item reliabilities. But the most

significant finding involved comparison of polytomous to

dichotomous test information functions. The polytomous

model resulted in providing much more information at the

lower end of the proficiency scale. This is particularly

significant for cut-score certification decisions, where

many decisions are made at the precision level which showed

the most differentiation.

Differential Item Functioning

One of the most prominent current issues in

psychometrics is test fairness. Test developers are asked

to assure that test items function equally for all examinees

of the same proficiency level, regardless of group

membership. The phrase differential item functioning, or

DIF, refers to the study of that functionality.

If each test item in a test had exactly the same item response function in every group, then people at any given level θ of ability or skill would have exactly the same chance of getting the item right, regardless of their group membership. Such a test would be completely unbiased. This remains true even though some groups may have a lower mean θ, and thus lower test scores, than another group. In such a case, the test results would be reflecting an actual group difference and not item bias. (Lord, 1980, p. 212)

Originally called item bias, the more neutral term of

differential item functioning better describes the concept.

If an item performs differently for two groups, it does not

necessarily mean it is showing prejudice against one of the

groups, a connotation which often arises from the term bias.

It may simply mean that different traits are being measured.

DIF focuses upon statistical properties of a set of test

responses, with the idea of having a unidimensional test, or

that each item measures the same trait or ability for all

examinees.

DIF is a relative term. An item may perform differently for one group of examinees relative to the way it performs for another group of examinees. The examinee group of interest is the FOCAL group, and the group to which its performance on the item is being compared is the REFERENCE group. In general, there will be several FOCAL/REFERENCE pairs of groups for which DIF analysis can be made. (Holland & Wainer, 1993, p. xiv)

If the difference in performance on an item is measured

between unmatched groups, the result is not DIF, but instead

is a measure of impact. Impact has been defined as "the

difference between the focal group and the reference group

of the probability of getting the studied item correct"

(Wainer, 1993, p. 134).

Ordinarily, all the items on a test are examined for

DIF, one at a time, with the current item of interest being

called the studied item. One of the basic underlying

concepts of DIF methodologies is that examinees of equal

ability are being compared on responses to an item. The

criterion used to match individuals between groups is usually

the total test score, which is assumed to be the most

accurate measure of the trait or ability being measured by

the item being studied.

Differential Item Functioning Methods

Many statistically rigorous and efficient procedures

have been developed and used in the past few years for the

detection of DIF. In this section, some of the more common

methods for detection of single-item DIF for dichotomously

(right/wrong) scored items are summarized. Each method

falls into one of two categories: those based upon observed

score and those based upon latent trait.

Observed Score Methods

One of the first observed score methods, developed by

Cleary and Hilton in 1968, was the analysis of variance

(ANOVA) procedure, which uses interaction between item and

group to flag the presence of DIF. Although the ANOVA

method is easy to use and understand, it requires large

sample sizes and may be inaccurate when groups differ in

achievement level (Camilli & Shepard, 1987).

In the 1970s, a method called delta-plot or transformed

item-difficulty (TID) was developed by Angoff (1972). The

delta-plot method describes a set of items as unbiased if

the item difficulty values (p-values) for each group are

perfectly correlated. (P-values are defined as the

percentage of examinees answering an item correctly.) Using

classical test theory, the p-values for each group are

calculated and transformed into deltas, using a mean of 13

and a standard deviation of 4. A 45-degree ellipse is

fitted to the bivariate graph of the pairs of deltas (one

point for each item), revealing DIF items as outliers. The

distance between an item and the major axis of the ellipse

indicates the amount of DIF. Although delta-plot analyses

are simple and inexpensive, highly discriminating items may

be flagged falsely as DIF items (Angoff, 1982).
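
The delta-plot computation described above can be sketched as follows. This is an illustration under assumptions rather than a reproduction of any published program: the delta is taken as 13 minus 4 times the normal deviate of the proportion correct (so that harder items receive larger deltas), the major axis is taken as the principal axis of the scatter of delta pairs, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, pstdev

def delta(p):
    """Transform a proportion correct into a delta (mean 13, SD 4)."""
    return 13.0 - 4.0 * NormalDist().inv_cdf(p)

def delta_plot_distances(p_ref, p_focal):
    """Signed perpendicular distance of each item from the major axis."""
    x = [delta(p) for p in p_ref]
    y = [delta(p) for p in p_focal]
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)
    diff = sy ** 2 - sx ** 2
    # Slope and intercept of the principal (major) axis of the point cloud.
    slope = (diff + math.sqrt(diff ** 2 + 4 * (r * sx * sy) ** 2)) / (2 * r * sx * sy)
    intercept = my - slope * mx
    return [(slope * a - b + intercept) / math.sqrt(slope ** 2 + 1)
            for a, b in zip(x, y)]
```

Items whose distance is large in absolute value are the outliers that would be examined for DIF. With this sign convention, a negative distance marks an item that is harder for the focal group than the axis predicts. The sketch assumes a nonzero correlation between the two sets of deltas.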

A currently popular observed-score method which is

statistically powerful, easily understood, and inexpensive

to compute is the Mantel-Haenszel (MH) procedure. This

procedure was originally used in a study of disease by

Mantel and Haenszel (1959) and applied to DIF analysis by

Holland and Thayer (1988).

The MH procedure divides the examinees into several

intervals which are normally based upon the total test

score. The focal and reference groups are considered to be

matched on the ability most relevant to the ability measured

by the studied item. For each interval, a 2 x 2

contingency table is formed which shows frequencies of

correct and incorrect responses for the focal and reference

groups. The ratio of the odds that the reference group

answered correctly to the odds that the focal group answered

correctly is calculated for each interval. The procedure

then estimates a common odds ratio across all matched

categories. The MH statistic (with a continuity correction)

is distributed approximately as a chi-square statistic with

one degree of freedom (Dorans & Holland, 1993).
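
A minimal sketch of the computation just described is given below. It assumes that each score interval has already been reduced to a 2 x 2 table of counts; the function name and the tuple layout are hypothetical, and intervals with fewer than two examinees are skipped.

```python
def mantel_haenszel(tables):
    """tables: one (A, B, C, D) tuple per score interval, where
    A, B = reference group correct, incorrect and
    C, D = focal group correct, incorrect."""
    num = den = sum_a = sum_e = sum_var = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t < 2:
            continue                                   # interval carries no information
        num += a * d / t
        den += b * c / t
        n_ref, n_foc = a + b, c + d
        m_right, m_wrong = a + c, b + d
        sum_a += a
        sum_e += n_ref * m_right / t                   # expected A under no DIF
        sum_var += n_ref * n_foc * m_right * m_wrong / (t * t * (t - 1))
    odds_ratio = num / den                             # common odds ratio estimate
    chi_square = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_var   # continuity corrected
    return odds_ratio, chi_square
```

An odds ratio near 1 indicates that matched reference and focal examinees have similar odds of answering the studied item correctly. The chi-square value is referred to the chi-square distribution with one degree of freedom.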

The standardization method (Dorans & Kulik, 1986) is

similar to the Mantel-Haenszel procedure. However,

standardization uses differences between the p-values of the

groups at each interval, and it applies weights to the p-

value differences for each interval (Angoff, 1993).

Although large sample sizes are required, standardization is

a "flexible, easily understood descriptive procedure that is

particularly suited for assessing plausible and implausible

explanations of DIF," according to Dorans and Holland (1993,

p. 38).
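
The weighted difference in p-values can be sketched as follows. The choice of weights is an assumption made for illustration: the number of focal group examinees at each interval is used, which is a common choice but is not specified in the description above. The function name and tuple layout are hypothetical.

```python
def std_p_dif(levels):
    """levels: one (n_ref, right_ref, n_focal, right_focal) tuple per
    matched score interval. Returns the weighted mean of the
    focal-minus-reference differences in proportion correct."""
    weighted = total_weight = 0.0
    for n_ref, right_ref, n_foc, right_foc in levels:
        if n_ref == 0 or n_foc == 0:
            continue                                   # no comparison possible here
        weight = n_foc                                 # assumed weight: focal group count
        weighted += weight * (right_foc / n_foc - right_ref / n_ref)
        total_weight += weight
    return weighted / total_weight
```

A negative value indicates that, across matched intervals, the focal group answered the studied item correctly less often than the reference group.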

The MH and standardization approaches are both based

upon earlier chi-square procedures offered by

Scheuneman (1979), and modified by Marascuilo and Slaughter

(1981), and by Shepard and Camilli (1981). The major

improvement of MH and standardization over previous chi-

square procedures was in providing a measure of the amount

of DIF.

Mellenberg (1982) described a chi-square procedure

using loglinear and logit models for contingency tables.

Unlike the Mantel-Haenszel and other chi-square methods, the

loglinear/logit procedure is able to make a distinction

between uniform and non-uniform DIF, which are discussed

later.

In 1990, Swaminathan and Rogers suggested the

application of logistic regression analysis for the

detection of DIF. The probability of a correct response is

given by a logistic formula which uses a regression equation

as the exponent of e. The coefficients in the regression

equation, estimated with the maximum likelihood method, are

used as indicators of DIF. Like Mellenberg's (1982) method,

the logistic regression procedure is able to detect non-

uniform DIF. As a further improvement, it does not break

the continuous ability parameter into intervals. Treating

ability as continuous rather than categorical results in a

more powerful method.
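
The logistic formula can be illustrated with a brief sketch. The regression equation is assumed here to contain an intercept, the matching score, a group indicator, and a score-by-group interaction; the coefficient names `b0` through `b3` and the function names are hypothetical, and no estimation step is shown.

```python
import math

def p_correct(score, group, b0, b1, b2, b3):
    """Probability of a correct response; group is coded 0 (reference) or 1 (focal)."""
    z = b0 + b1 * score + b2 * group + b3 * score * group
    return math.exp(z) / (1.0 + math.exp(z))

def log_odds_gap(score, b0, b1, b2, b3):
    """Focal-minus-reference difference in log-odds at a given matching score."""
    def logit(p):
        return math.log(p / (1.0 - p))
    return (logit(p_correct(score, 1, b0, b1, b2, b3))
            - logit(p_correct(score, 0, b0, b1, b2, b3)))
```

When the interaction coefficient `b3` is zero and the group coefficient `b2` is not, the gap between groups is the same at every score, which corresponds to uniform DIF. When `b3` is nonzero, the gap changes with the score, which corresponds to non-uniform DIF.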

Latent Trait Methods

The item response theory (IRT) model is the foundation

for latent trait methods. The three-parameter logistic

(3-PL) model is comprehensive, addressing the differences

between groups with respect to item difficulty,

discrimination, and guessing. Other methods ignore DIF with

respect to guessing and discrimination, making the three-

parameter IRT method the theoretically preferred method,

over observed score methods and other latent trait methods.

Its use is inhibited, however, because of the requirements

for large sample sizes, special computer programs, and

costly run times, as well as the complexity of conceptual

understanding and the difficulty in meeting assumptions.

Item response theory is modeled by an s-shaped item

characteristic curve (ICC), where the abscissa represents

the latent ability continuum and the ordinate shows the

probability of answering the item correctly. Each of the

three parameters is represented visually. The point of

inflection lies directly above the ability level equal to

item difficulty b; the slope at the point of inflection is

proportional to discrimination a; and the lower asymptote

represents the probability c that an examinee with no

ability will correctly guess the answer.
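The three visual features of the curve can be checked numerically with a minimal sketch of a three-parameter logistic ICC. The function name and the example parameter values are hypothetical, and no scaling constant (such as 1.7) is applied, which is an assumption of this sketch.

```python
import math


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under a three-parameter logistic model.

    a = discrimination, b = difficulty, c = lower asymptote (guessing).
    No scaling constant is applied; that choice is an assumption here.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: discrimination 1.2, difficulty 0.5, guessing 0.2.
a, b, c = 1.2, 0.5, 0.2
at_difficulty = icc_3pl(b, a, b, c)   # halfway between c and 1 at theta = b
far_below = icc_3pl(-40.0, a, b, c)   # approaches the lower asymptote c
print(at_difficulty, far_below)
```

In this form the slope at the point of inflection is a(1 - c)/4, so for a fixed guessing parameter it is proportional to the discrimination a.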

If the ICCs for the two groups differ, it is assumed

that the item contains DIF. The most common measures of the

magnitude of DIF used with the three-parameter method are

the area between the curves and the tests of equality of the

three parameters across the groups.
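As an illustration of the area measure, the following sketch approximates the unsigned area between a reference group ICC and a focal group ICC with the trapezoid rule. The integration limits, the step count, and the parameter values are assumptions chosen for the example; they are not taken from any study cited here.

```python
import math


def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC (no scaling constant, by assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def unsigned_area(ref, foc, lo=-4.0, hi=4.0, steps=8000):
    """Trapezoid-rule area between two ICCs over [lo, hi].

    `ref` and `foc` are (a, b, c) tuples for the two groups.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        gap = abs(icc_3pl(theta, *ref) - icc_3pl(theta, *foc))
        total += gap * (0.5 if i in (0, steps) else 1.0)
    return total * h


# Identical curves give zero area; a difficulty shift of 0.5 does not.
same = unsigned_area((1.0, 0.0, 0.2), (1.0, 0.0, 0.2))
shifted = unsigned_area((1.0, 0.0, 0.2), (1.0, 0.5, 0.2))
print(same, round(shifted, 3))
```

Because the limits are finite, the value for the shifted item falls slightly below the area over the whole ability continuum.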

Another latent trait model, the Rasch model (similar to

the one-parameter IRT model) considers only the difficulty

parameter. The discrimination parameter is set to a

constant, implying that all items discriminate equally. The

Rasch model is less complex, less expensive to run, and does

not require large sample sizes.

Uniform and Non-Uniform DIF

A major goal of modern test theory is

unidimensionality, or having all items on the test measure

only one trait. It has been hypothesized that DIF occurs

when the item is measuring one or more secondary traits for

one of the groups. An example is an item on a mathematics

test that inadvertently measures an ancillary trait of a

verbal nature (Mellenberg, 1982; Swaminathan & Rogers,

1990).

When two groups differ consistently on the primary and

secondary traits, uniform DIF occurs. However, if the

abilities are inconsistent between the two traits, non-

uniform DIF is present. As Mellenberg (1982) described the

distinction, uniform DIF "means that the group difference in

the second trait is constant across the main trait.

Nonuniform (DIF) implies that the group difference in the

additional ability depends on the main ability" (p. 115).

In IRT terms, non-uniform DIF is evident when the trace

lines (or ICCs) are not parallel. A further distinction can

be made between ordinal and disordinal non-uniform DIF. The

ICCs cross in the middle of the curves in the disordinal

case. If, however, the lines cross at the lower or higher

end of the ability continuum, or even past either end, then

ordinal non-uniform DIF is indicated.
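The location of the crossing point can be made explicit under a simplifying assumption introduced here for illustration: the two groups share the same lower asymptote, so the curves cross where their logits are equal. With discrimination and difficulty parameters $a_R, b_R$ for the reference group and $a_F, b_F$ for the focal group (notation assumed for this sketch), and $a_R \neq a_F$:

```latex
a_R(\theta^{*} - b_R) = a_F(\theta^{*} - b_F)
\quad\Longrightarrow\quad
\theta^{*} = \frac{a_R b_R - a_F b_F}{a_R - a_F}
```

A value of $\theta^{*}$ near the middle of the ability range corresponds to the disordinal case, whereas a value near or beyond either end of the range corresponds to the ordinal case.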

A major disadvantage of the MH procedure compared to

the logistic-based chi-square procedures is that MH is less

powerful in the detection of non-uniform DIF. However, the

distinction is lessened in the case of ordinal non-uniform

DIF (Swaminathan & Rogers, 1990).

Differential Testlet Functioning

As more tests are developed with the testlet structure,

it becomes increasingly important to investigate

differential functioning using the testlet rather than the

item as the unit of analysis. An essential component in

testlet-based computerized adaptive testing is a testlet

pool containing DIF-free testlets. Testlets must be

screened for differential functioning before being approved

for the testlet pool.

Having only DIF-free items in a testlet does not

necessarily mean that the testlet as a whole is free of

differential functioning, called differential testlet

functioning or DTLF. Only one research study regarding DTLF

has been reported.

The DTLF Study

Attempts to define differential testlet functioning and

to derive a statistical method to detect it were made by

Wainer, Sireci and Thissen in 1991. The nominal response

model developed by Bock (1972) was used by the researchers.

First, the model was fit to both the reference and focal

populations assuming there was no DTLF. Then the same model

was fit allowing DTLF to exist. If there was not a

significant difference between the two models, it was

assumed that no DTLF existed.

Wainer et al. (1991) discussed three advantages of

analyzing testlets for DTLF over simply using individual

item analysis methods: (a) The model for analysis matches

the manner in which the test is constructed. If items are

to be administered as a unit, then the items should be

analyzed that way. (b) Consideration of an aggregate

measure of DIF in testlet-structured tests allows small

amounts of item DIF to cancel within the testlet. It should

be emphasized that cancel out means something quite

specific. It means that there will be no DIF at every score

level within the testlet. (c) Applying DIF analysis at the

testlet level may uncover some DIF that was not evident at

the item level. "The increased statistical power of dealing

with DIF at the testlet level provides us with another tool

to ensure fairness" (Wainer et al., 1991, p. 199).

Polytomous Response DIF

In the past few decades, most DIF research has been

focused upon dichotomous item scoring, where the item is

simply marked right or wrong. Recently, however, more

studies involving polytomous scoring of items have appeared,

particularly because of a new emphasis on performance

assessments. In polytomous scoring, an item is given a

number-correct score or is classified as one of several

unordered choices rather than a right/wrong score. Miller

and Spray (1993, p. 107) defined polytomous responses as

"item responses which are scored on a nominal or ordinal

scale and which consist of more than two categories."

Statistical methods offered for polytomous DIF analysis

range from entirely new methods to modifications of

dichotomous DIF procedures.

The polytomous model used by Wainer et al. (1991) in

the DTLF study, called Bock's nominal model, was developed

for scoring nominal level categorical responses by Bock

(1972). The testlet score was simply the number of items

correct in the testlet, ranging from zero to the maximum

number of items. Wainer et al. expressed an interest in

expanding the information provided in the testlet score by

using possible patterns of responses instead of number-

correct, but the raw score was chosen as a simpler starting

point, because the area of DTLF research is just beginning

to develop (Thissen et al., 1989).

In an effort to limit the scope of this study, only

observed score models were considered for comparison. A

discussion of multiple-category latent trait models was

offered by Thissen and Steinberg (1986), who organized the

methods into a proposed taxonomy.

Polytomous Observed-Score Procedures

Recently developed polytomous DIF models, which have

been applied to performance assessment DIF analysis, can be

applied directly to DTLF research, as in the Wainer et al.

(1991) study with Bock's (1972) model. The testlet score

can be used in place of the performance assessment score.

Mantel and GCMH

Two polytomous extensions of the dichotomous MH method

were explored for analysis of DIF in performance assessments

by Zwick et al. (1993). First, the Mantel Score Test of

Conditional Independence proposed by Mantel (1963) takes

into account the ordering of responses. The accompanying

statistic has an approximate chi-squared distribution with

one degree of freedom. Second, the Generalized Cochran-

Mantel-Haenszel statistic (Mantel & Haenszel, 1959; Somes,

1986), termed CGMH, is a multivariate generalization of the

dichotomous MH. The CGMH considers responses at the nominal

level only (Agresti, 1990, pp. 234-235, 283-284). Again,

the MH-based procedures do not have the power to detect non-

uniform DIF, and both require that the ability parameter be

treated as categorical and unordered.

Logistic Regression

Three polytomous adaptations of the logistic regression

procedure for dichotomous items have been proposed by

Agresti (1990). Each treats response categories as ordered,

as in the Mantel Score Test, rather than nominal, as in the

Bock and CGMH models. The adaptations are complex and

require a separate model estimation for each ordered

category (minus one), which makes interpretation of results

difficult. However, two characteristics which make logistic

regression a powerful choice are ability to detect non-

uniform DTLF and treating ability as continuous rather than

categorical (Agresti, 1990; Hosmer & Lemeshow, 1989; Miller

& Spray, 1993).

Logistic Discriminant Function Analysis

A proposed polytomous method which has the powerful

advantages of logistic regression without the complexity is

logistic discriminant function analysis. Again, the ability

variable is treated as continuous, and non-uniform

differential functioning is detectable. The difference is

in the choice of the dichotomous dependent variable.

Whereas logistic regression uses item response (dichotomous)

as the dependent variable, logistic discriminant analysis

uses group (dichotomous). The discriminant procedure

requires only one regression equation per testlet, because

the testlet response is an independent variable. Other

independent variables are ability score (the matching

variable) and ability-by-response interaction. The

inclusion of an interaction term is used to flag non-uniform

DTLF. The discriminant procedure is very flexible and

allows other independent variables, such as an external

matching variable (Miller & Spray, 1993). Because it is a

logistic procedure, the assumptions of multivariate

normality and equal variance-covariance matrices are not

required, as they are in linear discriminant analysis

(Norusis, 1990).

T-Test

Another set of polytomous methods is the combined

t-test statistics (called HW1 and HW3), proposed by Welch and

Hoover (1993). As in the MH-based procedures, an ability

score is divided into categories. Based upon an assumption

of homogeneity of variances of scores at each ability level,

the HW1 statistic tests the difference between the means of

the focal and reference groups summed across the levels.

The HW3 procedure differs in use of a weighting procedure to

balance unequal sample sizes at each ability level. When,

in a simulated study, the Mantel Score Test was used for

comparison, the HW3 statistic appeared to control Type I

errors as well as the Mantel, but demonstrated more power in

identifying DIF items.

Thick and Thin Matching

In the MH and other chi-square procedures, there is

some question about using each possible score to stratify

the ability continuum, called thin matching. (For example,

if the possible scores range from 0 to 30, then there are 31

ability levels.) In MH procedures some data can be wasted

by overly fine matching, because any row or column with a

0 frequency cell is eliminated in calculations.

Creating fewer levels by pooling test scores, termed

thick matching, was shown to be a more accurate predictor of

DIF when the MH chi-square statistic was used, in a recent

study (Donoghue & Allen, 1993). The results may be

generalized to the Mantel chi-square statistic, because a

dichotomous use of the Mantel reduces to the MH statistic

without the continuity correction (Zwick et al., 1993).

In the Donoghue and Allen simulation study the

researchers compared different degrees of pooling, or

thickness of matching. One method, the total percentage

matching strategy, was the most effective for the MH chi-

square procedure. In total percentage matching, similar

numbers of examinees are allocated to each pooled level of

the total test score. For example, for five levels, score

intervals are combined to approximate quintiles of the

combined sample or the focal group sample.
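The pooling idea can be sketched as follows. This is only one plausible reading of total percentage matching: adjacent score values are pooled until each stratum holds roughly an equal share of the combined sample. The function name and the cut-point rule are assumptions of this sketch and are not taken from Donoghue and Allen (1993).

```python
from collections import Counter


def total_percentage_strata(scores, levels=5):
    """Map each observed score to a pooled stratum of roughly equal size.

    `scores` are the matching-variable scores of the combined sample.
    Adjacent score values are pooled until the running count reaches the
    next multiple of n / levels (a hypothetical cut-point rule).
    """
    counts = Counter(scores)
    n = len(scores)
    mapping, cumulative, stratum = {}, 0, 0
    for score in sorted(counts):
        mapping[score] = stratum
        cumulative += counts[score]
        # Move on once this stratum has reached its share of examinees.
        while stratum < levels - 1 and cumulative >= (stratum + 1) * n / levels:
            stratum += 1
    return mapping


# Hypothetical sample: scores 0-9, ten examinees at each score.
scores = [s for s in range(10) for _ in range(10)]
mapping = total_percentage_strata(scores, levels=5)
print(mapping)
```

With real data the strata would rarely be exactly equal in size, because all examinees with the same score must fall into the same stratum.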

CHAPTER 3

METHOD OF RESEARCH

The design of this study called for variation of five

major factors, as shown in Table 1. The first variable

factor was the DTLF detection model, with two polytomous

models chosen. Next, the issue of uniform and non-uniform

DTLF was explored by comparing models within the logistic

discriminant procedure. Another factor was the strategy for

grouping items to form testlets. Also, single items within

each testlet were analyzed for DIF and the results were

compared with overall testlet DTLF. Finally, three sample

sizes were compared. The chi-square statistics and

associated probabilities were reported for each possible

combination of the factors.

Table 1

Factors Varied in the Study

Factors                               Number   Variations

DTLF detection model                  2        Mantel score test model
                                               Logistic discriminant analysis model

DTLF/DIF consistency between groups   2        Uniform
                                               Non-Uniform

Class of testlets                     3        Explicit: Common passage
                                               Implicit: Common content
                                               Implicit: Common process

Unit of analysis                      2        Single items within a testlet
                                               Testlet

Sample size                           3        500
                                               1000
                                               2000

Mantel Score Test Procedure

A polytomous extension of the dichotomous Mantel-

Haenszel (MH) procedure (Mantel & Haenszel, 1959) was

proposed in 1963 by Mantel. The test of association between

groups matched on a conditioning variable was developed for

the case of ordinal categories. In the case of testlets,

each testlet on the test is individually scrutinized for

DTLF using the Mantel score statistic. The testlet being

investigated is termed the studied testlet.

As in the other chi-square based procedures, the

combined sample is divided into stratifications based upon a

conditioning variable assumed to represent overall ability

in the trait being measured. Then the focal and reference

groups are considered to be matched on ability. Most often

the conditioning variable is the total test score, although

an external criterion is sometimes used. Typically, the

sample is broken into ability groups based upon scores on

the conditioning variable, with the number of strata being

the number of possible scores. Of course, there are other

stratification options, as explained in the literature

review section in Chapter 2 (Donoghue & Allen, 1993).

The score for each testlet is the number of items

answered correctly. If the testlet has g items, then there

are (g + 1) possible response scores, allowing for the

possibility of no answers correct. An index (in this

example J) which ranges from 1 to (g + 1) is used, so that

the zero category does not have zero weight. Weighting by

score index is used to account for ordinal scores in the

Mantel procedure.

Frequencies on a studied testlet are organized into a

2 x J x K contingency table, where J represents ordered

response categories, and K is the stratification (ability)

level. The 2 x J portion for the kth stratification level

is illustrated in Table 2. Cell frequencies (n for number

of subjects) for a subset of examinees who are considered to

be matched on the overall ability of interest are also shown

in Table 2. The plus sign (+) indicates summation across a

row or column. There is a 2 x J table at each of the K

ability levels for the studied testlet.

Ordering of response categories is taken into account

by assigning weights to the focal group frequency in each

category, according to an ordered index for that category.

Table 2

Frequencies: kth Level of Ability Variable

                  Ordered Index to Testlet Score

Group        y_1      y_2      y_3      . . .   y_J      Total

Focal        n_F1k    n_F2k    n_F3k    . . .   n_FJk    n_F+k
Reference    n_R1k    n_R2k    n_R3k    . . .   n_RJk    n_R+k
Total        n_+1k    n_+2k    n_+3k    . . .   n_+Jk    n_++k

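The summary chi-square proposed by Mantel (1963), with one

degree of freedom, is

\[ \text{Mantel } \chi^2 = \frac{\left[ \sum_k F_k - \sum_k E(F_k) \right]^2}{\sum_k \mathrm{Var}(F_k)} , \tag{1} \]

where $F_k$ is the weighted focal group frequency, defined as

\[ F_k = \sum_j y_j \, n_{Fjk} , \tag{2} \]

where $y_j$ is the ordered index to the testlet score. The

expectation of $F_k$ is

\[ E(F_k) = \frac{n_{F+k}}{n_{++k}} \sum_j y_j \, n_{+jk} , \tag{3} \]

and the variance of $F_k$ is

\[ \mathrm{Var}(F_k) = \frac{n_{R+k} \, n_{F+k}}{n_{++k}^2 \, (n_{++k} - 1)} \left[ \left( n_{++k} \sum_j y_j^2 \, n_{+jk} \right) - \left( \sum_j y_j \, n_{+jk} \right)^2 \right] . \tag{4} \]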

The Mantel statistic follows a chi-square distribution

with one degree of freedom. The null hypothesis states that

the ratio of the odds of answering the item correctly for

the reference group to that of the focal group is one. A

rejection of the H0 suggests that the focal and reference

groups differ in performance on the studied item even when

matched on ability. In other words, a large chi-square with

a small probability flags a testlet as potentially

containing DTLF (Agresti, 1990; Mantel, 1963; Welch &

Hoover, 1993; Zwick, Donoghue, & Grima, 1993).
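The calculation in Equations 1 through 4 can be sketched in Python as follows. The function name and the input layout are hypothetical, and skipping strata in which one group is empty is an assumption of this sketch. The study itself used the FREQ procedure in SAS, not this code.

```python
def mantel_chi_square(strata, scores):
    """Mantel (1963) score-test chi-square (1 df) for a 2 x J x K table.

    `strata` is a list of (focal_counts, reference_counts) pairs, one per
    ability level k; each element is a list of J cell frequencies.
    `scores` holds the ordered index values y_1..y_J.
    """
    sum_f = sum_e = sum_var = 0.0
    for focal, reference in strata:
        n_f, n_r = sum(focal), sum(reference)
        n = n_f + n_r
        if n < 2 or n_f == 0 or n_r == 0:
            continue  # stratum carries no information about group differences
        col = [f + r for f, r in zip(focal, reference)]      # n_{+jk}
        s1 = sum(y * c for y, c in zip(scores, col))          # sum of y_j n_{+jk}
        s2 = sum(y * y * c for y, c in zip(scores, col))      # sum of y_j^2 n_{+jk}
        sum_f += sum(y * f for y, f in zip(scores, focal))    # F_k (Equation 2)
        sum_e += n_f / n * s1                                 # E(F_k) (Equation 3)
        sum_var += n_r * n_f / (n * n * (n - 1)) * (n * s2 - s1 * s1)  # Equation 4
    if sum_var == 0:
        return 0.0
    return (sum_f - sum_e) ** 2 / sum_var                     # Equation 1


# Hypothetical single stratum with two score categories.
chi_sq = mantel_chi_square([([10, 20], [20, 10])], scores=[1, 2])
no_difference = mantel_chi_square([([5, 5, 5], [5, 5, 5])], scores=[1, 2, 3])
print(round(chi_sq, 4), no_difference)
```

With only two score categories, the first example reduces to a 2 x 2 table, for which the statistic equals (N - 1)(ad - bc)^2 divided by the product of the four marginal totals.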

Logistic Discriminant Function Analysis Procedure

The Mantel procedure assumes no three-factor

interaction and therefore is not powerful in the detection

of non-uniform differential functioning (Swaminathan &

Rogers, 1990; Welch & Hoover, 1993). Logistic discriminant

function analysis was used to detect uniform and non-uniform

DIF in a recent empirical study by Miller and Spray (1993)

which included one polytomously scored section.

In the discriminant procedure, the probability of group

membership G (focal and reference) is modeled as a function

of two explanatory variables: ability score X, and testlet

response score U.

The full logistic discriminant model can be written as

\[ \mathrm{Prob}(G \mid X, U) = \frac{e^{Z}}{1 + e^{Z}} , \tag{5} \]

where Z is the linear combination

\[ Z = \beta_0 + \beta_1 X + \beta_2 U + \beta_3 (X * U) . \tag{6} \]

The $\beta$'s are coefficients estimated from the data, and X * U

represents the interaction between the ability score X and

the testlet score U.

To assess the fit of the model, a likelihood statistic

$G^2$ is calculated. Norusis provided a clear description of

the model fit statistic.

The probability of the observed results given the parameter estimates is known as the likelihood. Since the likelihood is a small number less than 1, it is customary to use -2 times the log of the likelihood (-2LL) as a measure of how well the estimated model fits the data. A good model is one that results in a high likelihood of the observed results.

To test the null hypothesis that the observed likelihood does not differ from 1 (the value of the likelihood for a model that fits perfectly), you can use the value of -2LL. Under the null hypothesis that the model fits perfectly, -2LL has a chi-square distribution with N - p degrees of freedom, where N is the number of cases and p is the number of parameters estimated. (Norusis, 1990, p. 52)

Testing for non-uniform DTLF involves first fitting the

full model which combines Equations 5 and 6. Then the model

is reduced by deleting the interaction term, and Equation 6

is replaced by

\[ Z = \beta_0 + \beta_1 X + \beta_2 U . \tag{7} \]

The significance of $\beta_3$ is tested by calculating the

difference of fit between the full and reduced models.

Next, the test for uniform DTLF involves fitting the

null model, which contains only the ability score X. In the

null model, Z becomes

\[ Z = \beta_0 + \beta_1 X . \tag{8} \]

If there is a significant difference of fit between the null

and reduced models, the testlet is flagged as potentially

containing uniform DTLF.

For each model the $G^2$ statistic (-2LL) is calculated.

The difference in $G^2$ values between a pair of models (symbolized

as $G^2_{\text{diff}}$) tests the null hypothesis that the coefficient

deleted at the last step is zero. The $G^2_{\text{diff}}$ statistic is

distributed as chi-square with one degree of freedom, and is

comparable to the F-change test in multiple regression

(Miller & Spray, 1993; Norusis, 1990).
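The sequence of nested model comparisons can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the data are simulated, the group coding is arbitrary, and the models in Equations 6 through 8 are fit by a plain Newton-Raphson routine written here rather than by the SPSS procedure used in the study.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def neg2_log_likelihood(rows, group, iterations=30):
    """Fit a logistic model by Newton-Raphson and return G^2 = -2LL."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iterations):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, g in zip(rows, group):
            z = sum(bj * xj for bj, xj in zip(beta, x))
            mu = 1.0 / (1.0 + math.exp(-z))
            for i in range(p):
                grad[i] += (g - mu) * x[i]
                for j in range(p):
                    hess[i][j] += mu * (1.0 - mu) * x[i] * x[j]
        beta = [bj + sj for bj, sj in zip(beta, solve(hess, grad))]
    log_lik = 0.0
    for x, g in zip(rows, group):
        z = sum(bj * xj for bj, xj in zip(beta, x))
        log_lik += g * z - math.log(1.0 + math.exp(z))
    return -2.0 * log_lik


# Simulated data (hypothetical): 600 examinees, a 4-item testlet.
random.seed(1)
group, x_raw, u_raw = [], [], []
for i in range(600):
    g = i % 2                              # 1 = focal, 0 = reference (arbitrary coding)
    x = random.randint(0, 20)              # ability (matching) score X
    p_item = 0.2 + 0.03 * x - 0.15 * g     # focal group disadvantaged at every X
    u = sum(random.random() < p_item for _ in range(4))  # testlet score U, 0..4
    group.append(g)
    x_raw.append(x)
    u_raw.append(u)

# Center X and U before forming the interaction term.
mean_x = sum(x_raw) / len(x_raw)
mean_u = sum(u_raw) / len(u_raw)
xc = [x - mean_x for x in x_raw]
uc = [u - mean_u for u in u_raw]

g2_null = neg2_log_likelihood([[1.0, a] for a in xc], group)                            # Equation 8
g2_reduced = neg2_log_likelihood([[1.0, a, b] for a, b in zip(xc, uc)], group)          # Equation 7
g2_full = neg2_log_likelihood([[1.0, a, b, a * b] for a, b in zip(xc, uc)], group)      # Equation 6

g2_diff_uniform = g2_null - g2_reduced       # tests the coefficient of U
g2_diff_nonuniform = g2_reduced - g2_full    # tests the coefficient of X * U
print(round(g2_diff_uniform, 3), round(g2_diff_nonuniform, 3))
```

Each difference would then be referred to a chi-square distribution with one degree of freedom. Because the models are nested, neither difference can be negative apart from rounding error.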

NELS:88 Data Set

Demographic and test data for this study were obtained

from the National Education Longitudinal Study of 1988

(NELS:88), an existing publicly accessible data set

sponsored by the National Center for Education Statistics

(NCES). Four study components constitute the base year

design: surveys and tests of students, and surveys of

parents, school administrators, and teachers.

A two-stage stratified probability design was used to

select a nationally representative sample of schools and

students for the NELS:88 data set. The base year sample is

composed of approximately 24,600 eighth graders who were

sampled from 1,052 schools throughout the United States.

The NCES long-range plan is to monitor the transition of the

students through high school and then to college or

employment.

The NELS:88 User's Manual describes confidentiality

safeguards:

The NELS:88 base year data is released in accordance with the provisions of the General Education Provision Act (GEPA) and the Carl D. Perkins Vocational Education Act. The GEPA assures privacy by ensuring that respondents will never be individually identified.

To ensure that the confidentiality provisions contained in PL 100-297 have been fully implemented, procedures commonly applied for disclosure avoidance in other government-sponsoring surveys were used in preparing the data tape associated with this manual. These include suppressing, abridging, and recoding identifiable variables. Every effort has been made to provide the maximum research information that is consistent with reasonable confidentiality protections. (NCES, 1990, p. iv)

The NELS:88 data include student responses to a battery

of tests in four subject matter areas: reading,

mathematics, science, and social studies. The tests include

21, 40, 25, and 30 items, respectively.

The NELS:88 test contains eight common passage or

explicit testlets. Alternatively, implicit testlets may be

formed by grouping items according to common content area or

common process. Test specification charts in the NCES

(1991) report list content areas for each item in all four

sections, and process areas for items in three of the

sections. For example, Item 28 in the mathematics section

is part of the arithmetic content area and the problem

solving process area.

The number of testlets in each testlet classification

and in each of the sections is shown in Table 3. In the

reading section, the defined content areas are the same as

the common passage testlets. The science section contains

no explicit testlets, and no process areas are defined for

the social studies section.

Table 3

NELS:88 Testlet Types in Each Section

                      Testlet Type

Section        Common     Common     Common
               passage    content    process

Reading        5          —          3
Mathematics    2          5          3
Science        —          4          3
Soc. stud.     1          3          —

Total          8          12         9

The number of items in each of the testlets is shown

in Table 4. Generally, the common passage testlets have

fewer items than the common content or common process

testlets.

Table 4

NELS:88 Number of Items in Testlets

                            Testlet Type

Section        Common       Common         Common     Section
               Passage      Content        Process    Total

Reading        5,3,6,4,3    —              4,14,3     21
Math.          2,2          11,4,19,2,4    17,19,4    40
Science        —            8,7,2,8        8,10,6     25
Soc. stud.     5            3,14,13        —          30

Total items                                           116

Internal consistency reliabilities based on coefficient

Alpha for each section were reported. The reliabilities

were quite acceptable in reading, mathematics, and social

studies (0.84, 0.90, and 0.83, respectively). The science

test showed less reliability with a coefficient of 0.75. A

factor analysis also indicated that the science section was

less unifactorial than the reading, mathematics, and social

studies sections.

Differential item functioning analyses for ethnic and

gender groups were performed on all items by the Educational

Testing Service (ETS), using the MH procedure. Thin

matching was used for stratification, with the total section

score used as the matching variable. Very little DIF was

evident in the analyses, with the most being found in the

social studies area (NCES, 1991).

Brookshire (1993) investigated the presence of

differential item functioning in the NELS:88 test data,

using the MH procedure and thin matching. In that study,

the demographic subgroups identified were geographic region,

socioeconomic status, and urbanicity (urban, suburban, and

rural) designations. Similar to the ETS analysis, most of

the DIF was discovered in the social studies section.

CHAPTER 4

PRESENTATION AND ANALYSIS OF DATA

The purpose of this study was to compare applications

of two statistical methods in the detection of differential

testlet functioning (DTLF). The 29 testlets were

categorized into three groups: common passage, common

content, and common process. Only the first category,

common passage, includes testlets which fit the standard

definition of testlet, one where items are grouped together

on the test to answer questions related to the same passage,

figure, or case study. The other two categories are implied

testlets.

Because all 29 testlets were analyzed with three

different sample sizes, 87 testlet/sample size

combinations were considered. In this chapter, these 87

combinations are referred to as cases.

Naming conventions for the testlets are shown in

Appendix A. The items corresponding to each testlet (with

testlet types) are listed in Appendix B.

Over 600 chi-squares were calculated during the data

analysis stage. For each of the three sample sizes, the 29

testlets were analyzed using three different types of chi-

squares, and the 116 individual items were analyzed using

one type of chi-square. All chi-square values with

associated probabilities are listed in Appendix C.
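The count of "over 600" follows directly from the design just described: at each of the three sample sizes, three chi-squares were computed for each of the 29 testlets and one for each of the 116 items.

```latex
3 \times \left( 29 \times 3 + 116 \times 1 \right) = 3 \times 203 = 609
```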

For each research question, appropriate tallies were

made according to the comparison addressed by the question.

The tally sheets reduced the massive amount of data into

summaries. Exceptions or inconsistencies were marked on the

tally sheets. The marked chi-squares were then analyzed to

determine the degree of inconsistency.

The value of α = .001 was chosen as the level of

significance. With the Mantel chi-square statistic,

rejection of the null hypothesis indicated that the groups

differed significantly on testlet performance, even when

matched on underlying ability. When using the uniform

logistic discriminant chi-square statistic, rejection of H0

indicated that the probability of group membership differed

significantly with the addition of testlet score into the

equation. For the non-uniform logistic situation, the

interaction between testlet score and observed score was

added for consideration. In all three statistics, rejection

of the null hypothesis flagged the testlet as potentially

containing DTLF.

Data Analysis

Sample Data

The data analyzed were drawn from the NELS:88 data

base. Each of the 8 explicit testlets and the 21 implicit

testlets (as described in Table 3) were analyzed for uniform

differential testlet functioning, using both the Mantel and

the logistic discriminant procedures. The logistic

discriminant procedure was used to detect non-uniform

functioning. Each of the 116 single items in the test was

analyzed for differential item functioning for comparison

purposes.

The SAMPLE command in SPSS (SPSS Inc., 1990) was used

to select samples of examinees for the reference group and

the focal group, by chosen demographic subgroups. Gender

was chosen as the demographic variable, with males as the

reference group and females as the focal group.

Logistic Discriminant Function Analysis Procedure

The logistic regression procedure from SPSS was used to

perform logistic discriminant function analysis (Hosmer &

Lemeshow, 1989; Miller & Spray, 1993; Norusis, 1990). The

SPSS program is listed in Appendix D.

The dichotomous dependent variable was demographic

subgroup (gender). The independent variables were total

section score, testlet score, and the interaction (product)

of total section score and testlet score. The total section

score was considered the best available measure of overall

ability, the ability score.

To account for the collinearity between the total

section score and the testlet score with the interaction

product, the raw scores on the two variables were centered.

Centering consists of placing variables in the deviation

score form so that their means become zero (Aiken & West,

1991).
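The effect of centering on collinearity can be seen in a toy example. The five score pairs below are hypothetical and were chosen only to keep the arithmetic small; the sketch compares the correlation between a predictor and its product term before and after centering.

```python
def pearson(v, w):
    """Pearson correlation between two equal-length lists."""
    n = len(v)
    mean_v, mean_w = sum(v) / n, sum(w) / n
    sxy = sum((a - mean_v) * (b - mean_w) for a, b in zip(v, w))
    sxx = sum((a - mean_v) ** 2 for a in v)
    syy = sum((b - mean_w) ** 2 for b in w)
    return sxy / (sxx * syy) ** 0.5


x = [1, 2, 3, 4, 5]   # hypothetical total section scores
u = [2, 1, 4, 3, 5]   # hypothetical testlet scores

# Correlation of X with the raw product X * U.
raw_r = pearson(x, [a * b for a, b in zip(x, u)])

# Put both variables in deviation score form, then form the product.
xc = [a - sum(x) / len(x) for a in x]
uc = [b - sum(u) / len(u) for b in u]
centered_r = pearson(xc, [a * b for a, b in zip(xc, uc)])

print(round(raw_r, 3), round(centered_r, 3))
```

In this example the correlation between the predictor and the interaction term drops sharply after centering, which is the motivation for the deviation score form described above.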

Three models were fit to the data. First, the full

model includes all three independent variables. Second, the

reduced model deletes the interaction variable. Last, the

null model includes only the constant and the total section

score. For each model the $G^2$ (-2LL) model fit statistic was

computed.

The improvement statistic $G^2_{\text{diff}}$ between the full and

reduced models was estimated. If the $G^2_{\text{diff}}$ statistic was

significant, the null hypothesis of no improvement was

rejected, and non-uniform DTLF was suspected.

Similarly, $G^2_{\text{diff}}$ between the reduced and null models was

calculated. Uniform DTLF was suspected if the statistic

showed significance.

Mantel Score Test Procedure

For the Mantel procedure, the total section score was

considered the best measure of overall ability for the trait

being measured by the testlets and the items within that

section. That score was used to stratify the sample

according to the total percentage matching procedure, with

the combined (reference and focal) percentages used for

calibration (Donoghue & Allen, 1993).

The FREQ procedure in SAS with the CMH option (SAS

Institute, 1990) was used to calculate the Mantel chi-square

statistics and the accompanying probabilities (Zwick et al.,

1993). The SAS code is listed in Appendix E.

Mantel-Haenszel Procedure

For individual items, the Mantel-Haenszel procedure was

used. This procedure is almost identical to the Mantel

score test procedure, with a slight adjustment in the

calculation of the chi-square (Holland & Thayer, 1988). A

Pascal program was written to calculate the chi-square value

of the individual items. The program listing is shown in

Appendix F.

In Mantel-Haenszel calculations, variable choices were

the same as in the Mantel procedure. The total section

score was used to stratify the sample, with combined

percentages used for calibration (Donoghue & Allen, 1993).

Results

This section includes referrals to percentages in both

the tables and the discussions. All percentages have been

rounded to the nearest whole percentage.

Research Question 1

Do the Mantel score test of conditional independence

procedure and the logistic discriminant function analysis

procedure detect the same differential testlet functioning

in the same testlets?

For this research question, the uniform LDFA result and

the Mantel result for uniform DTLF were compared for each

testlet at each sample size. The counts of cases where the

two statistics were inconsistent with regard to rejection of

the null hypothesis are shown in Table 5. Of the 87

different testlet/sample size cases, only 3 of the compared

pairs failed to match in H0 rejection at the 0.001 level of

significance. All three marked cases fall into the common

content category and are in the sample of 1,000.

Table 5

Non-Matches in H0 Rejection for LDFA-U vs. Mantel-U (p < .001)

                           Sample size

Testlet Type       500     1000    2000    Ratio

Common passage     0:8     0:8     0:8     0:24 (0%)
Common content     0:12    3:12    0:12    3:36 (8%)
Common process     0:9     0:9     0:9     0:27 (0%)

Total              0:29    3:29    0:29    3:87 (3%)

The values of the chi-squares and associated

probabilities for the marked cases are shown in Table 6.

Although the Mantel tests did not reject the null hypothesis at the

0.001 level, the chi-square statistics are relatively high,

and the probabilities are relatively low.

Table 6

Chi-Squares and Probabilities for Non-Matches

                        LDFA-U                 Mantel

Testlet    N        Chi-Sq.     Prob.      Chi-Sq.    Prob.

T22        1000     *13.271     .0003      9.852      .0017
T28        1000     *18.442     .0000      7.765      .0053
T29        1000     *13.402     .0003      10.12      .0015

*p < .001
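The relation between a chi-square value with one degree of freedom and its probability can be checked with a short sketch. The function name is hypothetical; the calculation relies on the fact that a chi-square variate with one degree of freedom is the square of a standard normal variate, so its upper-tail probability can be written with the complementary error function.

```python
import math


def chi_square_1df_p(chi_sq):
    """Upper-tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(chi_sq / 2.0))


ALPHA = 0.001  # level of significance used in this study

# Chi-square values reported for testlet T22 at N = 1000.
for name, stat in [("LDFA-U", 13.271), ("Mantel", 9.852)]:
    p = chi_square_1df_p(stat)
    decision = "reject H0" if p < ALPHA else "retain H0"
    print(name, round(p, 4), decision)
```

For these two values the sketch should lead to the same decisions as the table: rejection for the LDFA-U statistic and retention for the Mantel statistic at the 0.001 level.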

Research Question 2

Does the logistic discriminant function analysis

procedure detect both uniform and non-uniform differential

testlet functioning to the same extent?

For this research question, two comparisons were made.

For both comparisons, the LDFA non-uniform chi-square (the

only chi-square statistic used to flag non-uniform DTLF) was

used as the basis for comparison. The non-uniform statistic

was compared to each of the uniform DTLF detection methods,

regarding whether or not the null hypothesis was rejected.

(See Appendix C.)

First, the non-uniform LDFA statistic was compared to

the uniform LDFA statistic, regarding whether or not the

null hypothesis was rejected. Counts of cases showing

inconsistency in rejection of H0 are indicated in Table 7.

A case is counted when one, but not both, of the chi-squares

for a case is significant at the 0.001 level.

Table 7

Non-Matches in H0 Rejection for LDFA-N vs. LDFA-U (p<.001)

                        Sample size
Testlet Type      500     1000    2000    Ratio
Common passage    —       —       3       3:24 (13%)
Common content    1       4       4       9:36 (25%)
Common process    —       2       5       7:27 (26%)
Total                                     19:87 (22%)

Data in Table 7 show that 19 of the 87

testlet/sample size cases, or 22%, did not match in

rejection of the null hypothesis when the non-uniform

statistic was compared with the logistic uniform statistic.

The number of marked cases increased with the increase in

the size of the sample. Matches in H0 rejection were not

expected in comparing uniform with non-uniform, because a

testlet may contain differential functioning of only one

type. About 13% of the common passage cases, as well as 25%

of common content and 26% of common process cases, are

marked.

Second, the non-uniform rejections were compared to the

Mantel uniform rejections. The cases where one was rejected

but not both are counted in Table 8.

Table 8

Non-Matches in H0 Rejection for LDFA-N vs. Mantel-U (p<.001)

Sample size





Total 16:87 (18%)

The percentage of non-matching cases in Table 8 is

approximately 18%, which is similar to results presented in

Table 7. The number of non-matching cases increased as sample size

increased. Again, matches were not expected. The testlet

types show variation in Table 8, with 13% of common passage

cases marked, 17% of common content, and 26% of common

process cases.

Research Question 3

To what extent does variation in sample size influence

detection of differential testlet functioning?

The occurrences of significant chi-squares are

displayed in Table 9 according to sample size. At the 0.001

level of significance, only 1 testlet out of 29 (3%) was

flagged for differential testlet functioning in the sample

of 500. As the sample size increased, more cases were

flagged, with 8 testlets out of 29 (28%) showing DTLF in the

sample of 1,000, and 17 out of 29 (59%) in the sample of

2,000. Overall, approximately 30% of the 87 possibilities

indicated one or more types of DTLF.

Table 9

Summary of Significant Chi-Squares (p<.001)

                        Sample size
Testlet type      500          1000         2000          Ratio
Total ratios      1:29 (3%)    8:29 (28%)   17:29 (59%)   26:87 (30%)

Data in Table 9 show that 17% of the common passage

cases were marked, while 42% of the common content and 26%

of the common process types of cases were flagged. Common

content testlets show the most potential DTLF, and common

passage testlets show the least.

Table 10

Occurrences of Significant Chi-Squares (p<.001)

                                    Sample size
Testlet Type      Testlet   500     1000    2000
Common passage    T2        —       —       NUM
                  T3        —       —       N
                  T9        —       —       N
                  T26       —       —       UM
Common content    T12       —       —       N
                  T15       UM      UM      UM
                  T19       —       NUM     NUM
                  T20       —       —       N
                  T21       —       NUM     N
                  T22       —       U       NUM
                  T28       —       U       NUM
                  T29       —       U       NUM
Common process    T6        —       —       N
                  T8        —       —       N
                  T23       —       —       N
                  T24       —       N       N
                  T25       —       N       N

Note: Statistical methods: N = Non-Uniform LDFA-N, U = Uniform LDFA-U, M = Mantel Uniform

The statistical methods with significant chi-squares

are coded in Table 10. Only non-uniform DTLF was flagged in

common process testlets, with a variety of DTLF in other

types of testlets.
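
The codes in Table 10 are enough to recompute the counts reported for the first three research questions. The sketch below is an illustration only: the dictionary is a transcription of Table 10 in which a dash or an empty cell is read as "no method flagged the testlet" (an assumption), and the function names are hypothetical.

```python
SAMPLE_SIZES = (500, 1000, 2000)

# One code string per sample size (500, 1000, 2000), transcribed from Table 10.
# N = LDFA-N, U = LDFA-U, M = Mantel uniform; "" = nothing flagged (assumed).
TABLE10 = {
    "common passage": {"T2": ("", "", "NUM"), "T3": ("", "", "N"),
                       "T9": ("", "", "N"), "T26": ("", "", "UM")},
    "common content": {"T12": ("", "", "N"), "T15": ("UM", "UM", "UM"),
                       "T19": ("", "NUM", "NUM"), "T20": ("", "", "N"),
                       "T21": ("", "NUM", "N"), "T22": ("", "U", "NUM"),
                       "T28": ("", "U", "NUM"), "T29": ("", "U", "NUM")},
    "common process": {"T6": ("", "", "N"), "T8": ("", "", "N"),
                       "T23": ("", "", "N"), "T24": ("", "N", "N"),
                       "T25": ("", "N", "N")},
}


def flagged_by_sample() -> dict:
    """Number of testlets flagged by at least one method, per sample size."""
    counts = {n: 0 for n in SAMPLE_SIZES}
    for testlets in TABLE10.values():
        for codes in testlets.values():
            for n, code in zip(SAMPLE_SIZES, codes):
                if code:
                    counts[n] += 1
    return counts


def non_matches(method_a: str, method_b: str) -> int:
    """Cases where exactly one of the two methods rejected H0."""
    total = 0
    for testlets in TABLE10.values():
        for codes in testlets.values():
            for code in codes:
                if (method_a in code) != (method_b in code):
                    total += 1
    return total


print(flagged_by_sample())
print(non_matches("U", "M"), non_matches("N", "U"), non_matches("N", "M"))
```

With this transcription the tallies agree with the totals reported in Tables 5, 7, 8, and 9, which is a useful check on the reading of Table 10.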

Research Question 4

How do the results of differential item functioning

differ from differential testlet functioning when the Mantel

score test of conditional independence procedure is used for

both analyses?

This is the only research question which addresses the

116 individual items which make up the testlets, and

possible occurrence of differential item functioning (DIF).

The items with significant chi-squares at the 0.001

level of significance are listed in Table 11. As the sample

size increased, so did the number of significant chi-

squares, with most items occurring in the sample of 2,000.

In comparing item significance to corresponding testlet

significance, the Mantel-Haenszel chi-square for items and

the Mantel chi-square for testlets were used. (See Appendix

C.) The situation was considered inconsistent if the

testlet and at least half of the items did not match in H0

rejection.
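
One reading of this decision rule can be written as a short function. This is a sketch under the assumption that "did not match" means an item's H0 decision differs from the testlet's decision; the function name and argument layout are hypothetical.

```python
from typing import Sequence


def is_inconsistent(testlet_significant: bool,
                    items_significant: Sequence[bool]) -> bool:
    """True when at least half of the items disagree with the testlet."""
    mismatches = sum(item != testlet_significant for item in items_significant)
    return 2 * mismatches >= len(items_significant)


# Testlet significant, none of its five items significant: inconsistent.
print(is_inconsistent(True, [False] * 5))
# Testlet not significant, two of four items significant: inconsistent.
print(is_inconsistent(False, [True, True, False, False]))
# Testlet significant, three of four items significant: consistent.
print(is_inconsistent(True, [True, True, True, False]))
```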

Table 11

Items Showing Significance at p<.001

                            Sample size
Section           500     1000    2000
Reading           —       —       RE6
                  —       —       RE12
                  —       —       RE13
Mathematics       —       —       MA12
                  —       MA20    MA20
                  —       MA25    MA25
                  —       —       MA28
Science           SC4     SC4     SC4
                  —       —       SC12
                  —       —       SC15
Social Studies    —       —       SS11
                  —       SS12    SS12
                  SS21    SS21    SS21

The instances showing inconsistency between testlet and

item H0 rejection are counted in Table 12. Twelve out of 87

instances (about 14%) show some degree of inconsistency.

The common content cases showed more inconsistencies

than did the other two types. Ten (28%) of the common

content testlets were marked, compared to two and one of the

other types (8% and 4%).

Table 12

Inconsistencies Between Testlets and Associated Items

                        Sample size
Testlet type      500     1000    2000    Ratio
Common process    —       1       —       1:27 (4%)
Total ratio                               12:87 (14%)

For the inconsistent cases flagged in Table 12, it is

interesting to note how many of the items were flagged for

DIF. The number of items flagged for DIF, out of the number

of items on a testlet, are shown in Table 13. Only the

testlets in Table 12 were analyzed.

Table 13

Ratio of Significant Items to Testlet Items in Flagged Cases

                                    Sample size
Testlet type      Testlet   500     1000    2000
Common passage    T2        —       —       1:3
                  T26       —       —       0:5
Common content    T15       0:4     1:4     1:4
                  T19       —       0:8     1:8
                  T21       * 1:2   —       * 1:2
                  T22       —       —       2:8
                  T28       —       —       1:14
                  T29       —       —       0:13
Common process    T18       —       * 2:4   —

Note: In most cases, the testlet was significant and the items were not significant. The * indicates that the items were significant and the testlet was not significant.

CHAPTER 5

FINDINGS, CONCLUSIONS, AND RECOMMENDATIONS

The purpose of this study was to compare results of

distinctive statistical methodologies in searching for

differential functioning in test items and testlets. Five

factors were varied: statistical method, uniformity,

testlet type, unit of analysis, and sample size. Subjects

were randomly chosen from the NELS:88 data base of over

25,000 eighth-grade students. Scores were analyzed for

differential functioning using programs written in SPSS,

SAS, and Pascal.

The EXCEL spreadsheet was used to organize the data.

Comparisons were made on the five factors that were varied

in the design of the study.

In this chapter, the logistic discriminant procedure

which compares the full model with the reduced model and is

used to detect non-uniform functioning is termed LDFA-N.

Similarly, the procedure used to detect uniform functioning

which compares the reduced model with the null model is

termed LDFA-U.
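
For reference, the three nested models and the two test statistics can be written as follows. The notation is an assumption based on the usual formulation of logistic discriminant function analysis (Miller & Spray, 1993) and is not taken from this chapter: G is the group indicator, X is the matching (section) score, U is the testlet score, and L denotes the maximized likelihood of each model.

```latex
\begin{align*}
\text{Full:}    \quad & \operatorname{logit} P(G = 1 \mid X, U) = \alpha_0 + \alpha_1 X + \alpha_2 U + \alpha_3 X U \\
\text{Reduced:} \quad & \operatorname{logit} P(G = 1 \mid X, U) = \alpha_0 + \alpha_1 X + \alpha_2 U \\
\text{Null:}    \quad & \operatorname{logit} P(G = 1 \mid X) = \alpha_0 + \alpha_1 X \\[4pt]
\chi^2_{\text{LDFA-N}} &= -2\left[\ln L_{\text{reduced}} - \ln L_{\text{full}}\right], \qquad
\chi^2_{\text{LDFA-U}} = -2\left[\ln L_{\text{null}} - \ln L_{\text{reduced}}\right]
\end{align*}
```

Under this formulation each comparison drops a single parameter, so each statistic would be referred to a chi-square distribution with one degree of freedom.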

Findings

Research Question 1



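Do the Mantel score test of conditional independence

procedure and the logistic discriminant function analysis

procedure detect the same differential testlet functioning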

in the same testlets?

In general, the two methods showed similar results.

Less than 4% of the cases failed to match when the Mantel

and LDFA-U results were compared. The three cases which

were rejected by the LDFA-U and not by the Mantel were close

in the chi-square values and probabilities.

All cases in the common passage category, the eight

explicit testlets, matched in all three varieties of sample

sizes. Therefore, using the most common definition of a

testlet, the results indicate that the two methodologies had

perfect consistency in detection of differential testlet

functioning.

All three cases which failed to show consistency were

in the second category, the common content testlets. These

were testlets with items which were not presented together

but contained the same type of content. In the other

implied testlet type, common process, all cases showed

consistency in comparison of the statistical methods.

Research Question 2



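Do the Mantel score test of conditional independence

procedure and the logistic discriminant function analysis

procedure detect both uniform and non-uniform differential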

testlet functioning to the same extent?

Of the three processes used to inspect testlets for

DTLF, only the LDFA-N was expected to reveal non-uniform

DTLF. Two procedures were presumed to reveal uniform DTLF:

the Mantel uniform and the LDFA-U.

In both situations of comparing LDFA-N results with the

two uniform procedures, the uniform and non-uniform failed

to match in 18% to 22% of the comparisons. As anticipated,

the uniform and non-uniform results were not consistent.

When the LDFA-N results were compared with the uniform

LDFA-U results, the findings were almost identical to the

comparison of LDFA-N with the Mantel uniform results. Only 3

cases out of 87 differed between the two comparisons. All three cases were

in the second category of testlet types, the common content

testlets. No common passage or common process cases showed

inconsistency between the two comparisons. In all three

instances of incompatibility, the LDFA-U approach flagged

the testlet where the Mantel did not, possibly indicating a

stronger method in the logistic discriminant technique.

Research Question 3

To what extent does variation in sample size influence

detection of differential testlet functioning?

In general, the number of testlets indicating

differential testlet functioning increased as a function of

sample size. As the sample size increased from 500 to 1,000

to 2,000, the percentages of testlets showing possible DTLF

increased from 3%, to 28%, and finally to 59%.

Most of the DTLF occurred in the implicit types of

testlets: those where items were spread throughout the test

and not grouped by a common passage. The testlets implied

by common content, the second testlet category, had the

highest rate of DTLF, with 42% of the cases flagged. Those

cases with item grouping implied by common processes had a

rate of 26%, while only 17% of the conventional common

passage cases showed possible DTLF.

Research Question 4

How do the results of differential item functioning

differ from differential testlet functioning when the Mantel

procedure is used for both analyses?

The Mantel testlet outcomes were compared with the

Mantel-Haenszel item outcomes. The greatest percentage of

inconsistencies, 28%, occurred in the common content type of

testlets, with the other two types showing only 8% and 4%.

In most instances of inconsistency, the testlets were

significant where fewer than half of the items were

significant. There is no clear explanation for this

phenomenon. In one of the two marked cases in the standard

common passage testlets, none of the five items had

significant chi-squares and yet the testlet was selected.

This was the only research question to address

individual items. As in testlet results, the increase in

the sample size was positively correlated with the increase

in the number of items selected. In the sample of 500, only

2 items were flagged for DIF. As the sample increased to

1,000, the same 2 items and 3 more were indicated. In the

largest sample, the Mantel-Haenszel procedure selected the

same 5 items and 8 more, making a total of 13 items with

potential differential functioning. The flagged items in

the largest sample were rather evenly spread among the four

sections of the NELS:88 test: reading, mathematics,

science, and social studies.

Conclusions

This study contributes new information to the

literature of differential testlet functioning as well as

verifying previous research. The common capabilities of the

LDFA-U method and the Mantel method in detecting DTLF were

revealed once again, as was the lack of the power of the

Mantel score test procedure to detect non-uniform

differential functioning. As the sample size increased, the

number of testlets with suspected DTLF and of items with

potential DIF also increased.

Of particular interest was the abundance of anomalies

in the second category of testlets, the common content

testlets. These implied testlets were composed of items

which were not physically grouped together on the test, but

merely contained similar content. Most previous research

involving testlets has used only common passage testlets,

which are grouped together with a reading passage, picture,

or case study. The common passage testlets generally

performed as expected in this study. But the implied

testlets showed inconsistent performance.

Perhaps the most interesting finding was the

inconsistency between some of the testlets and associated

items regarding detection of differential functioning. Only

two of the standard common passage testlets showed

inconsistency, but there is no indication of a reason for

such erratic results.

Recommendations

There is a scarcity of research available in the area

of testlets, and particularly of differential testlet

functioning. This study opens up several possibilities for

a number of research projects.

DIF and DTLF Inconsistencies

Why do some testlets show a likelihood for differential

functioning when none of the items, or only a small

percentage of the items, show such a likelihood? The need

to scrutinize possible explanations and form hypotheses

exists. Then studies based upon the hypotheses can be

planned and performed.

Post Hoc Tests

If a testlet is flagged as a potential carrier of DTLF,

there is no clear follow-up procedure for verifying the

degree of differential functioning. Post hoc procedures are

needed for test developers to further screen testlets for

inclusion on tests.

Implicit Testlets

Very few, if any, previous studies have used testlets

which are defined by common content or common process, as

opposed to the standard definition of a common passage.

More research is needed in this area.

Polytomous Models

Polytomous DIF models are appropriate for DTLF

exploration methodologies. If dichotomous scoring

procedures are used with testlet scores (for example, pass

or fail), then much statistical information is wasted.

Two polytomous DIF models, the Mantel score test method

and the logistic discriminant function analysis method, were

chosen for this study. Because the methods were compared on

uniform versus non-uniform functioning, sample size, and

testlet DTLF versus single-item DIF, the comparison was

limited to two models to keep the complexity of the study

manageable.

The level of response allowed by the various polytomous

methods was a deciding factor in choosing the methods for

this study. The lowest appropriate response level for a

testlet based method is ordinal level, because the possible

testlet responses are ordered. Again, using a lower

measurement scale classification results in wasted

statistical information and lower precision. The Mantel and

discriminant methods both analyze ordinal level responses.

Some of the logistic regression models consider

ordering of responses but were not chosen because they

require many separate model estimations and interpretation

is confusing (Swaminathan & Rogers, 1990). Miller and Spray

(1992) found in a simulation study that the continuation

ratio logit analysis method failed to flag nonuniform DIF

under certain conditions. Both the Mantel and discriminant

methods produce statistics that allow straightforward

interpretation of research results (Miller & Spray, 1992;

Miller & Spray, 1993).

The other polytomous methods discussed in the

literature review section, but not included in this study,

are limited to the nominal level response categories. Those

models are the generalized Cochran Mantel Haenszel (Zwick,

Donoghue, & Grima, 1993), other logistic regression

procedures (Agresti, 1990; Miller & Spray, 1993), t-test

procedures HW1 and HW3 (Welch & Hoover, 1993), and Bock's

nominal model (Wainer, Sireci, & Thissen, 1991).

The latent trait polytomous methods, mentioned in the

literature review, are not included in this study. IRT-

based procedures are "sensitive to sample size and model-

data fit and are expensive in terms of computer time"

(Swaminathan & Rogers, 1990). Future research efforts

should compare polytomous latent trait methods with observed

score methods for testlet based tests.

Testlet Design and Scoring

In the context of screening paper and pencil test

testlets for future use in testlet pools for adaptive

testing, only the linear structured testlets are considered.

In a CAT screening format it is possible to analyze testlets

with a hierarchical structure.

The number right was the testlet score used in this

study. Other testlet scoring strategies have been offered

in the literature. For example, Wainer and Kiely (1987)

discussed using the response pattern score for

hierarchically structured testlets.

Ability Score

Typically the total test score is used for conditioning

on the ability measure in observed score methods of

differential functioning analysis, as discussed in the

literature review. In this study, the total section score

was used as the ability measure in both procedures.

The flexibility of the logistic discriminant procedure

would allow other scores to be investigated as predictors of

group membership. For example, a separate test purported to

measure the overall ability of interest could be used as an

independent variable in the equation (Miller & Spray, 1993).

Conceivably, demographic variables could be used as

conditioning variables to predict group membership.

Thick and Thin Matching

As noted in the literature review, a study by Donoghue

and Allen (1993) offered seven different levels of matching,

from thin through various degrees of thickness to no

matching (as an extreme by which to compare the other

methods). This study used one type of thick matching (total

percentage matching) which is reported to be the most

appropriate for Mantel chi-square procedures. More types of

matching should be compared on studies using Mantel or

Mantel-Haenszel procedures.

Other Issues

Many other questions arise from areas of this study or

from current literature. Just a few of the questions are

listed here.

1. In the context of adaptive testing, do various

paths differentiate between two subgroups?

2. What causes DIF or DTLF? Sometimes DIF items have

no logical reason to be flagged with differential

functioning.

3. What cognitive processes cause uniform and non-

uniform DTLF?

APPENDIX A

TESTLET NAMING CONVENTION

Common Passage Testlets (8):
    Reading Testlets: T1-T5
    Math Testlets: T9-T10
    Social Stu. Testlet: T26

Common Content Testlets (12):
    Math Testlets: T11-T15
    Science Testlets: T19-T22
    Social Stu. Testlets: T27-T29

Common Process Testlets (9):
    Reading Testlets: T6-T8
    Math Testlets: T16-T18
    Science: T23-T25

APPENDIX B

NELS:88 TESTLET ITEMS

Reading Testlets:
T1  = RE1 to RE5                                   passage
T2  = RE6 to RE8                                   passage
T3  = RE9 to RE14                                  passage
T4  = RE15 to RE18                                 passage
T5  = RE19 to RE21                                 passage
T6  = RE1 to RE3, RE6                              process:repro-detail
T7  = RE4, RE5, RE7, RE10 to RE14,                 process:inference/eval
      RE16 to RE21
T8  = RE8, RE9, RE15                               process:comprehension

Mathematics Testlets:
T9  = MA2, MA3                                     passage
T10 = MA6, MA7                                     passage
T11 = MA1, MA4, MA7, MA14, MA15, MA26,             content:algebra
      MA27, MA29, MA34, MA39, MA40
T12 = MA2, MA3, MA21, MA24                         content:data/prob
T13 = MA5, MA8, MA9, MA10, MA12, MA13,             content:arithmetic
      MA16 to MA20, MA22, MA23, MA28,
      MA30 to MA33, MA36
T14 = MA6, MA35                                    content:adv topics
T15 = MA11, MA25, MA37, MA38                       content:geometry
T16 = MA1, MA3, MA5, MA6, MA8, MA9,                process:skill/know
      MA12, MA13, MA15 to MA19, MA22,
      MA25, MA34, MA40
T17 = MA2, MA4, MA7, MA10, MA11, MA14,             process:und/comp
      MA20, MA21, MA24, MA26, MA27, MA29,
      MA31, MA32, MA33, MA36 to MA39
T18 = MA23, MA28, MA30, MA35                       process:prob solv

Science Testlets:
T19 = SC1, SC2, SC5, SC7, SC8, SC12, SC18, SC21
T20 = SC3, SC10, SC11, SC14, SC19, SC20, SC23
T21 = SC4, SC24
T22 = SC6, SC9, SC13, SC15 to SC17, SC22, SC25
T23 = SC1, SC4, SC13, SC14, SC20, SC22, SC23, SC25
T24 = SC2, SC5, SC6, SC8 to SC10, SC12, SC15, SC18, SC19
T25 = SC3, SC7, SC16, SC17, SC21, SC24

content:earth sci

content:chemistry content:sci method

content:life sci

process:prob solv

process:decl know

Social Studies Testlets:
T26 = SS5 to SS9                                   passage
T27 = SS1, SS12, SS26                              content:geography
T28 = SS2, SS4, SS10, SS11, SS13, SS14,            content:history
      SS17, SS18, SS20, SS21, SS25, SS27,
      SS28, SS29
T29 = SS3, SS5 to SS9, SS15, SS16, SS19,           content:citizenship
      SS22 to SS24, SS30

APPENDIX C

CHI-SQUARE AND PROBABILITY VALUES FOR TESTLETS AND ITEMS

SUMMARY (N=500)

TESTLET 1

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
0.137   0.7113      0.020   0.8875      0.073   0.7870      RE1   0.153   0.6958
                                                            RE2   0.760   0.3834
                                                            RE3   0.013   0.9110
                                                            RE4   0.001   0.9789
                                                            RE5   0.283   0.5948

TESTLET 2

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
5.217   0.0224      0.308   0.5789      0.244   0.6213      RE6   0.606   0.4365
                                                            RE7   0.301   0.5835
                                                            RE8   0.312   0.5766

TESTLET 3

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.255   0.2626      0.000   1.0000      0.002   0.9643      RE9   0.796   0.3724
                                                            RE10  0.574   0.4489
                                                            RE11  0.001   0.9761
                                                            RE12  2.518   0.1125
                                                            RE13  0.375   0.5405
                                                            RE14  0.818   0.3659

TESTLET 4

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.702   0.1920      0.001   0.9748      0.004   0.9496      RE15  0.093   0.7601
                                                            RE16  0.034   0.8539
                                                            RE17  0.005   0.9453
                                                            RE18  0.204   0.6514

TESTLET 5

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
0.083   0.7733      0.659   0.4169      0.478   0.4893      RE19  0.226   0.6346
                                                            RE20  0.011   0.9153
                                                            RE21  0.249   0.6177

TESTLET 6

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
2.516   0.1127      1.374   0.2411      1.691   0.1935      RE1   0.153   0.6957
                                                            RE2   0.760   0.3833
                                                            RE3   0.013   0.9092
                                                            RE6   0.606   0.4363

TESTLET 7

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
0.817   0.3661      0.329   0.5662      0.182   0.6697      RE4   0.001   0.9748
                                                            RE5   0.283   0.5947
                                                            RE7   0.301   0.5833
                                                            RE10  0.574   0.4487
                                                            RE11  0.001   0.9748
                                                            RE12  2.518   0.1126
                                                            RE13  0.375   0.5403
                                                            RE14  0.818   0.3658
                                                            RE16  0.034   0.8537
                                                            RE17  0.005   0.9436
                                                            RE18  0.204   0.6515
                                                            RE19  0.226   0.6345
                                                            RE20  0.011   0.9165
                                                            RE21  0.249   0.6178

TESTLET 8

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
2.313   0.1283      0.168   0.6819      0.209   0.6476      RE8   0.312   0.5765
                                                            RE9   0.796   0.3723
                                                            RE15  0.093   0.7604

TESTLET 9

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.148   0.2840      0.023   0.8795      0.023   0.8795      MA2   6.468   0.0110
                                                            MA3   4.204   0.0403

TESTLET 10

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
0.145   0.7034      0.516   0.4726      0.279   0.5974      MA6   0.582   0.4455
                                                            MA7   0.002   0.9643

TESTLET 11

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
0.171   0.6792      5.262   0.0218      5.085   0.0241      MA1   0.006   0.9383
                                                            MA4   0.000   1.0000
                                                            MA7   0.002   0.9643
                                                            MA14  0.158   0.6910
                                                            MA15  0.156   0.6929
                                                            MA26  0.188   0.6646
                                                            MA27  0.055   0.8146
                                                            MA29  2.713   0.0995
                                                            MA34  2.646   0.1038
                                                            MA39  0.876   0.3493
                                                            MA40  3.599   0.0578

TESTLET 12

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.771   0.1833      1.305   0.2533      1.453   0.2280      MA2   6.468   0.0110
                                                            MA3   4.204   0.0403
                                                            MA21  3.257   0.0711
                                                            MA24  0.053   0.8179

TESTLET 13

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
0.420   0.5169      1.961   0.1614      1.023   0.3118      MA5   5.106   0.0238
                                                            MA8   0.011   0.9165
                                                            MA9   2.850   0.0914
                                                            MA10  1.605   0.2052
                                                            MA12  2.709   0.0998
                                                            MA13  0.295   0.5870
                                                            MA16  6.268   0.0123
                                                            MA17  0.099   0.7530
                                                            MA18  0.431   0.5115
                                                            MA19  0.328   0.5668
                                                            MA20  6.314   0.0120
                                                            MA22  0.371   0.5425
                                                            MA23  2.560   0.1096
                                                            MA28  5.008   0.0252
                                                            MA30  5.139   0.0234
                                                            MA31  0.011   0.9165
                                                            MA32  0.009   0.9244
                                                            MA33  0.081   0.7759
                                                            MA36  0.315   0.5746

TESTLET 14

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.465   0.2261      0.494   0.4821      0.346   0.5564      MA6   0.582   0.4455
                                                            MA35  0.001   0.9748

TESTLET 15

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
0.200   0.6547      15.346  0.0001  *   13.454  0.0002  *   MA11  2.521   0.1123
                                                            MA25  5.799   0.0160
                                                            MA37  3.794   0.0514
                                                            MA38  1.545   0.2139

TESTLET 16

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
                                                            MA3   4.204   0.0403
                                                            MA5   5.106   0.0238
                                                            MA6   0.582   0.4455
                                                            MA8   0.011   0.9165
                                                            MA9   2.850   0.0914
                                                            MA12  2.709   0.0998
                                                            MA13  0.295   0.5870
                                                            MA15  0.156   0.6929
                                                            MA16  6.268   0.0123
                                                            MA17  0.099   0.7530
                                                            MA18  0.431   0.5115
                                                            MA19  0.328   0.5668
                                                            MA22  0.371   0.5425
                                                            MA25  5.799   0.0160
                                                            MA34  2.646   0.1038
                                                            MA40  3.599   0.058

TESTLET 17

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.010   0.3149      0.772   0.3796      0.438   0.5081      MA2   6.468   0.0110
                                                            MA4   0.000   1.0000
                                                            MA7   0.002   0.9643
                                                            MA10  1.605   0.2052
                                                            MA11  2.521   0.1123
                                                            MA14  0.158   0.6910
                                                            MA20  6.314   0.0120
                                                            MA21  3.257   0.0711
                                                            MA24  0.053   0.8179
                                                            MA26  0.188   0.6646
                                                            MA27  0.055   0.8146
                                                            MA29  2.713   0.0995
                                                            MA31  0.011   0.9165
                                                            MA32  0.009   0.9244
                                                            MA33  0.081   0.7759
                                                            MA36  0.315   0.5746
                                                            MA37  3.794   0.0514
                                                            MA38  1.545   0.2139
                                                            MA39  0.876   0.3493

TESTLET 18

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
0.413   0.5205      2.488   0.1147      2.328   0.1271      MA23  2.560   0.1096
                                                            MA28  5.008   0.0252
                                                            MA30  5.139   0.0234
                                                            MA35  0.001   0.9748

TESTLET 19

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
6.277   0.0122      8.570   0.0034      10.398  0.0013      SC1   0.257   0.6122
                                                            SC2   1.060   0.3032
                                                            SC5   0.065   0.7988
                                                            SC7   2.043   0.1529
                                                            SC8   3.866   0.0493
                                                            SC12  1.756   0.1851
                                                            SC18  0.198   0.6563
                                                            SC21  0.703   0.4018

TESTLET 20

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.459   0.2271      0.565   0.4523      0.871   0.3507      SC3   1.065   0.3021
                                                            SC10  0.002   0.9643
                                                            SC11  0.029   0.8648
                                                            SC14  0.121   0.7280
                                                            SC19  1.641   0.2002
                                                            SC20  0.162   0.6873
                                                            SC23  0.157   0.6919

TESTLET 21

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
8.013   0.0046      7.566   0.0059      6.821   0.0090      SC4   12.167  0.0005  *
                                                            SC24  0.004   0.9496

TESTLET 22

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.281   0.2577      4.989   0.0255      2.673   0.1021      SC6   0.009   0.9244
                                                            SC9   0.628   0.4281
                                                            SC13  0.000   1.0000
                                                            SC15  0.279   0.5974
                                                            SC16  1.514   0.2185
                                                            SC17  0.004   0.9496
                                                            SC22  1.980   0.1594
                                                            SC25  0.050   0.8231

TESTLET 23

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.995   0.1578      2.015   0.1558      1.341   0.2469      SC1   0.257   0.6122
                                                            SC4   12.167  0.0005  *
                                                            SC13  0.000   1.0000
                                                            SC14  0.121   0.7280
                                                            SC20  0.162   0.6873
                                                            SC22  1.980   0.1594
                                                            SC23  0.157   0.6919
                                                            SC25  0.050   0.8231

TESTLET 24

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
3.332   0.0679      2.199   0.1381      3.848   0.0498      SC2   1.060   0.3032
                                                            SC5   0.065   0.7988
                                                            SC6   0.009   0.9244
                                                            SC8   3.866   0.0493
                                                            SC9   0.628   0.4281
                                                            SC10  0.002   0.9643
                                                            SC12  1.756   0.1851
                                                            SC15  0.279   0.5974
                                                            SC18  0.198   0.6563
                                                            SC19  1.641   0.2002

TESTLET 25

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
5.612   0.0178      0.046   0.8302      0.036   0.8495      SC3   1.065   0.3021
                                                            SC7   2.043   0.1529
                                                            SC16  1.514   0.2185
                                                            SC17  0.004   0.9496
                                                            SC21  0.703   0.4018
                                                            SC24  0.004   0.950

TESTLET 26

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
0.179   0.6722      1.537   0.2151      0.409   0.5225      SS5   1.123   0.2893
                                                            SS6   0.430   0.5120
                                                            SS7   0.052   0.8196
                                                            SS8   0.007   0.9333
                                                            SS9   0.063   0.8018

TESTLET 27

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.055   0.3044      0.354   0.5519      0.325   0.5686      SS1   1.742   0.1869
                                                            SS12  3.241   0.0718
                                                            SS26  4.003   0.0454

TESTLET 28

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.376   0.2408      5.384   0.0203      3.871   0.0491      SS2   3.221   0.0727
                                                            SS4   0.755   0.3849
                                                            SS10  2.629   0.1049
                                                            SS11  0.763   0.3824
                                                            SS13  0.994   0.3188
                                                            SS14  1.987   0.1587
                                                            SS17  0.214   0.6437
                                                            SS18  0.029   0.8648
                                                            SS20  0.119   0.7301
                                                            SS21  26.908  0.0000  *
                                                            SS25  1.120   0.2899
                                                            SS27  0.237   0.6264
                                                            SS28  4.326   0.0375
                                                            SS29  0.311   0.5771

TESTLET 29

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
2.856   0.0910      3.837   0.0501      1.811   0.1784      SS3   0.026   0.8719
                                                            SS5   1.123   0.2893
                                                            SS6   0.430   0.5120
                                                            SS7   0.052   0.8196
                                                            SS8   0.007   0.9333
                                                            SS9   0.063   0.8018
                                                            SS15  0.153   0.6957
                                                            SS16  0.007   0.9333
                                                            SS19  0.031   0.8602
                                                            SS22  0.480   0.4884
                                                            SS23  0.336   0.5621
                                                            SS24  3.834   0.0502
                                                            SS30  1.268   0.2601

* p < .001

SUMMARY (N=1000)

TESTLET 1

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
0.027   0.8695      0.108   0.742       0.034   0.8537      RE1   0.993   0.3191
                                                            RE2   0.414   0.5201
                                                            RE3   0.484   0.4867
                                                            RE4   0.196   0.6582
                                                            RE5   0.508   0.4760

TESTLET 2

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
6.741   0.0094      7.064   0.0079      5.864   0.0155      RE6   7.9010  0.0049
                                                            RE7   0.0000  1.0000
                                                            RE8   1.8898  0.1692

TESTLET 3

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
6.243   0.0125      0.075   0.7842      0.131   0.7174      RE9   1.360   0.2435
                                                            RE10  0.195   0.6585
                                                            RE11  0.760   0.3832
                                                            RE12  4.938   0.0263
                                                            RE13  4.398   0.0360
                                                            RE14  0.165   0.6842

TESTLET 4

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
3.342   0.0675      0.090   0.7642      0.236   0.6269      RE15  0.009   0.9236
                                                            RE16  0.313   0.5758
                                                            RE17  0.164   0.6858
                                                            RE18  0.319   0.5723

TESTLET 5

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
0.003   0.9563      3.199   0.0737      3.041   0.0812      RE19  0.386   0.5346
                                                            RE20  3.764   0.0524
                                                            RE21  0.069   0.7924

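TESTLET 6

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
4.662   0.0308      1.631   0.2016      1.900   0.1681      RE1   0.993   0.3191
                                                            RE2   0.414   0.0049
                                                            RE3   0.484   0.4866
                                                            RE6   7.901   0.0049

TESTLET 7

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
4.092   0.0431      1.208   0.2717      1.190   0.2754      RE4   0.196   0.6580
                                                            RE5   0.508   0.4760
                                                            RE7   0.000   1.0000
                                                            RE10  0.195   0.6585
                                                            RE11  0.760   0.3833
                                                            RE12  4.938   0.0263
                                                            RE13  4.398   0.0360
                                                            RE14  0.165   0.6846
                                                            RE16  0.313   0.5758
                                                            RE17  0.164   0.6855
                                                            RE18  0.319   0.5722
                                                            RE19  0.386   0.5344
                                                            RE20  3.764   0.0524
                                                            RE21  0.069   0.7928

TESTLET 8

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
2.494   0.1143      0.030   0.8625      0.000   0.9862      RE8   1.890   0.1692
                                                            RE9   1.360   0.2435
                                                            RE15  0.009   0.9244

TESTLET 9

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
                                                            MA3   1.806   0.1790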

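TESTLET 10

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
                                                            MA7   0.679   0.4099

TESTLET 11

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
1.187   0.2759      1.994   0.1579      1.418   0.2337      MA1   1.420   0.2334

TESTLET 12

LDFA-DTLF           LDFA-DTLF           Mantel-DTLF         Mantel-Haenszel-DIF
(Non-Uniform)       (Uniform)           (Uniform)           (Uniform)
Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Chi-Sq  Prob    Sig Item  Chi-Sq  Prob    Sig
4.084   0.0433      0.044   0.8339      0.099   0.7530      MA2   0.735   0.3914
                                                            MA3   1.806   0.1790
                                                            MA21  2.983   0.0841
                                                            MA24  0.117   0.7329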

SUMMARY (N=1000)

TESTLET 13 88


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob |Sig Item Chi-Sq Prob Sig 1.508 0.2194 4.067 0.0437 1.881 0.1702 MAS 5.022 0.0250

MA8 0.145 0.7038 MA9 1.767 0.1837 MA10 2.236 0.1348 MA12 6.136 0.0132 MA13 0.066 0.7967 MA16 1.481 0.2236 MA17 0.155 0.6941 MA18 3.532 0.0602 MA19 4.046 0.0443 MA20 13.339 0.0003 • *

MA22 0.388 0.5335 MA23 3.179 0.0746 MA28 8.339 0.0039 MA30 9.404 0.0022 MA31 0.120 0.7287 MA32 0.079 0.7782 MA33 0.001 0.9748 MA36 1.530 0.2161

TESTLET 14


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

1.866 0.1719 2.329 0.1270 2.166 0.1411 MA6 1.635 0.2010

MA35 0.387 0.5339

TESTLET 15


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

0.031 0.8602 16.442 [illegible] 15.496 0.0001 MA11 2.658 0.1030 MA25 13.298 0.0003 *

MA37 3.044 0.0810 MA38 0.742 0.3891

SUMMARY (N=1000)

TESTLET 16

LDFA-DTLF (Non-Uniform) LDFA-DTLF (Uniform) Mantel-DTLF (Uniform) Mantel-Haenszel-DIF (Uniform)

Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

1.098 0.2947 0.110 0.7401 0.211 0.6457 MA1 1.420 0.2334 MA3 1.806 0.1790 MA5 5.022 0.0250 MA6 1.635 0.2010 MA8 0.145 0.7034 MA9 1.767 0.1837 MA12 6.136 0.0132 MA13 0.006 0.9362 MA15 3.266 0.0707 MA16 1.481 0.2236 MA17 0.155 0.6941 MA18 3.532 0.0602 MA19 4.046 0.0443 MA22 0.388 0.5335 MA25 13.298 0.0003 * MA34 1.867 0.1718 MA40 0.059 0.8084

TESTLET 17


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

2.114 0.1460 0.889 0.3457 0.893 0.3447 MA2 0.735 0.3914 MA4 0.331 0.5652 MA7 0.679 0.4099 MA10 2.236 0.1348 MA11 2.658 0.1030 MA14 0.393 0.5307 MA20 13.339 0.0003 * MA21 2.983 0.0841 MA24 0.117 0.7329 MA26 0.047 0.8293 MA27 3.428 0.0641 MA29 0.871 0.3507 MA31 0.120 0.7287 MA32 0.079 0.7782 MA33 0.001 0.9748 MA36 1.530 0.2161 MA37 3.044 0.0810 MA38 0.742 0.3890 MA39 1.372 0.2415

SUMMARY (N=1000)

TESTLET 18


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

2.262 0.1326 3.890 0.0486 3.183 0.0744 MA23 3.179 0.0746 MA28 8.339 0.0039 MA30 9.404 0.0022 MA35 0.387 0.5339

TESTLET 19


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

18.912 0.0000 * 16.044 0.0001 * 16.048 0.0001 SC1 1.251 0.2634

SC2 1.425 0.2326 SC5 0.001 0.9748 SC7 4.245 0.0394 SC8 2.681 0.1016 SC12 3.357 0.0669 SC18 0.694 0.4048 SC21 1.359 0.2437

TESTLET 20


LDFA-DTLF (Uniform)




3.900 0.0483 1.801 0.1796 1.094 0.2956 SC3 5.699 0.0170 SC10 0.003 0.9563 SC11 0.681 0.4092 SC14 0.004 0.9496 SC19 5.372 0.0205 SC20 0.056 0.8136 SC23 1.308 0.2528

TESTLET 21


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

11.756 0.0006 11.025 0.0009 11.505 0.0007 SC4 17.640 0.0000

SC24 0.005 0.9436

SUMMARY (N=1000)

TESTLET 22


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

4.293 0.0383 13.271 0.0003 * 9.852 0.0017 SC6 0.861 0.3535

SC9 0.001 0.9748 SC13 2.593 0.1073 SC15 10.161 0.0014 SC16 1.693 0.1932 SC17 0.001 0.9748 SC22 2.728 0.0986 SC25 0.806 0.3693

TESTLET 23


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

4.268 0.0388 2.436 0.1186 2.673 0.1020 SC1 1.251 0.2634

SC4 17.640 0.0000 * SC13 2.593 0.1073 SC14 0.004 0.9496 SC20 0.056 0.8129 SC22 2.728 0.0986 SC23 1.308 0.2528 SC25 0.806 0.3693

TESTLET 24


LDFA-DTLF (Uniform)


Mantel-Haenszel-DIF (Uniform)

Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

11.809 0.0006 1.719 0.1898 2.061 0.1511 SC2 1.425 0.2326

SC5 0.001 0.9748 SC6 0.861 0.3535 SC8 2.681 0.1016 SC9 0.001 0.9748 SC10 0.003 0.9563 SC12 3.357 0.0669 SC15 10.161 0.0014 SC18 0.694 0.4048 SC19 5.372 0.0205

SUMMARY (N=1000)

TESTLET 25


LDFA-DTLF (Non-Uniform) LDFA-DTLF (Uniform) Mantel-DTLF (Uniform) Mantel-Haenszel-DIF (Uniform)

Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

11.478 0.0007 0.044 0.8339 0.026 0.8721 SC3 5.699 0.0170

SC7 4.245 0.0394 SC16 1.693 0.1932 SC17 0.001 0.9748 SC21 0.005 0.9436 SC24 0.005 0.9436

TESTLET 26

LDFA-DTLF (Non-Uniform) LDFA-DTLF (Uniform) Mantel-DTLF (Uniform) Mantel-Haenszel-DIF (Uniform)

Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

4.460 0.0347 10.277 0.0013 7.032 0.0080 SS5 5.372 0.0205

SS6 0.360 0.5485 SS7 6.295 0.0121 SS8 0.385 0.5347 SS9 5.178 0.0229

TESTLET 27

LDFA-DTLF (Non-Uniform) LDFA-DTLF (Uniform) Mantel-DTLF (Uniform) Mantel-Haenszel-DIF (Uniform)

Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

5.052 0.0246 1.122 0.2895 1.984 0.1590 SS1 1.072 0.3005

SS12 12.486 0.0004 SS26 4.630 0.0314

TESTLET 28


LDFA-DTLF (Non-Uniform) LDFA-DTLF (Uniform) Mantel-DTLF (Uniform) Mantel-Haenszel-DIF (Uniform)

Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

5.044 0.0247 18.442 0.0000 * 7.765 0.0053 SS2 1.399 0.2369

SS4 [illegible] SS10 0.041 0.8395

SS11 9.806 0.0017 SS13 1.446 0.2292 SS14 0.171 0.6792 SS17 0.025 0.8744 SS18 0.476 0.4902 SS20 1.556 0.2123 SS21 30.096 0.0000 SS25 4.742 0.0294 SS27 0.714 0.3982 SS28 1.892 0.1690 SS29 0.976 0.3233

SUMMARY (N=1000)

TESTLET 29


LDFA-DTLF (Uniform)





p < .001

SUMMARY (N=2000)

TESTLET 1


LDFA-DTLF (Uniform)




RE2 0.001 0.9748 RE3 1.279 0.2580 RE4 3.937 0.0472 RE5 3.489 0.0618

TESTLET 2


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

12.647 0.0004 * 26.546 0.0000 * 22.254 0.0000 RE6 14.995 0.0001 *

RE7 1.630 0.2017 RE8 10.171 0.0014

TESTLET 3


LDFA-DTLF (Uniform)




RE10 0.524 0.4692 RE11 5.551 0.0185 RE12 14.701 0.0001 RE13 12.278 0.0005 RE14 0.831 0.3619

TESTLET 4


7.836 0.0051 2.614 0.1059 3.454 0.0631 RE15 3.323 0.0683 RE16 4.439 0.0351 RE17 0.147 0.7011 RE18 0.054 0.8159

TESTLET 5


0.001 0.9748 3.445 0.0634 3.986 0.0459 RE19 0.163 0.6866 RE20 7.800 0.0052 RE21 0.055 0.8146

SUMMARY (N=2000)

TESTLET 6


LDFA-DTLF (Uniform)




RE2 0.001 0.9748 RE3 1.279 0.2580 RE6 14.995 0.0001

TESTLET 7

LDFA-DTLF (Non-Uniform) LDFA-DTLF (Uniform) Mantel-DTLF (Uniform) Mantel-Haenszel-DIF (Uniform)

Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

9.8 0.0017 1.415 0.2342 2.490 0.1146 RE4 3.937 0.0472 RE5 3.489 0.0618 RE7 1.630 0.2017 RE10 0.524 0.4691 RE11 5.551 0.0185 RE12 14.701 0.0001 RE13 12.278 0.0005 RE14 0.831 0.3620 RE16 4.439 0.0351 RE17 0.147 0.7014 RE18 0.054 0.8162 RE19 0.163 0.6864 RE20 7.8 0.0052 RE21 0.055 0.8146

TESTLET 8


7.854 0.0051 0.322 0.5704 0.054 0.8159 RE8 10.171 0.0014 RE9 0.868 0.3516 RE15 3.323 0.0683

TESTLET 9

LDFA-DTLF LDFA-DTLF Mantel-DTLF Mantel-Haenszel-DIF (Non-Uniform) (Uniform) (Uniform) (Uniform) Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig 11.815 0.0006 0.274 0.6007 0.513 0.4738 MA2 1.1247 0.2889

MA3 3.392 0.0655

SUMMARY (N=2000)

TESTLET 10


LDFA-DTLF (Uniform)




MA7 2.171 0.1407

TESTLET 11

LDFA-DTLF (Non-Uniform)

LDFA-DTLF (Uniform)





TESTLET 12


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

14.158 0.0002 * 0.146 0.7024 0.137 0.7112 MA2 1.125 0.2889

MA3 3.392 0.0655 MA21 7.080 0.0078 MA24 0.614 0.4335

SUMMARY (N=2000)

TESTLET 13


LDFA-DTLF (Uniform)




MA8 0.111 0.7391 MA9 6.881 0.0087 MA10 2.955 0.0856 MA12 12.417 0.0004 *

MA13 2.433 0.1188 MA16 2.511 0.1131 MA17 0.191 0.6624 MA18 1.191 0.2752 MA19 10.372 0.0013 MA20 16.525 0.0000 *

MA22 1.016 0.3135 MA23 1.800 0.1797 MA28 21.591 0.0000 MA30 9.772 0.0018 MA31 0.099 0.7528 MA32 0.001 0.9712 MA33 0.232 0.6304 MA36 0.119 0.7297

TESTLET 14


6.611 0.0101 3.698 0.0545 2.875 0.0900 MA6 0.0089 0.9248 MA35 6.156 0.0131

TESTLET 15


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

2.132 0.1443 33.237 0.0000 * 29.412 0.0000 * MA11 5.360 0.0206

MA25 15.162 0.0001 * MA37 8.462 0.0036 MA38 3.199 0.0737

SUMMARY (N=2000)

TESTLET 16


LDFA-DTLF (Uniform)




MA3 3.392 0.0655 MA5 6.746 0.0094 MA6 0.009 0.9248 MA8 0.111 0.7391 MA9 6.881 0.0087 MA12 12.417 0.0004 *


TESTLET 17


LDFA-DTLF (Uniform)




MA4 0.092 0.7619 MA7 2.171 0.1407 MA10 2.955 0.0856 MA11 5.360 0.0206 MA14 6.116 0.0134 MA20 16.525 0.0000 *

MA21 7.080 0.0078 MA24 0.614 0.4335 MA26 0.674 0.4118 MA27 3.686 0.0549 MA29 0.036 0.8499 MA31 0.099 0.7528 MA32 0.001 0.9712 MA33 0.232 0.6304 MA36 0.119 0.7297 MA37 8.462 0.0036 MA38 3.199 0.0737 MA39 2.181 0.1397

SUMMARY (N=2000)

TESTLET 18


9.236 0.0024 4.456 0.0348 3.771 0.0521 MA23 1.800 0.1797 MA28 21.591 0.0000 MA30 9.772 0.0018 MA35 0.099 0.7528

TESTLET 19

LDFA-DTLF (Non-Uniform) LDFA-DTLF (Uniform) Mantel-DTLF (Uniform) Mantel-Haenszel-DIF (Uniform)

Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

45.656 0.0000 * 29.532 0.0000 * 26.839 0.0000 SC1 0.092 0.7616

SC2 1.039 0.3081 SC5 0.263 0.6081 SC7 8.857 0.0029 SC8 6.878 0.0087 SC12 10.901 0.0010 SC18 4.127 0.0422 SC21 1.385 0.2393

TESTLET 20


LDFA-DTLF (Uniform)




SC10 0.014 0.9058 SC11 3.489 0.0618 SC14 0.465 0.4953 SC19 4.318 0.0377 SC20 0.075 0.7842 SC23 0.002 0.9643

TESTLET 21


LDFA-DTLF (Uniform)


Mantel-Haenszel-DIF (Uniform)


SC24 0.049 0.8248

SUMMARY (N=2000)

TESTLET 22


LDFA-DTLF (Uniform)


Mantel-Haenszel-DIF (Uniform)

Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

14.272 0.0002 * 32.702 0.0000 27.437 0.0000 SC6 3.544 0.0598

SC9 0.003 0.9563 SC13 12.118 0.0005 * SC15 14.477 0.0001 * SC16 6.666 0.0098 SC17 0.174 0.6766 SC22 6.258 0.0124 SC25 0.216 0.6421

TESTLET 23


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig

16.097 0.0001 * 9.688 0.0019 10.159 0.0014 SC1 0.092 0.7616

SC4 16.941 0.0000 SC13 12.118 0.0005 * SC14 0.465 0.4953 SC20 0.075 0.7842 SC22 6.258 0.0124 SC23 0.002 0.9643 SC25 0.216 0.6421

TESTLET 24

LDFA-DTLF LDFA-DTLF Mantel-DTLF Mantel-Haenszel-DIF (Non-Uniform) (Uniform) (Uniform) (Uniform) Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig 26.643 0.0000 3.58 0.0585 3.287 0.0698 SC2 1.039 0.3081

SC5 0.263 0.6081 SC6 3.544 0.0598 SC8 6.878 0.0087 SC9 0.003 0.9563 SC10 0.014 0.9058 SC12 10.901 0.0010

SC15 14.477 0.0001 SC18 4.127 0.0422 SC19 4.318 0.0377

SUMMARY (N=2000)

TESTLET 25


LDFA-DTLF (Uniform)



Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig 26.963 0.0000 0.092 0.7616 0 1.0000 SC3 5.343 0.0208

SC7 8.857 0.0029 SC16 6.666 0.0098 SC17 0.174 0.6766 SC21 1.385 0.2393 SC24 0.049 0.8248

TESTLET 26


LDFA-DTLF (Uniform)




SS6 3.613 0.0573 SS7 7.163 0.0074 SS8 1.877 0.1707 SS9 5.8246 0.0158

TESTLET 27

LDFA-DTLF LDFA-DTLF Mantel-DTLF Mantel-Haenszel-DIF (Non-Uniform) (Uniform) (Uniform) (Uniform) Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig 10.573 0.0011 3.293 0.0696 4.655 0.0310 SS1 2.621 0.1055

SS12 20.498 0.0000 SS26 5.296 0.0214

TESTLET 28

LDFA-DTLF LDFA-DTLF Mantel-DTLF Mantel-Haenszel-DIF (Non-Uniform) (Uniform) (Uniform) (Uniform) Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig 12.141 0.0005 35.608 0.0000 15.842 0.0001 SS2 0.181 0.6704

SS4 2.667 0.1025 SS10 0.213 0.6444 SS11 12.893 0.0003 SS13 6.154 0.0131 SS14 0 1.0000 SS17 1.78 0.1821 SS18 0.156 0.6929 SS20 0.111 0.7390 SS21 54.491 0.0000 SS25 3.486 0.0619 SS27 2.468 0.1162 SS28 10.486 0.0012 SS29 0.016 0.8993

SUMMARY (N=2000)

TESTLET 29

LDFA-DTLF LDFA-DTLF Mantel-DTLF Mantel-Haenszel-DIF (Non-Uniform) (Uniform) (Uniform) (Uniform) Chi-Sq Prob. Sig Chi-Sq Prob Sig Chi-Sq Prob Sig Item Chi-Sq Prob Sig 17.793 0.0000 23.936 0.0000 18.161 0.0000 SS3 2.922 0.0874


p < .001
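The Prob columns in the tables above appear consistent with the upper-tail probability of a chi-square statistic on one degree of freedom. The following is a minimal Python sketch of that conversion, not the procedure used in the study: the one-degree-of-freedom assumption, the function names, and the use of the .001 threshold for the Sig marker are all assumptions made for illustration.

```python
import math


def chi_square_prob_1df(chi_sq):
    """Upper-tail probability for a chi-square value, ASSUMING 1 degree of freedom."""
    # For 1 df, P(X > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_sq / 2.0))


def sig_flag(prob, alpha=0.001):
    """Return '*' when the probability is below alpha (assumed footnote threshold)."""
    return "*" if prob < alpha else ""


# Chi-square values taken from the N=1000 tables (SS5, SS12, SS21).
for chi_sq in (5.372, 12.486, 30.096):
    prob = chi_square_prob_1df(chi_sq)
    print(f"{chi_sq:8.3f} {prob:.4f} {sig_flag(prob)}")
```

Under this assumption the tabled pairs 5.372 / 0.0205 and 12.486 / 0.0004 are reproduced to four decimals, which is also a convenient check on entries damaged in reproduction.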

APPENDIX D

SPSS SAMPLE PROGRAM FOR LOGISTIC DISCRIMINANT FUNCTION ANALYSIS

Sample SPSS Program for LDFA

TITLE 'LDFA, SAMPLE=500, GROUP=GENDER'
GET FILE='SAMP500 SYS A'
COMPUTE OBSCR=RESCORE
IF GENDER=1 GROUP=1
IF GENDER=2 GROUP=0
ADD VALUE LABELS GROUP 0 'FEMALE' 1 'MALE'
****** FIRST TESTLET
COMPUTE TLTSCR=T1
COMPUTE INTERACT=OBSCR*TLTSCR
*** CALCULATE MEANS ****
COMPUTE TEMPVAR=1
AGGREGATE OUTFILE='TEMP MEAN A'
  /BREAK=TEMPVAR
  /MEANOBS=MEAN(OBSCR)
  /MEANTLT=MEAN(TLTSCR)
MATCH FILES FILE=*
  /TABLE='TEMP MEAN A'
  /BY TEMPVAR
******* CENTER THE VARIABLES TO REDUCE COLLINEARITY
******* BY SUBTRACTING THE MEAN
COMPUTE CENOBS=OBSCR-MEANOBS
COMPUTE CENTLT=TLTSCR-MEANTLT
COMPUTE CENINT=CENOBS * CENTLT
*** LOGISTIC DISCRIMINANT FUNCTION ANALYSIS
LOGISTIC REGRESSION GROUP WITH CENOBS CENTLT CENINT
  /METHOD=ENTER CENOBS
  /METHOD=ENTER CENTLT
  /METHOD=ENTER CENINT
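The centering step in the SPSS program can be illustrated outside SPSS. The sketch below is written in Python with invented scores for eight examinees (not NELS:88 data). It mirrors the CENOBS, CENTLT, and CENINT computations and compares the correlation between the observed score and the interaction term before and after centering. The helper names center and pearson are hypothetical, and the size of the reduction depends on the data used.

```python
from statistics import mean


def center(values):
    """Subtract the mean from every value (the CENOBS / CENTLT step)."""
    m = mean(values)
    return [v - m for v in values]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical observed total scores and testlet scores for eight examinees.
obscr = [10, 12, 15, 18, 20, 22, 25, 27]
tltscr = [1, 3, 2, 4, 2, 3, 4, 3]

# Interaction term built from raw scores (INTERACT in the SPSS program).
raw_interaction = [o * t for o, t in zip(obscr, tltscr)]

# Interaction term built from centered scores (CENINT in the SPSS program).
cenobs = center(obscr)
centlt = center(tltscr)
cenint = [o * t for o, t in zip(cenobs, centlt)]

r_raw = pearson(obscr, raw_interaction)
r_centered = pearson(cenobs, cenint)
print(f"raw: {r_raw:.3f}  centered: {r_centered:.3f}")
```

For these invented scores the predictor and its product term are much less correlated after centering, which is the collinearity reduction the program comments refer to.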

APPENDIX E

SAS SAMPLE PROGRAM FOR MANTEL SCORE TEST PROCEDURE

Sample SAS Program for Mantel Procedure

DATA;
  INFILE 'TLT02000 ASC A';
  INPUT GROUP RESTRATA T1-T8 MASTRATA T9-T18
        SCSTRATA T19-T25 SSSTRATA T26-T29;
PROC FREQ;
  TABLES SSSTRATA * GROUP * T26 / NOPRINT CMH;
  TABLES SSSTRATA * GROUP * T27 / NOPRINT CMH;
  TABLES SSSTRATA * GROUP * T28 / NOPRINT CMH;
  TABLES SSSTRATA * GROUP * T29 / NOPRINT CMH;
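The statistic requested from PROC FREQ can also be written out directly. The following Python sketch computes a Mantel (1963) score-test chi-square for an ordinal testlet score across matching strata, using the same observed, expected, and variance terms as the Pascal functions in Appendix F but without a continuity correction. The function name, the data layout (one pair of score-to-count dictionaries per stratum, focal group first), and the example counts are assumptions made for illustration and are not taken from the study.

```python
def mantel_chi_square(strata):
    """Mantel score-test chi-square (1 df) over matching strata.

    strata: list of (focal, reference) pairs; each member maps an
    ordinal score to the number of examinees with that score.
    """
    observed = expected = variance = 0.0
    for focal, reference in strata:
        n_f = sum(focal.values())
        n_r = sum(reference.values())
        n = n_f + n_r
        if n < 2 or n_f == 0 or n_r == 0:
            continue  # a stratum with one group missing carries no information
        scores = set(focal) | set(reference)
        total = {y: focal.get(y, 0) + reference.get(y, 0) for y in scores}
        sum_y = sum(y * c for y, c in total.items())
        sum_y2 = sum(y * y * c for y, c in total.items())
        # Observed focal-group score sum and its hypergeometric expectation.
        observed += sum(y * c for y, c in focal.items())
        expected += n_f / n * sum_y
        # Variance of the focal-group score sum within the stratum.
        variance += n_r * n_f / (n * n * (n - 1)) * (n * sum_y2 - sum_y ** 2)
    if variance == 0:
        return float("nan")
    return (observed - expected) ** 2 / variance


# Hypothetical single stratum with a dichotomous score.
example = [({0: 10, 1: 20}, {0: 20, 1: 10})]
print(round(mantel_chi_square(example), 4))
```

With a single 2 x 2 stratum the result equals the Pearson chi-square multiplied by (N - 1)/N, which gives a quick hand check of the variance term.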

APPENDIX F

PASCAL SAMPLE PROGRAM FOR MANTEL-HAENSZEL PROCEDURE

Sample Pascal Program for Mantel-Haenszel Procedure

Program MantelHaenszel (datafile,outfile,output);
{
* Calculates Mantel-Haenszel chi-square statistic for single-item reading DIF.
}
Uses Crt;
const
  SampSize=500; {EDIT CONSTANTS AS NEEDED}
  Dat='\procomm\[email protected]';
  Out='manre.dat';
  Subject='re';
  MaxStrata=5;
  MaxItem=21;
Type
  FreqArrayType=Array [0..1,0..1,1..MaxStrata,1..MaxItem] of integer;
Var
  Datafile,Outfile : Text;
  Freq: FreqArrayType;
  i,j,k,s: Integer; {i=group; j=item score; k=strata or total score; s=studied item}
  Mantel, Denominator : Real;
(*******************)
Procedure GetData (Var Datafile,Outfile:Text; Var Freq:FreqArrayType);
Var dummy, i, j, k : Integer;
Begin
  For i:=0 to 1 do {Initialize Array}
    For j:=0 to 1 do
      For k:=1 to MaxStrata do
        For s:=1 to MaxItem do
          Freq[i,j,k,s]:=0;
  Assign (Datafile,dat); {Prepare Files}
  Assign (Outfile,out);
  Reset(Datafile);
  Rewrite(Outfile);
  While not eof (Datafile) do {Fetch Data}
  Begin
    While not eoln (Datafile) do
    Begin
      Read (Datafile,i,k);
      For s:=1 to MaxItem do
      Begin
        Read (Datafile, j);
        Inc(Freq[i,j,k,s]);
      End;
    End;
    Readln (Datafile);
  End;
  { ADD THIS SECTION FOR DEBUGGING
  For s:=1 to MaxItem do
  Begin
    For k:=1 to MaxStrata do
    Begin
      Writeln ('Strata',k:2);
      For i:=0 to 1 do
      Begin
        For j:=0 to 1 do
          Write (Output,Freq[i,j,k,s]:4);
        Writeln;
      End;
      Writeln;
    End;
  End;
  }
End;
(*******************)
Function WtdFocalSum (k,s:Integer):Real;
Var WFS: Real;
    i,j: Integer;
Begin
  WFS:=0;
  i:=0; {focal group is coded 0}
  For j:=0 to 1 do
    WFS:=WFS + ((j+1) * Freq[i,j,k,s]);
  WtdFocalSum:=WFS;
End;
(*******************)
Function ExpSum (k,s:Integer):Real;
Var TS, FS, WMS: Real;
    i,j: Integer;
Begin
  TS:=0; FS:=0; WMS:=0;
  For j:=0 to 1 do
    FS:=FS + Freq[0,j,k,s];
  For i:=0 to 1 do
    For j:=0 to 1 do
      TS:=TS + Freq[i,j,k,s];
  For j:=0 to 1 do
    WMS:=WMS + ((j+1) * (Freq[0,j,k,s] + Freq[1,j,k,s]));
  ExpSum:=FS/TS*WMS;
End;
(*******************)
Function VarSum (k,s:Integer):Real;
Var RS,FS,TS,SWMS,WMS,VS: Real;
    i,j: Integer;
Begin
  RS:=0; FS:=0; TS:=0; SWMS:=0; WMS:=0; VS:=0;
  For j:=0 to 1 do
  Begin
    RS:=RS + Freq[0,j,k,s];
    FS:=FS + Freq[1,j,k,s];
    WMS:=WMS + ((j+1) * (Freq[0,j,k,s] + Freq[1,j,k,s]));
    SWMS:=SWMS + ((j+1) * (j+1) * (Freq[0,j,k,s] + Freq[1,j,k,s]));
  End;
  TS:=FS + RS;
  VS:=(TS * SWMS) - (WMS * WMS);
  VarSum:= RS * FS / (TS * TS * (TS - 1)) * VS;
End;
(*******************)
Begin {main}
  ClrScr;
  Writeln ('Input File: ',Dat);
  Writeln ('Output File: ',Out);
  GetData (Datafile,Outfile,Freq);
  For s:=1 to MaxItem do
  Begin
    Mantel:=0.0;
    For k:=1 to MaxStrata do
      Mantel:=Mantel + WtdFocalSum(k,s) - ExpSum(k,s);
    Mantel:=Abs(Mantel) - 0.5; {correction for single item}
    Mantel:=Mantel * Mantel;
    Denominator:=0.0;
    For k:=1 to MaxStrata do
      Denominator:=Denominator + VarSum(k,s);
    Mantel:=Mantel / Denominator;
    Writeln;
    Write ('For Item':10,s:3); {Write Mantel to Screen}
    Write (': Mantel-Haenszel Chi Square: ':10);
    Writeln (Mantel:8:4);
    Write (OutFile,Subject,s:1,','); {Write Mantel to File}
    Writeln (OutFile,Mantel:8:4);
  End;
  Close (OutFile);
End. {main}

BIBLIOGRAPHY

Adema, J. J. (1991). The construction of customized two-stage tests. Journal of Educational Measurement, 27, 241-253.

Agresti, A. (1990). Categorical data analysis. New York: Wiley.

Aiken, S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: SAGE.

Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Belmont, CA: Wadsworth.

Angoff, W. H. (1972, September). A technique for the investigation of cultural differences. Paper presented at the annual meeting of the American Psychological Association, Honolulu. (ERIC Document Reproduction Service No. ED 069686)

Angoff, W. H. (1982). Use of difficulty and discrimination indices for detecting item bias. In R. A. Berk (Ed.), Handbook of methods for detecting test bias (pp. 96-116). Baltimore: Johns Hopkins University Press.

Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-23). Hillsdale, NJ: Lawrence Erlbaum Associates.

Angoff, W. H., & Ford, S. F. (1973). Item-race interaction on a test of scholastic aptitude. Journal of Educational Measurement. 10. 95-106.

Berk, R. A. (Ed.), (1982). Handbook of methods for detecting test bias. Baltimore: Johns Hopkins University Press.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more latent categories. Psychometrika. 37. 29-51.

Brookshire, W. K. (1993). Differential item functioning in the National Education Longitudinal Study of 1988 test battery. Paper presented at the Annual Meeting of the Southwest Educational Research Association, Austin, TX.

Camilli, G., & Shepard, L. A. (1987). The inadequacy of ANOVA for detecting test bias. Journal of Educational Statistics, 12(1), 87-99.

Cleary, T. A., & Hilton, T. L. (1968). An investigation of item bias. Educational and Psychological Measurement, 28, 61-75.

Cole, N. S. (1993). History and development of DIF. In P. W. Holland & H. Wainer (Eds.), Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.

Crehan, K. D., Sireci, S. G., Haladyna, T. M., & Henderson, P. A. (1993). A comparison of testlet reliability for polytomous scoring methods. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA.

Crocker, L. M., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Holt, Rinehart & Winston.

Dillon, G. F., Henzel, T. R., Klass, D. J., LaDuca, A., & Peskin, E. (1993). Presenting test items clustered around patient cases: Psychometric concerns and practical implications for a medical licensure program. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA.

Donoghue, J. R., & Allen, N. L. (1993). Thin versus thick matching in the Mantel-Haenszel procedure for detecting DIF. Journal of Educational Statistics. 18(2), 131-154.

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.

Dorans, N. J., & Kulik, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement. 23. 355-368.

Embretson, S. E. (Ed.), (1985). Test design: Developments in psychology and psychometrics. Orlando, FL: Academic Press.

Green, B. F. (1983). Adaptive testing by computer. In R.B. Ekstrom (Ed.), Measurement, technology, and individuality in education: New directions for testing and measurement No. 17. San Francisco: Jossey-Bass.

Green, B. F. (1988). Construct validity of computer-based tests. In H. Wainer & H. I. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum Associates.

Haladyna, T. M. (1992). The effectiveness of several multiple-choice formats. Applied Measurement in Education, 5, 73-88.

Hambleton, R. K., Zaal, J. N. & Pieters, J. P. M. (1991). Computerized adaptive testing: Theory, applications, and standards. In R. K. Hambleton & J. N. Zaal (Eds.). Advances in educational and psychological testing: Theory and applications. Boston: Kluwer Academic Publishers.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp.129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.

Holland, P. W., & Wainer, H. (Eds.) (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.

Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.

Kim, H., & Plake, B. S. (1993). Monte carlo simulation comparison of two-stage testing and computerized adaptive testing. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA.

Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive test. Applied Measurement in Education. 2, 359-375.

Kingsbury, G. G., Zara, A. R. & Houser, R. L. (1993). Procedures for using response latencies to identify unusual test performance in computerized adaptive tests. Paper presented at the Annual Meeting of AERA, Atlanta, GA.

Lam, T. L., & Foong, Y. Y. (1991). Development and evaluation of hierarchical testlets in two-stage tests using integer linear programming. Paper presented at the Annual Meeting of AERA, Chicago, IL.

Lewis, C., & Sheehan, K. (1990). Using bayesian decision theory to design a computerized mastery test. Applied Psychological Measurement. 14. 367-386.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Lunz, M. E., & Stahl, J. A. (1993). Test targeting and precision before and after review on computerized adaptive tests. Paper presented at the Annual Meeting of AERA, Atlanta, GA.

Mantel, N. (1963). Chi-square tests with one degree of freedom, extensions of the Mantel-Haenszel procedure. American Statistical Association Journal. 58, 690-700.

Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute. 22. 719-748.

Marascuilo, L. A., & Slaughter, R. E. (1981). Statistical procedures for identifying possible sources of item bias based on chi-square statistics. Journal of Educational Measurement, 18. 229-248.

McArthur, D. L. (Ed.), (1989). Alternative approaches to the assessment of achievement. Boston: Kluwer Academic.

Mellenberg, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics. 7, 105-118.

Miller, T. R., & Spray, J. A. (1992). A comparison of three methods for identifying nonuniform DIF in polytomously scored test items. Paper presented at the Psychometric Society Meeting, Columbus, OH.

Miller, T. R., & Spray, J. A. (1993). Logistic discriminant function analysis for DIF identification of polytomously scored items. Journal of Educational Measurement, 30, 107-122.

National Center for Education Statistics (1990). User's manual: National education longitudinal study of 1988. Publication of U.S. Department of Education: Office of Educational Research and Improvement. (NCES 90-464)

National Center for Education Statistics (1991). Technical Report: Psychometric report for the NELS:88 base year test battery. Publication of U.S. Department of Education: Office of Educational Research and Improvement. (NCES 91-468)

Norusis, M. J. (1990). SPSS advanced statistics user's guide. Chicago: SPSS.

Osterlind, S. J. (1983). Test item bias. Beverly Hills: Sage.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests: Expanded edition. Chicago: University of Chicago Press.

Reshetar, R. A., Norcini, J. J., & Shea, J. A. (1993). A simulated comparison of two content balancing and maximum information item selection procedures for an adaptive certification examination. Paper presented at the Annual Meeting of AERA, Atlanta, GA.

Rosenbaum, P. R. (1988). Items bundles. Psychometrika. 53. 349-359.

SAS Institute (1990). SAS procedures guide, version 6 (3rd ed.). Cary, NC: Author.

Scheuneman, J. (1979). A method of assessing bias in test items. Journal of Educational Measurement. 16(3), 143-152.

Sheehan, K., & Lewis, C. (1992). Computerized mastery testing with nonequivalent testlets. Applied Psychological Measurement. 16. 65-76.

Shepard, L., & Camilli, G. (1981). Comparison of procedures for detecting test-item bias with both internal and external ability criteria. Journal of Educational Statistics. 6(4), 317-375.

Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237-247.

Somes, G. W. (1986). The generalized Mantel-Haenszel statistic. The American Statistician. 40. 106-108.

SPSS Inc. (1990). SPSS reference guide. Chicago: Author.

Stahl, J. A., & Lunz, M. E. (1993). Assessing the extent of overlap of items among computerized adaptive tests. Paper presented at the Annual Meeting of AERA, Atlanta, GA.

Steinberg, L., Thissen, D., & Wainer, H. (1990). Validity. In Wainer, H. (Ed.) Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.

Thissen, D., & Steinberg, L. (1984). A response model for multiple choice items. Psychometrika, 49, 501-519.

Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567-577.

Thissen, D., Steinberg, L., & Mooney, J. (1989). Trace lines for testlets: A use of multiple-categorical response models. Journal of Educational Measurement, 26, 247-260.

Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. Braun (Eds.), Test validity (pp. 147-169). Hillsdale, NJ: Lawrence Erlbaum Associates.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P.W. Holland & H. Wainer (Eds.), Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.

Urry, V. W. (1977). Tailored testing: A successful application of latent trait theory. Journal of Educational Measurement. 14, 182-196.

Wainer, H. (1989). The future of item analysis. Journal of Educational Measurement. 26, 191-208.

Wainer, H. (1993). Some practical considerations when converting a linearly administered test to an adaptive format. Educational Measurement: Issues and Practices. 15-20.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.

Wainer, H., & Lewis, C. (1990). Toward a psychometrics for testlets. Journal of Educational Measurement, 27, 1-14.

Wainer, H., Kaplan, B., & Lewis, C. (1992). A comparison of the performance of simulated hierarchical and linear testlets. Journal of Educational Measurement, 29, 243-251.

Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection. Journal of Educational Measurement. 28. 197-219.

Wainer, H., Lewis, C., Kaplan, B., & Braswell, J. (1991). Building algebra testlets: A comparison of hierarchical and linear structures. Journal of Educational Measurement. 28, 311-324.

Wainer, H., Dorans, N., Flaugher, R., Green, B., Mislevy, R., Steinberg, L., & Thissen, D. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.

Walpole, R. E., & Myers, R. H. (1989). Probability and statistics for engineers and scientists: Fourth edition. New York: Macmillan.

Weiss, D. J. (Ed.) (1983). New horizons in testing. New York: Academic Press.

Weiss, D. J., & Yoes, M. E. (1991). Item response theory. In R.K. Hambleton & J.N. Zaal (Eds.), Advances in educational and psychological testing: Theory and applications. Boston: Kluwer Academic.

Welch, C., & Hoover, H. D. (1993). Procedures for extending item bias detection techniques to polytomously scored items. Applied Measurement in Education. 6(1),1-19.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: Mesa Press.

Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessment of differential item functioning for performance tasks. Journal of Educational Measurement, 30, 233-251.
