THE GENERALIZATION OF THE LOGISTIC DISCRIMINANT
FUNCTION ANALYSIS AND MANTEL SCORE TEST
PROCEDURES TO DETECTION OF
DIFFERENTIAL TESTLET
FUNCTIONING
DISSERTATION
Presented to the Graduate Council of the
University of North Texas in Partial
Fulfillment of the Requirements
For the degree of
DOCTOR OF PHILOSOPHY
by
Mary E. Kinard, B.A., M.Ed.
Denton, Texas
August, 1994
Kinard, Mary E., The Generalization of the Logistic
Discriminant Function Analysis and Mantel Score Test
Procedures to Detection of Differential Testlet Functioning.
Doctor of Philosophy (Educational Research), August, 1994,
117 pp., 13 tables, bibliography, 80 titles.
Two procedures for detection of differential item
functioning (DIF) for polytomous items were generalized to
detection of differential testlet functioning (DTLF). The
methods compared were the logistic discriminant function
analysis procedure for uniform and non-uniform DTLF (LDFA-U
and LDFA-N), and the Mantel score test procedure. Further
analysis included comparison of results of DTLF analysis
using the Mantel procedure with DIF analysis of individual
testlet items using the Mantel-Haenszel (MH) procedure.
Over 600 chi-squares were analyzed and compared for
rejection of null hypotheses.
Samples of 500, 1,000, and 2,000 were drawn by gender
subgroups from the NELS:88 data set, which contains
demographic and test data from over 25,000 eighth graders.
Three types of testlets (totalling 29) from the NELS:88 test
were analyzed for DTLF. The first type, the common passage
testlet, followed the conventional testlet definition:
items grouped together by a common reading passage, figure,
or graph. The other two types were based upon common
content and common process, as outlined in the NELS test
specification.
Comparison of the LDFA-U and Mantel methods for null
hypothesis rejection yielded similar results, except in the
common content testlets. As expected, no pattern was
evident in comparisons of the LDFA-N to either of the
uniform detection methods. The number of testlets flagged
for DTLF increased as the sample size increased, and most
DTLF was indicated in the common content testlets.
In comparing item significance to corresponding testlet
significance, the situation was considered inconsistent if
the testlet and at least half of the items did not match in
rejection of the null hypothesis. Most inconsistencies
occurred in the common content type of testlets.
Further research was suggested in the following areas:
DIF and DTLF comparison, implicit testlets, polytomous
models, testlet design and scoring, thick and thin matching,
and DTLF post hoc procedures.
TABLE OF CONTENTS
Page
LIST OF TABLES iv
Chapter
I. INTRODUCTION 1
    Problem Addressed in the Study
    Significance of the Study
    Limitations
    Research Questions
II. REVIEW OF THE LITERATURE 7
    Testlets in Computerized Adaptive Testing
    Testlet Structure
    Advantages of Testlets
    Other Testlet Issues
    Differential Item Functioning
    Differential Item Functioning Methods
    Uniform and Non-Uniform DIF
    Differential Testlet Functioning
    Polytomous Response DIF
    Polytomous Observed-Score Procedures
    Thick and Thin Matching
III. METHOD OF RESEARCH 32
    Mantel Score Test Procedure
    Logistic Discriminant Function Analysis Procedure
    NELS:88 Data Set
IV. PRESENTATION AND ANALYSIS OF DATA 44
    Data Analysis Results Research Questions 1 through 4
V. FINDINGS, CONCLUSIONS, AND RECOMMENDATIONS 59
    Findings
    Conclusions
    Recommendations
APPENDIX 70
    A. Testlet Naming Convention 70
    B. Testlet Items 72
    C. Chi-Square and Probability Values for Testlets and Items 75
    D. SPSS Sample Program for Logistic Discriminant Function Analysis 103
    E. SAS Sample Program for Mantel Score Test Procedure 105
    F. Pascal Sample Program for Mantel-Haenszel Procedure 107
BIBLIOGRAPHY 111
LIST OF TABLES
Table Page
1. Factors Varied in the Study 32
2. Frequencies: kth Level of Ability Variable 35
3. NELS:88 Testlet Types in Each Section 41
4. NELS:88 Number of Items in Testlets 42
5. Non-Matches in H0 Rejection for LDFA-U vs. Mantel-U (p < .001) 49
6. Chi-Squares and Probabilities for Non-Matches 50
7. Non-Matches in H0 Rejection for LDFA-N vs. LDFA-U (p < .001) 51
8. Non-Matches in H0 Rejection for LDFA-N vs. Mantel-U (p < .001) 52
9. Summary of Significant Chi-Squares (p < .001) 53
10. Occurrences of Significant Chi-Squares (p < .001) 54
11. Items Showing Significance at p < .001 56
12. Inconsistencies Between Testlets and Associated Items 57
13. Ratio of Significant Items to Testlet Items in Flagged Cases 58
CHAPTER 1
INTRODUCTION
The fundamental unit in test construction can be
thought of as a group of related items rather than a single
item. Items which are grouped together by a reading
passage, graph, or other common feature can be referred to
as a testlet. Current literature suggests a trend toward
tests which use the testlet rather than the item as the unit
of analysis (Dillon, Henzel, Klass, LaDuca, & Peskin, 1993;
Haladyna, 1992; Sireci, Thissen, & Wainer, 1991; Wainer &
Kiely, 1987).
Adaptive testing is a specific area where testlets are
beneficial. Many of the psychometric problems associated
with computerized adaptive testing (CAT), such as context
effects and item ordering, can be alleviated by using the
testlet as the interchangeable unit (Wainer & Kiely, 1987;
Wainer & Lewis, 1990). As a result, the screening of
testlets to be included in a testlet pool requires
generalization of item screening techniques to testlet
screening techniques.
In particular, testing specialists must assure that
units (items or testlets) within the pool for adaptive
testing do not function differently for subgroups of
examinees. In the case of items, each item must be
scrutinized for differential item functioning (DIF)
(Steinberg, Thissen, & Wainer, 1990). For example, when
males and females have been matched on the
ability being measured by the test, any item which shows
differential impact between genders is eliminated from the
item pool. Correspondingly, testlets must be examined for
differential testlet functioning (DTLF).
In this study some appropriate statistical methods
currently used for DIF are generalized to DTLF. The hope is
that test developers will use methods explored in this study
to ensure test fairness as more tests use the testlet as the
unit of analysis. Testlets may be screened for differential
functioning using data from fixed testlet-based tests.
Those testlets which do not show DTLF may be included in
future testlet pools for selection during testlet-based
adaptive testing.
Problem Addressed in the Study
Simply using DIF methodology for the detection of DTLF
is inappropriate. The items which are grouped because of
common subject matter, such as a reading passage, are not
independent of one another. There is an item dependency
within testlets which must be considered in psychometric
analyses. In other areas of psychometrics, such as
reliability studies, it has been shown that treatment of
dependent items as independent ignores an important
statistical component and results in erroneous conclusions
(Sireci et al., 1991).
In the past year, research regarding new methodologies
for detection of DIF in polytomously scored items has been
reported (Miller & Spray, 1993; Welch & Hoover, 1993; Zwick,
Donoghue, & Grima, 1993). A polytomous score has nominal or
ordinal level response capability, as opposed to dichotomous
scoring where the score has only two possibilities, correct
or incorrect. Most of the previous DIF research has been
focused upon dichotomous scoring only (Angoff, 1993; Berk,
1982; Osterlind, 1983). Generalization of polytomous DIF
methodologies to DTLF detection is the next logical step.
Only one research study (Wainer, Sireci, & Thissen,
1991) has been reported in the area of differential testlet
functioning (DTLF). The researchers adapted a method (Bock,
1972) which is potentially less powerful than some of the
other polytomous DIF procedures. The response level was
analyzed as nominal, thereby wasting information available
in the testlet scores. The testlet score used in analysis
may be considered ordinal or interval level.
There is a need for studies which analyze DTLF by
generalization of polytomous DIF methods which use more of
the information available in testlet scores. In this study
an attempt is made to compare two of the potentially most
powerful polytomous DIF procedures as generalized to
exploration of DTLF. The methods compared are the logistic
discriminant function analysis and the Mantel score test
procedures.
Significance of the Study
Test fairness has become an important issue in
psychometrics in the past decade, both for test makers and
for test takers. Cole (1993) reported that this concern
grew rapidly in the 1960s, when opportunity for
equality in systems such as education and employment was
visibly questioned. The role of the testing community in
the equality effort is to be a neutral reporter of what it
finds.
Wainer summarized three major components of analysis to
assure test fairness: (a) Reviews of each item by subject
matter experts and demographic subgroup representatives; (b)
comparisons of validity characteristics for each major
subgroup; and (c) for each item, extensive statistical
analysis of relative performance of major demographic
subgroups (Wainer, 1993). This study concerns extensions of
the third component.
By exploring techniques to analyze relative performance
using two of the current major psychometric trends,
computerized adaptive testing and testlets, this study
contributes to the leading edge of both theory and
application of psychometrics in the attempt to make the
testing process fair for every individual.
Limitations
This study was limited in complexity by the following
factors. Two polytomous observed score methods for DTLF
detection were used: the Mantel score test procedure and the
logistic discriminant function analysis procedure. Three
stratified random samples of 500, 1000, and 2000 were
compared, and were drawn by demographic subgroups from the
NELS:88 data set (National Center for Education Statistics
[NCES], 1991). Linear explicit and implicit testlets from
the NELS:88 test were analyzed for differential functioning,
by whole testlet and by single items within each testlet.
Uniform and non-uniform DTLF were sought in the testlets in
the NELS:88 data set.
Research Questions
To fulfill the purpose of this study, the following
research questions were answered.
1. Do the Mantel score test of conditional
independence procedure and the logistic discriminant
function analysis procedure detect the same differential
testlet functioning in the same testlets?
2. Do the Mantel score test of conditional
independence procedure and the logistic discriminant
function analysis procedure detect both uniform and non-
uniform differential testlet functioning to the same extent?
3. To what extent does variation in sample size
influence detection of differential testlet functioning?
4. How do the results of differential item
functioning differ from differential testlet functioning
when the Mantel score test of conditional independence
procedure is used for both analyses?
CHAPTER 2
REVIEW OF THE LITERATURE
The fundamental unit in test construction can be
thought of as a group of related items rather than a single
item. Items which are grouped together by a reading
passage, graph, or other common feature can be referred to
as a testlet. Current literature suggests a trend toward
tests which use the testlet as the unit of analysis.
Wainer and Kiely (1987), who coined the term testlet,
suggested substituting the testlet for the single item as
the unit of analysis in test development. They defined a
testlet as "a group of items related to a single content
area that is developed as a unit and contains a fixed number
of predetermined paths that an examinee may follow" (p.
190).
Traditionally, the testlet format has been used for
reading tests where many items share a common reading
passage and for other tests such as mathematics or science
tests where a few items refer to the same diagram. However,
psychometric analyses have been focused upon only the item
as the unit of analysis until recently.
Indications are that future tests will be based upon
larger tasks in order to mirror real-world tasks more
closely. Testlets provide a more realistic testing
situation for a world where tasks are interrelated rather
than separated. The testlet-based test is a better
representation of the performance being assessed (Sireci,
Thissen, & Wainer, 1991).
Recent school reform literature has been critical of
multiple choice items which tend to measure trivial recall
cognition levels. The testlet format is generally believed
to measure higher level thinking than a separate item format
(Haladyna, 1992).
Testlets in Computerized Adaptive Testing
One current use of testlet-based testing is in the area
of adaptive testing, particularly in the increasing use of
computerized adaptive testing (CAT). Adaptive testing is
not new. In the early days of testing, a subjective and
expensive method of individualized testing was used. The
examiner made knowledgeable judgements about a test-taker's
proficiency level according to the person's responses to
items. With a skillful examiner, the focus of questions was
at or near the person's ability level. This testing method
was one of the first types of adaptive tests (Wainer &
Kiely, 1987).
As computers have moved out of the basement and on to
the desk top, it is not surprising to find the testing
community taking advantage of the new capabilities
available, particularly for adaptive testing. The basic
idea behind computerized adaptive testing is that each
examinee receives a customized set of items, geared directly
and accurately to the individual's proficiency level on the
trait that is being measured.
In the most general CAT method, four steps are
followed: (a) An item of medium difficulty chosen from a
large pool of items is presented first. (b) Depending on
whether the response is correct or incorrect, an initial
estimate of the ability and accuracy is calculated,
according to the algorithm that has been programmed.
(c) The next item is closer to the ability level of the
examinee. In general, more-difficult items are given after
correct responses and easier items are given after incorrect
responses. At each step, the item in the pool which gives
the most information about the person's ability is chosen
next. (d) The process continues until some stopping point
is reached. The stopping rule may be based upon a specified
level of accuracy (standard error), a maximum number of
items, or a maximum amount of time.
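The four steps above can be sketched as a short loop. This is a minimal illustration, not the algorithm of any particular testing program: it assumes a Rasch (one-parameter logistic) model, a standard normal prior, and a grid-based posterior mean as the ability estimate, none of which are specified in the text. The names `run_cat`, `answer`, and the pool of difficulties are hypothetical.

```python
import math


def p_correct(theta, b):
    """Rasch probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Item information under the Rasch model: p * (1 - p)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def ability_estimate(responses, grid):
    """Posterior mean and SD of ability on a grid, standard normal prior."""
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)
        for b, correct in responses:
            p = p_correct(theta, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


def run_cat(pool, answer, max_items=10, target_se=0.4):
    """Administer items adaptively; `answer(b)` returns True if correct."""
    grid = [x / 10.0 for x in range(-40, 41)]
    remaining = list(pool)
    responses = []
    theta, se = 0.0, float("inf")  # (a) start near medium difficulty
    # (d) stop on accuracy, on a maximum number of items, or on an empty pool
    while remaining and len(responses) < max_items and se > target_se:
        # (c) pick the remaining item with the most information at theta
        b = max(remaining, key=lambda d: item_information(theta, d))
        remaining.remove(b)
        responses.append((b, answer(b)))
        # (b) update the ability estimate and its accuracy
        theta, se = ability_estimate(responses, grid)
    return theta, se, [b for b, _ in responses]


if __name__ == "__main__":
    pool = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
    # Hypothetical examinee who answers correctly when difficulty <= 1.0
    theta, se, given = run_cat(pool, lambda b: b <= 1.0, max_items=5)
    print(given, round(theta, 2), round(se, 2))
```

Under the Rasch assumption the most informative item is the one whose difficulty is closest to the current ability estimate, which is why harder items follow correct responses and easier items follow incorrect ones. A time-based stopping rule is omitted from the sketch.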
Item pools are put together by skilled test developers
who must satisfy such requirements as spanning a certain
difficulty range and making certain items are free of
differential functioning. The items are then calibrated by
using an item response model, and the item's estimated
parameters are tabled for use during the CAT (Wainer &
Kiely, 1987). It follows that testlets must be screened for
inclusion in testlet pools for an adaptive test which is
testlet based rather than item based.
Adaptive tests typically require less time and fewer
items than do traditional tests. CAT can be a more accurate
assessment of proficiency level, particularly at the upper
and lower extremes of the ability scale (Hambleton, Zaal, &
Pieters, 1991).
Testlet Structure
The structure of testlets in a testlet based adaptive
test may be hierarchical, linear, or a mixture of
hierarchical and linear. A hierarchical branching structure
"routes examinees to successive items of greater or lesser
difficulty depending on their previous responses and
culminates in a series of ordered score categories" (Wainer
& Kiely, 1987, p. 190). In a linear structured testlet, all
examinees respond to all items, from the first to the last.
Depending upon the purpose, a test may be mixed by combining
hierarchical and linear testlets. For example, two
hierarchical tests joined linearly are useful when
different content areas are included in the same test, and
each person begins at the same starting point within each
hierarchical testlet. Examples of mixed testlet designs are
described in Wainer and Lewis (1990).
Two-Stage Testing
A routing testlet is a linear testlet, ordered by
difficulty level, which provides an initial estimate of an
individual's ability level. The examinee is then routed to
one of several second-stage tests, chosen as a function of
the estimated ability from the routing test. The second-
stage test may be either a linearly or a hierarchically
structured testlet. This design, called two-stage testing,
is a popular example of the mixed structure.
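The routing step in two-stage testing can be sketched as a simple lookup. The example assumes the routing testlet is scored by number correct and that fixed cut scores separate the second-stage tests; the cut scores, test labels, and function name are hypothetical and do not come from the source.

```python
from bisect import bisect_right


def route(routing_score, cut_scores, second_stage_tests):
    """Return the second-stage test for a routing-testlet score.

    cut_scores must be sorted in ascending order, and there must be
    exactly one more second-stage test than there are cut scores.
    A score equal to a cut score is routed to the higher test.
    """
    if len(second_stage_tests) != len(cut_scores) + 1:
        raise ValueError("need one more second-stage test than cut scores")
    return second_stage_tests[bisect_right(cut_scores, routing_score)]


if __name__ == "__main__":
    tests = ["easy", "medium", "hard"]
    for score in (2, 3, 5, 6):
        print(score, route(score, [3, 6], tests))
```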
Advantages of Testlets
The current algorithmic methods used in CAT are not
without problems. Wainer and Kiely (1987) identified three
difficulties associated with CAT that are alleviated with
the use of testlet-based tests: context effects, item
ordering, and content balancing. Testlets add a
manageability factor in each of these areas.
Manageability
Because a limited number of paths exist, test
developers can more carefully scrutinize individual tests or
paths. Then problems which become evident can be more
easily corrected.
Context Effects
When one item affects a person's response on a
subsequent item, the items are not independent. For
example, the earlier item may give a clue or answer to a
subsequent item. This is not a problem when all examinees
receive the same test. However, with adaptive testing, some
examinees could receive the second item without having the
advantage of the first item which contains the hint.
Certainly those who receive the first item have an unfair
advantage over those who do not. This is sometimes referred
to as a problem of cross-information.
In changing the unit of analysis from the item to the
testlet, the boundary effects are reduced. If a linear
testlet is used, only the first item in the testlet has an
unknown predecessor. Within the testlet, the developers
make certain that the problem of cross-information is
unlikely to occur.
Item Ordering
Traditionally, test items are ordered from easier to
harder items (referred to as power tests). In the power
test sequence, persons with lower ability levels are
encouraged by initial success and work harder on subsequent
items. However, in adaptive testing the item ordering
concept is different. Maximum efficiency in CAT is obtained
when the initial item is of medium difficulty and the
following items are chosen as a function of the person's
responses. Persons below the middle proficiency level do
not have the encouragement of early success.
Starting points are controlled when testlet methodology
is used. When the item ordering within a linear testlet is
determined by a human developer, the ordering effect
problems are lessened. Because examinees with similar
ability levels take almost identical tests with a testlet
model, item order effects are relatively constant within
ability levels, which localizes the effects. The scores of examinees
who are further apart on the ability continuum show less
confusion of relative ability estimates. (When the
hierarchical structure is used, however, the problem of
ordering still exists.)
Content Balancing
Test developers are traditionally careful to follow
formal content specifications in balancing content areas.
In addition, tests are scrutinized for informal content
imbalance; for example, word problems in a section could
refer to too many sports-related topics. In adaptive
testing, where all examinees do not have the same set of
items, it is more difficult to be certain that the content
is balanced, both formally and informally.
By using the testlet model, test developers can be more
certain that both formal and informal content specifications
have been followed. If a test violates the assumption of
unidimensionality because of the inclusion of separate
content areas, then a series of hierarchical testlets solves
the problem of multi-dimensionality.
Testlet models have other advantages over the variable
branching models more commonly used in adaptive testing.
Some of the advantages are discussed in the following
section.
Independence Assumption
Testlet models allow the conditional independence
assumption to remain unviolated. There is independence
between testlets, but not within testlets (Rosenbaum, 1988).
In a recent study, Dillon, Henzel, Klass, LaDuca, &
Peskin (1993) used patient case clusters in a medical
licensure program to determine whether there was higher
intercorrelation between case-related items than random sets
of items. The correlations between items in a case cluster
were significantly higher than correlations between random
sets of items or sets matched on perceived content but from
different cases. The researchers concluded that there was a
high level of dependence within case clusters.
Review of Items
In an item-based CAT, the efficiency of the adaptive
algorithm is compromised if examinees are allowed to return
to items previously answered and revise their responses.
However, if linear structured testlets are used as the basis
of the CAT, examinees can be allowed to review items within
the current testlet without compromising adaptive
efficiency. After the review, the next testlet is
adaptively chosen. This corresponds to a paper-and-pencil
test in which an examinee can review within one section, but
not after the next section is started (Wainer, 1993).
In short, testlets provide a middle ground between
traditional test theory and current adaptive testing
methodology. In traditional test theory, the unit of
analysis is the entire test (too big). In variable
branching adaptive testing, the unit of analysis is the
single item (too little). In testlet models, the unit of
analysis is a bundle of items (just right).
Other Testlet Issues
Testlet Reliability
Reliability of testlet-based tests is overestimated
when item-based methods are used to compute reliability.
Thissen, Steinberg and Mooney (1989) analyzed the responses
of 3,866 examinees on a reading comprehension test. The
test was made up of 22 items, divided unequally among 4
passages. Clearly, the item-based reliability estimates
(0.86 to 0.88) are much higher than the testlet-based
estimates (0.76 to 0.80). Each testlet-based reliability is
0.08-0.12 lower than the corresponding item-based estimate.
For the same example, the Spearman-Brown formula was
used to estimate how many testlets would need to be added to
make up the difference in reliability. It was found that
the test length had to be doubled to increase reliability to
a comparable testlet-based reliability of 0.87. Of course,
the testlet-based reliability is more appropriate and far
more accurate than item-based reliability. As more testlet-
based tests are used it will be important to use appropriate
statistical methods for estimating reliability (Sireci et
al., 1991).
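The Spearman-Brown calculation mentioned here can be reproduced approximately with a short sketch. The exact inputs used in the cited study are not given in this passage, so the values below (a testlet-based reliability of 0.76 to 0.80 and a target of 0.87) are taken from the surrounding text for illustration only, and the function names are hypothetical.

```python
def spearman_brown(reliability, k):
    """Predicted reliability when test length is multiplied by k."""
    return k * reliability / (1.0 + (k - 1.0) * reliability)


def lengthening_factor(reliability, target):
    """Factor by which test length must grow to reach the target reliability."""
    return target * (1.0 - reliability) / (reliability * (1.0 - target))


if __name__ == "__main__":
    for rho in (0.76, 0.80):
        doubled = spearman_brown(rho, 2)
        needed = lengthening_factor(rho, 0.87)
        print(f"rho={rho:.2f}  doubled={doubled:.3f}  factor for 0.87={needed:.2f}")
```

With these inputs, doubling the test length gives predicted reliabilities of roughly 0.86 to 0.89, which is in line with the statement that the test had to be about twice as long to reach 0.87.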
Polytomous Models
Previous examples have used dichotomous models, where
each test item is scored either right or wrong. But in
current item-analysis research, several polytomous models
are being tested.
A comparison of some dichotomous models and a
polytomous testlet model was made in the context of
investigating testlet reliabilities (Crehan, Sireci,
Haladyna & Henderson, 1993). One finding of the study
reinforced previous research in that testlet reliabilities
were lower than single-item reliabilities. But the most
significant finding involved comparison of polytomous to
dichotomous test information functions. The polytomous
model provided much more information at the
lower end of the proficiency scale. This is particularly
significant for cut-score certification decisions, where
many decisions are made at the precision level which showed
the most differentiation.
Differential Item Functioning
One of the most prominent current issues in
psychometrics is test fairness. Test developers are asked
to assure that test items function equally for all examinees
of the same proficiency level, regardless of group
membership. The phrase differential item functioning, or
DIF, refers to the study of that functionality.
If each test item in a test had exactly the same item response function in every group, then people at any given level θ of ability or skill would have exactly the same chance of getting the item right, regardless of their group membership. Such a test would be completely unbiased. This remains true even though some groups may have a lower mean θ, and thus lower test scores, than another group. In such a case, the test results would be reflecting an actual group difference and not item bias. (Lord, 1980, p. 212)
Originally called item bias, the more neutral term of
differential item functioning better describes the concept.
If an item performs differently for two groups, it does not
necessarily mean it is showing prejudice against one of the
groups, a connotation which often arises from the term bias.
It may simply mean that different traits are being measured.
DIF focuses upon statistical properties of a set of test
responses, with the idea of having a unidimensional test, or
that each item measures the same trait or ability for all
examinees.
DIF is a relative term. An item may perform differently for one group of examinees relative to the way it performs for another group of examinees. The examinee group of interest is the FOCAL group, and the group to which its performance on the item is being compared is the REFERENCE group. In general, there will be several FOCAL/REFERENCE pairs of groups for which DIF analysis can be made. (Holland & Wainer, 1993, p. xiv)
If the difference in performance on an item is measured
between unmatched groups, the result is not DIF, but instead
is a measure of impact. Impact has been defined as "the
difference between the focal group and the reference group
of the probability of getting the studied item correct"
(Wainer, 1993, p. 134).
Ordinarily, all the items on a test are examined for
DIF, one at a time, with the current item of interest being
called the studied item. One of the basic underlying
concepts of DIF methodologies is that examinees of equal
ability are being compared on responses to an item. The
criterion used to match individuals between groups is usually
the total test score, which is assumed to be the most
accurate measure of the trait or ability being measured by
the item being studied.
Differential Item Functioning Methods
Many statistically rigorous and efficient procedures
have been developed and used in the past few years for the
detection of DIF. In this section, some of the more common
methods for detection of single-item DIF for dichotomously
(right/wrong) scored items are summarized. Each method
falls into one of two categories: those based upon observed
score and those based upon latent trait.
Observed Score Methods
One of the first observed score methods, developed by
Cleary and Hilton in 1968, was the analysis of variance
(ANOVA) procedure, which uses interaction between item and
group to flag the presence of DIF. Although the ANOVA
method is easy to use and understand, it requires large
sample sizes and may be inaccurate when groups differ in
achievement level (Camilli & Shepard, 1987).
In the 1970s, a method called delta-plot or transformed
item-difficulty (TID) was developed by Angoff (1972). The
delta-plot method describes a set of items as unbiased if
the item difficulty values (p-values) for each group are
perfectly correlated. (P-values are defined as the
percentage of examinees answering an item correctly.) Using
classical test theory, the p-values for each group are
calculated and transformed into deltas, using a mean of 13
and a standard deviation of 4. A 45-degree ellipse is
fitted to the bivariate graph of the pairs of deltas (one
point for each item), revealing DIF items as outliers. The
distance between an item and the major axis of the ellipse
indicates the amount of DIF. Although delta-plot analyses
are simple and inexpensive, highly discriminating items may
be flagged falsely as DIF items (Angoff, 1982).
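The delta-plot steps described above can be sketched as follows. This is a minimal illustration, not Angoff's published program: the function names are hypothetical, the direction of the delta scale (harder items receive larger deltas) is an assumed convention, and the major axis is taken to be the principal axis of the bivariate scatter of deltas.

```python
from math import sqrt
from statistics import NormalDist, mean


def delta(p):
    """Transform a proportion correct p (0 < p < 1) to the delta scale.

    Assumed convention: mean 13, SD 4, with harder items receiving larger deltas.
    """
    return 13.0 + 4.0 * NormalDist().inv_cdf(1.0 - p)


def delta_plot_distances(p_ref, p_foc):
    """Signed perpendicular distance of each item from the major axis.

    p_ref, p_foc: p-values of the same items in the reference and focal groups.
    Assumes the two sets of deltas are not perfectly uncorrelated (sxy != 0).
    """
    x = [delta(p) for p in p_ref]
    y = [delta(p) for p in p_foc]
    mx, my = mean(x), mean(y)
    sxx = sum((v - mx) ** 2 for v in x)
    syy = sum((v - my) ** 2 for v in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # Slope and intercept of the principal (major) axis of the scatter.
    slope = (syy - sxx + sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    intercept = my - slope * mx
    return [(slope * a + intercept - b) / sqrt(slope ** 2 + 1.0)
            for a, b in zip(x, y)]
```

Items with large absolute distances would be the candidates for DIF under this approach; any flagging threshold is left to the analyst.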
A currently popular observed-score method which is
statistically powerful, easily understood, and inexpensive
to compute is the Mantel-Haenszel (MH) procedure. This
procedure was originally used in a study of disease by
Mantel and Haenszel (1959) and applied to DIF analysis by
Holland and Thayer (1988).
The MH procedure divides the examinees into several
intervals which are normally based upon the total test
score. The focal and reference groups are considered to be
matched on the ability most relevant to the ability measured
by the studied item. For each interval, a 2 x 2
contingency table is formed which shows frequencies of
correct and incorrect items for the focal and reference
groups. The ratio of the odds that the reference group
answered correctly to the odds that the focal group answered
correctly is calculated for each interval. The procedure
then estimates a common odds ratio across all matched
categories. The MH statistic (with a continuity correction)
is distributed approximately as a chi-square statistic with
one degree of freedom (Dorans & Holland, 1993).
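As an illustration of the computation just described, the sketch below accumulates the common odds ratio and the continuity-corrected chi-square over the matched intervals. The function name and the input layout (one tuple of four cell counts per interval) are assumptions made for this example.

```python
def mantel_haenszel(tables):
    """Common odds ratio and MH chi-square (with continuity correction).

    tables: one (a, b, c, d) tuple per matched score interval, where
      a = reference correct, b = reference incorrect,
      c = focal correct,     d = focal incorrect.
    """
    num = den = sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t < 2:
            continue  # an interval this small contributes no information
        n_ref, n_foc = a + b, c + d
        m1, m0 = a + c, b + d  # total correct, total incorrect
        num += a * d / t
        den += b * c / t
        sum_a += a
        sum_e += n_ref * m1 / t  # expected reference-correct count
        sum_v += n_ref * n_foc * m1 * m0 / (t * t * (t - 1))
    alpha = num / den
    chi2 = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v
    return alpha, chi2
```

A common odds ratio near 1 indicates little difference between the groups after matching; the chi-square is referred to a chi-square distribution with one degree of freedom.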
The standardization method (Dorans & Kulik, 1986) is
similar to the Mantel-Haenszel procedure. However,
standardization uses differences between the p-values of the
groups at each interval, and it applies weights to the p-
value differences for each interval (Angoff, 1993).
Although large sample sizes are required, standardization is
a "flexible, easily understood descriptive procedure that is
particularly suited for assessing plausible and implausible
explanations of DIF", according to Dorans and Holland (1993,
p. 38).
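A minimal sketch of the standardized p-value difference follows. The choice of focal group counts as the weights is an assumption adopted for illustration (other weights are possible), and the function name and input layout are hypothetical.

```python
def std_p_dif(levels):
    """Weighted mean difference in p-values (focal minus reference).

    levels: one (focal_correct, focal_n, ref_correct, ref_n) tuple per
    matched score interval. Weights are the focal group counts (assumed).
    """
    num = 0.0
    weight_sum = 0.0
    for focal_correct, focal_n, ref_correct, ref_n in levels:
        if focal_n == 0 or ref_n == 0:
            continue  # no comparison is possible at this level
        weight = focal_n
        num += weight * (focal_correct / focal_n - ref_correct / ref_n)
        weight_sum += weight
    return num / weight_sum
```

A negative value indicates that, after matching, the focal group answered the studied item correctly less often than the reference group.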
The MH and standardization approaches are both based
upon earlier chi-square procedures offered by
Scheuneman (1979), and modified by Marascuilo and Slaughter
(1981), and by Shepard and Camilli (1981). The major
improvement of MH and standardization over previous chi-
square procedures was in providing a measure of the amount
of DIF.
Mellenberg (1982) described a chi-square procedure
using loglinear and logit models for contingency tables.
Unlike the Mantel-Haenszel and other chi-square methods, the
loglinear/logit procedure is able to make a distinction
between uniform and non-uniform DIF, which are discussed
later.
In 1990, Swaminathan and Rogers suggested the
application of logistic regression analysis for the
detection of DIF. The probability of a correct response is
given by a logistic formula which uses a regression equation
as the exponent of e. The coefficients in the regression
equation, estimated with the maximum likelihood method, are
used as indicators of DIF. Like Mellenberg's (1982) method,
the logistic regression procedure is able to detect non-
uniform DIF. As a further improvement, it does not break
the continuous ability parameter into intervals. Treating
ability as continuous rather than categorical results in a
more powerful method.
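The logistic regression model described above can be written out as follows. The symbols are assumptions chosen for this illustration: u is the scored item response, θ the observed ability (typically total score), g a group indicator, and the β's the regression coefficients.

```latex
P(u = 1 \mid \theta, g) = \frac{e^{z}}{1 + e^{z}},
\qquad
z = \beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 (\theta g)
```

Under this notation, a nonzero β₂ with β₃ equal to zero corresponds to uniform DIF, and a nonzero β₃ corresponds to non-uniform DIF.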
Latent Trait Methods
The item response theory (IRT) model is the foundation
for latent trait methods. The three-parameter logistic (3-PL)
model is comprehensive, addressing the differences
between groups with respect to item difficulty,
discrimination, and guessing. Other methods ignore DIF with
respect to guessing and discrimination, making the three-
parameter IRT method the theoretically preferred method,
over observed score methods and other latent trait methods.
Its use is inhibited, however, because of the requirements
for large sample sizes, special computer programs, and
costly run times, as well as the complexity of conceptual
understanding and the difficulty in meeting assumptions.
Item response theory is modeled by an s-shaped item
characteristic curve (ICC), where the abscissa represents
the latent ability continuum and the ordinate shows the
probability of answering the item correctly. Each of the
three parameters is represented visually. The point of
inflection lies directly above the ability level equal to
item difficulty b; the slope at the point of inflection is
proportional to discrimination a; and the lower asymptote
represents the probability c that an examinee with no
ability will correctly guess the answer.
If the ICCs for the two groups differ, it is assumed
that the item contains DIF. The most common measures of the
magnitude of DIF used with the three-parameter method are
the area between the curves and the tests of equality of the
three parameters across the groups.
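The item characteristic curve and the area measure described above can be sketched as follows. This is an illustration under stated assumptions: the scaling constant (often written D = 1.7) is omitted, the area is approximated numerically over a finite ability range, and the function names are hypothetical.

```python
from math import exp


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the 3-PL model.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))


def area_between_iccs(ref, foc, lo=-8.0, hi=8.0, steps=4000):
    """Unsigned area between two ICCs, by the trapezoidal rule.

    ref, foc: (a, b, c) parameter tuples for the reference and focal groups.
    """
    width = (hi - lo) / steps
    total = 0.0
    prev = abs(icc_3pl(lo, *ref) - icc_3pl(lo, *foc))
    for i in range(1, steps + 1):
        theta = lo + i * width
        cur = abs(icc_3pl(theta, *ref) - icc_3pl(theta, *foc))
        total += 0.5 * (prev + cur) * width
        prev = cur
    return total
```

When the two groups share the same a and c, the area is approximately (1 - c) times the difference in b, which gives a quick check on the numerical result.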
Another latent trait model, the Rasch model (similar to
the one-parameter IRT model) considers only the difficulty
parameter. The discrimination parameter is set to a
constant, implying that all items discriminate equally. The
Rasch model is less complex, less expensive to run, and does
not require large sample sizes.
Uniform and Non-Uniform DIF
A major goal of modern test theory is
unidimensionality, or having all items on the test measure
only one trait. It has been hypothesized that DIF occurs
when the item is measuring one or more secondary traits for
one of the groups. An example is an item on a mathematics
test that inadvertently measures an ancillary trait of a
verbal nature (Mellenberg, 1982; Swaminathan & Rogers,
1990).
When two groups differ consistently on the primary and
secondary traits, uniform DIF occurs. However, if the
abilities are inconsistent between the two traits, non-
uniform DIF is present. As Mellenberg (1982) described the
distinction, uniform DIF "means that the group difference in
the second trait is constant across the main trait.
Nonuniform (DIF) implies that the group difference in the
additional ability depends on the main ability" (p. 115).
In IRT terms, non-uniform DIF is evident when the trace
lines (or ICCs) are not parallel. A further distinction can
be made between ordinal and disordinal non-uniform DIF. The
ICCs cross in the middle of the curves in the disordinal
case. If, however, the lines cross at the lower or higher
end of the ability continuum, or even past either end, then
ordinal non-uniform DIF is indicated.
A major disadvantage of the MH procedure compared to
the logistic based chi-square procedures is that MH is less
powerful in the detection of non-uniform DIF. However, the
distinction is lessened in the case of ordinal non-uniform
DIF (Swaminathan & Rogers, 1990).
Differential Testlet Functioning
As more tests are developed with the testlet structure,
it becomes increasingly important to investigate
differential functioning using the testlet rather than the
item as the unit of analysis. An essential component in
testlet-based computerized adaptive testing is a testlet
pool containing DIF-free testlets. Testlets must be
screened for differential functioning before being approved
for the testlet pool.
Having only DIF-free items in a testlet does not
necessarily mean that the testlet as a whole is free of
differential functioning, called differential testlet
functioning or DTLF. Only one research study regarding DTLF
has been reported.
The DTLF Study
Attempts to define differential testlet functioning and
to derive a statistical method to detect it were made by
Wainer, Sireci and Thissen in 1991. The nominal response
model developed by Bock (1972) was used by the researchers.
First, the model was fit to both the reference and focal
populations assuming there was no DTLF. Then the same model
was fit allowing DTLF to exist. If there was not a
significant difference between the two models, it was
assumed that no DTLF existed.
Wainer et al. (1991) discussed three advantages of
analyzing testlets for DTLF over simply using individual
item analysis methods: (a) The model for analysis matches
the manner in which the test is constructed. If items are
to be administered as a unit, then the items should be
analyzed that way. (b) Consideration of an aggregate
measure of DIF in testlet-structured tests allows small
amounts of item DIF to cancel within the testlet. It should
be emphasized that cancel out means something quite
specific. It means that there will be no DIF at every score
level within the testlet. (c) Applying DIF analysis at the
testlet level may uncover some DIF that was not evident at
the item level. "The increased statistical power of dealing
with DIF at the testlet level provides us with another tool
to ensure fairness" (Wainer et al., 1991, p. 199).
Polytomous Response DIF
In the past few decades, most DIF research has been
focused upon dichotomous item scoring, where the item is
simply marked right or wrong. Recently, however, more
studies involving polytomous scoring of items have appeared,
particularly because of a new emphasis on performance
assessments. In polytomous scoring, an item is given a
number-correct score or is classified as one of several
unordered choices rather than a right/wrong score. Miller
and Spray (1993, p. 107) defined polytomous responses as
"item responses which are scored on a nominal or ordinal
scale and which consist of more than two categories."
Statistical methods offered for polytomous DIF analysis
range from entirely new methods to modifications of
dichotomous DIF procedures.
The polytomous model used by Wainer et al. (1991) in
the DTLF study, called Bock's nominal model, was developed
for scoring nominal level categorical responses by Bock
(1972). The testlet score was simply the number of items
correct in the testlet, ranging from zero to the maximum
number of items. Wainer et al. expressed an interest in
expanding the information provided in the testlet score by
using possible patterns of responses instead of number-
correct, but the raw score was chosen as a simpler starting
point, because the area of DTLF research is just beginning
to develop (Thissen et al., 1989).
In an effort to limit the scope of this study, only
observed score models were considered for comparison. A
discussion of multiple-category latent trait models was
offered by Thissen and Steinberg (1986), who organized the
methods into a proposed taxonomy.
Polytomous Observed-Score Procedures
Recently developed polytomous DIF models, which have
been applied to performance assessment DIF analysis, can be
applied directly to DTLF research, as in the Wainer et al.
(1991) study with Bock's (1972) model. The testlet score
can be used in place of the performance assessment score.
Mantel and GCMH
Two polytomous extensions of the dichotomous MH method
were explored for analysis of DIF in performance assessments
by Zwick et al. (1993). First, the Mantel Score Test of
Conditional Independence proposed by Mantel (1963) takes
into account the ordering of responses. The accompanying
statistic has an approximate chi-squared distribution with
one degree of freedom. Second, the Generalized Cochran-
Mantel-Haenszel statistic (Mantel & Haenszel, 1959; Somes,
1986), termed GCMH, is a multivariate generalization of the
dichotomous MH. The GCMH considers responses at the nominal
level only (Agresti, 1990, pp. 234-235, 283-284). Again,
the MH-based procedures do not have the power to detect non-
uniform DIF, and both require that the ability parameter be
treated as categorical and unordered.
Logistic Regression
Three polytomous adaptations of the logistic regression
procedure for dichotomous items have been proposed by
Agresti (1990). Each treats response categories as ordered,
as in the Mantel Score Test, rather than nominal, as in the
Bock and GCMH models. The adaptations are complex and
require a separate model estimation for each ordered
category (minus one), which makes interpretation of results
difficult. However, two characteristics which make logistic
regression a powerful choice are ability to detect non-
uniform DTLF and treating ability as continuous rather than
categorical (Agresti, 1990; Hosmer & Lemeshow, 1989; Miller
& Spray, 1993).
Logistic Discriminant Function Analysis
A proposed polytomous method which has the powerful
advantages of logistic regression without the complexity is
logistic discriminant function analysis. Again, the ability
variable is treated as continuous, and non-uniform
differential functioning is detectable. The difference is
in the choice of the dichotomous dependent variable.
Whereas logistic regression uses item response (dichotomous)
as the dependent variable, logistic discriminant analysis
uses group (dichotomous). The discriminant procedure
requires only one regression equation per testlet, because
the testlet response is an independent variable. Other
independent variables are ability score (the matching
variable) and ability-by-response interaction. The
inclusion of an interaction term is used to flag non-uniform
DTLF. The discriminant procedure is very flexible and
allows other independent variables, such as an external
matching variable (Miller & Spray, 1993). Because it is a
logistic procedure, the assumptions of multivariate
normality and equal variance-covariance matrices are not
required, as they are in linear discriminant analysis
(Norusis, 1990).
T-Test
Another set of polytomous methods is the combined t-test
statistics (called HW1 and HW3), proposed by Welch and
Hoover (1993). As in the MH-based procedures, an ability
score is divided into categories. Based upon an assumption
of homogeneity of variances of scores at each ability level,
the HW1 statistic tests the difference between the means of
the focal and reference groups summed across the levels.
The HW3 procedure differs in use of a weighting procedure to
balance unequal sample sizes at each ability level. When,
in a simulated study, the Mantel Score Test was used for
comparison, the HW3 statistic appeared to control Type I
errors as well as the Mantel, but demonstrated more power in
identifying DIF items.
Thick and Thin Matching
In the MH and other chi-square procedures, there is
some question about using each possible score to stratify
the ability continuum, called thin matching. (For example,
if the possible scores range from 0 to 30, then there are 31
ability levels.) In MH procedures some data can be wasted
by overly fine matching, because any row or column with a
0 frequency cell is eliminated in calculations.
Creating fewer levels by pooling test scores, termed
thick matching, was shown to be a more accurate predictor of
DIF when the MH chi-square statistic was used, in a recent
study (Donoghue & Allen, 1993). The results may be
generalized to the Mantel chi-square statistic, because a
dichotomous use of the Mantel reduces to the MH statistic
without the continuity correction (Zwick et al., 1993).
In the Donoghue and Allen simulation study the
researchers compared different degrees of pooling, or
thickness of matching. One method, the total percentage
matching strategy, was the most effective for the MH chi-
square procedure. In total percentage matching, similar
numbers of examinees are allocated to each pooled level of
the total test score. For example, for five levels, score
intervals are combined to approximate quintiles of the
combined sample or the focal group sample.
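One way to pool score intervals so that each level holds a similar number of examinees is sketched below. It is a hypothetical illustration of the idea, not the algorithm used by Donoghue and Allen (1993); the function name and the rule for assigning tied scores are assumptions.

```python
from collections import Counter


def pool_scores(scores, n_levels=5):
    """Map each distinct total score to a pooled level, 0 to n_levels - 1.

    Levels approximate equal-count quantiles of the supplied sample
    (for example, the combined sample or the focal group sample).
    Examinees with the same score always share a level.
    """
    counts = Counter(scores)
    total = len(scores)
    level_of = {}
    cumulative = 0
    for score in sorted(counts):
        # The level is set by the proportion of examinees scoring lower.
        level_of[score] = min(n_levels - 1, cumulative * n_levels // total)
        cumulative += counts[score]
    return level_of
```

With ten distinct scores held by one examinee each and five levels, adjacent pairs of scores are pooled, giving two scores per level.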
CHAPTER 3
METHOD OF RESEARCH
The design of this study called for variation of five
major factors, as shown in Table 1. The first variable
factor was the DTLF detection model, with two polytomous
models chosen. Next, the issue of uniform and non-uniform
DTLF was explored by comparing models within the logistic
discriminant procedure. Another factor was the strategy for
grouping items to form testlets. Also, single items within
each testlet were analyzed for DIF and the results were
compared with overall testlet DTLF. Finally, three sample
sizes were compared. The chi-square statistics and
associated probabilities were reported for each possible
combination of the factors.

Table 1
Factors Varied in the Study

| Factors | Number | Variations |
|---|---|---|
| DTLF detection model | 2 | Mantel score test model; Logistic discriminant analysis model |
| DTLF/DIF consistency between groups | 2 | Uniform; Non-Uniform |
| Class of testlets | 3 | Explicit: Common passage; Implicit: Common content; Implicit: Common process |
| Unit of analysis | 2 | Single items within a testlet; Testlet |
| Sample size | 3 | 500; 1000; 2000 |

Mantel Score Test Procedure
A polytomous extension of the dichotomous Mantel-
Haenszel (MH) procedure (Mantel & Haenszel, 1959) was
proposed in 1963 by Mantel. The test of association between
groups matched on a conditioning variable was developed for
the case of ordinal categories. In the case of testlets,
each testlet on the test is individually scrutinized for
DTLF using the Mantel score statistic. The testlet being
investigated is termed the studied testlet.
As in the other chi-square based procedures, the
combined sample is divided into stratifications based upon a
conditioning variable assumed to represent overall ability
in the trait being measured. Then the focal and reference
groups are considered to be matched on ability. Most often
the conditioning variable is the total test score, although
an external criterion is sometimes used. Typically, the
sample is broken into ability groups based upon scores on
the conditioning variable, with the number of strata being
the number of possible scores. Of course, there are other
stratification options, as explained in the literature
review section in Chapter 2 (Donoghue & Allen, 1993).
The score for each testlet is the number of items
answered correctly. If the testlet has g items, then there
are (g + 1) possible response scores, allowing for the
possibility of no answers correct. An index (in this
example J) which ranges from 1 to (g + 1) is used, so that
the zero category does not have zero weight. Weighting by
score index is used to account for ordinal scores in the
Mantel procedure.
Frequencies on a studied testlet are organized into a
2 x J x K contingency table, where J represents ordered
response categories, and K is the stratification (ability)
level. The 2 x J portion for the kth stratification level
is illustrated in Table 2. Cell frequencies (n for number
of subjects) for a subset of examinees who are considered to
be matched on the overall ability of interest are also shown
in Table 2. The plus sign (+) indicates summation across a
row or column. There is a 2 x J table at each of the K
ability levels for the studied testlet.
Ordering of response categories is taken into account
by assigning weights to the focal group frequency in each
category, according to an ordered index for that category.
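Table 2
Frequencies: kth Level of Ability Variable
(Columns $y_1$ through $y_J$: Ordered Index to Testlet Score)

| Group | $y_1$ | $y_2$ | $y_3$ | ... | $y_J$ | Total |
|---|---|---|---|---|---|---|
| Focal | $n_{F1k}$ | $n_{F2k}$ | $n_{F3k}$ | ... | $n_{FJk}$ | $n_{F+k}$ |
| Reference | $n_{R1k}$ | $n_{R2k}$ | $n_{R3k}$ | ... | $n_{RJk}$ | $n_{R+k}$ |
| Total | $n_{+1k}$ | $n_{+2k}$ | $n_{+3k}$ | ... | $n_{+Jk}$ | $n_{++k}$ |
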
The summary chi-square proposed by Mantel (1963), with one
degree of freedom, is

$$\text{Mantel } \chi^2 = \frac{\left( \sum_k F_k - \sum_k E(F_k) \right)^2}{\sum_k \mathrm{Var}(F_k)} , \tag{1}$$

where $F_k$ is the weighted focal group frequency, defined as

$$F_k = \sum_j y_j n_{Fjk} , \tag{2}$$

where $y_j$ is the ordered index to the testlet score. The
expectation of $F_k$ is

$$E(F_k) = \frac{n_{F+k}}{n_{++k}} \sum_j y_j n_{+jk} , \tag{3}$$

and the variance of $F_k$ is

$$\mathrm{Var}(F_k) = \frac{n_{R+k} \, n_{F+k}}{n_{++k}^2 \left( n_{++k} - 1 \right)} \left[ \left( n_{++k} \sum_j y_j^2 n_{+jk} \right) - \left( \sum_j y_j n_{+jk} \right)^2 \right] . \tag{4}$$

The Mantel statistic follows a chi-square distribution
with one degree of freedom. The null hypothesis states that
the ratio of the odds of answering the item correctly for
the reference group to that of the focal group is one. A
rejection of the H0 suggests that the focal and reference
groups differ in performance on the studied item even when
matched on ability. In other words, a large chi-square with
a small probability flags a testlet as potentially
containing DTLF (Agresti, 1990; Mantel, 1963; Welch &
Hoover, 1993; Zwick, Donoghue, & Grima, 1993).
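The computation in Equations 1 through 4 can be sketched as follows. This is an illustrative implementation, not the program used in the study; the function name and the input layout (one pair of frequency lists per ability level) are assumptions.

```python
def mantel_score_test(strata, scores=None):
    """Mantel chi-square (1 df) for a 2 x J x K table.

    strata: one (focal_counts, ref_counts) pair per ability level k,
            each a list of J frequencies in score order.
    scores: ordered index values y_j; defaults to 1, 2, ..., J.
    """
    sum_f = sum_e = sum_v = 0.0
    for focal, ref in strata:
        y = list(scores) if scores is not None else list(range(1, len(focal) + 1))
        n_f, n_r = sum(focal), sum(ref)
        n = n_f + n_r
        if n < 2 or n_f == 0 or n_r == 0:
            continue  # a level without both groups carries no information
        col = [f + r for f, r in zip(focal, ref)]          # column totals n_+jk
        s1 = sum(yj * c for yj, c in zip(y, col))          # sum of y_j n_+jk
        s2 = sum(yj * yj * c for yj, c in zip(y, col))     # sum of y_j^2 n_+jk
        sum_f += sum(yj * f for yj, f in zip(y, focal))    # F_k, Equation 2
        sum_e += n_f / n * s1                              # E(F_k), Equation 3
        sum_v += n_r * n_f / (n * n * (n - 1)) * (n * s2 - s1 * s1)  # Equation 4
    return (sum_f - sum_e) ** 2 / sum_v                    # Equation 1
```

With only two response categories, the value equals the MH chi-square computed without the continuity correction, which provides a convenient check.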
Logistic Discriminant Function Analysis Procedure
The Mantel procedure assumes no three-factor
interaction and therefore is not powerful in the detection
of non-uniform differential functioning (Swaminathan &
Rogers, 1990; Welch & Hoover, 1993). Logistic discriminant
function analysis was used to detect uniform and non-uniform
DIF in a recent empirical study by Miller and Spray (1993)
which included one polytomously scored section.
In the discriminant procedure, the probability of group
membership G (focal and reference) is modeled as a function
of two explanatory variables: ability score X, and testlet
response score U.
The full logistic discriminant model can be written as

$$\mathrm{Prob}(G \mid X, U) = \frac{e^{Z}}{1 + e^{Z}} , \tag{5}$$

where Z is the linear combination

$$Z = \beta_0 + \beta_1 X + \beta_2 U + \beta_3 X * U . \tag{6}$$

The β's are coefficients estimated from the data, and X * U
represents the interaction between the ability score X and
the testlet score U.
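For a given set of coefficients, the modeled probability of group membership can be evaluated as in the sketch below. The function name and any coefficient values passed to it are hypothetical; in practice the coefficients come from maximum likelihood estimation.

```python
from math import exp


def prob_group(x, u, b0, b1, b2, b3=0.0):
    """Probability of group membership given ability x and testlet score u.

    Z = b0 + b1*x + b2*u + b3*x*u; setting b3 = 0 gives the model
    without the interaction term.
    """
    z = b0 + b1 * x + b2 * u + b3 * x * u
    # Two algebraically equal forms of e^Z / (1 + e^Z), chosen to avoid overflow.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)
```

When all coefficients on x and u are zero and the intercept is zero, the function returns 0.5, meaning neither ability nor testlet score helps to discriminate between the groups.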
To assess the fit of the model, a likelihood statistic
$G^2$ is calculated. Norusis provided a clear description of
the model fit statistic.
The probability of the observed results given the parameter estimates is known as the likelihood. Since the likelihood is a small number less than 1, it is customary to use -2 times the log of the likelihood (-2LL) as a measure of how well the estimated model fits the data. A good model is one that results in a high likelihood of the observed results.
To test the null hypothesis that the observed likelihood does not differ from 1 (the value of the likelihood for a model that fits perfectly), you can use the value of -2LL. Under the null hypothesis that the model fits perfectly, -2LL has a chi-square distribution with N - p degrees of freedom, where N is the number of cases and p is the number of parameters estimated. (Norusis, 1990, p. 52)
Testing for non-uniform DTLF involves first fitting the
full model which combines Equations 5 and 6. Then the model
is reduced by deleting the interaction term, and Equation 6
is replaced by

$$Z = \beta_0 + \beta_1 X + \beta_2 U . \tag{7}$$

The significance of $\beta_3$ is tested by calculating the
difference of fit between the full and reduced models.
Next, the test for uniform DTLF involves fitting the
null model, which contains only the ability score X. In the
null model, Z becomes
$$Z = \beta_0 + \beta_1 X . \qquad (8)$$
If there is a significant difference of fit between the null
and reduced models, the testlet is flagged as potentially
containing uniform DTLF.
For each model the $G^2$ statistic (-2LL) is calculated.
The difference in $G^2$ values between a pair of nested models
(symbolized as $G^2_{\mathrm{diff}}$) tests the null hypothesis that the
coefficient deleted at the last step is zero. The $G^2_{\mathrm{diff}}$
statistic is distributed as chi-square with one degree of freedom, and is
comparable to the F-change test in multiple regression
(Miller & Spray, 1993; Norusis, 1990).
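The model comparison described above can be illustrated with a short Python sketch. It assumes the three -2LL values have already been obtained from some logistic regression routine; the function names and the numeric values in the usage line are hypothetical and are not taken from the study. For one degree of freedom the chi-square upper-tail probability has the closed form erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))

def dtlf_flags(g2_full, g2_reduced, g2_null, alpha=0.001):
    """Compare the -2LL values of the full, reduced, and null models.

    Returns, for each test, a tuple (difference in -2LL, probability, flagged).
    """
    diff_nonuniform = g2_reduced - g2_full   # tests beta_3 = 0 (interaction)
    diff_uniform = g2_null - g2_reduced      # tests beta_2 = 0 (testlet score)
    p_nonuniform = chi2_sf_1df(diff_nonuniform)
    p_uniform = chi2_sf_1df(diff_uniform)
    return {
        "non_uniform": (diff_nonuniform, p_nonuniform, p_nonuniform < alpha),
        "uniform": (diff_uniform, p_uniform, p_uniform < alpha),
    }

# Hypothetical -2LL values, for illustration only.
result = dtlf_flags(g2_full=1370.2, g2_reduced=1371.0, g2_null=1384.3)
print(result)
```

With these made-up values the interaction term would not be flagged, while the testlet score term would be flagged at the 0.001 level.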
NELS:88 Data Set
Demographic and test data for this study were obtained
from the National Education Longitudinal Study of 1988
(NELS:88), an existing publicly accessible data set
sponsored by the National Center for Education Statistics
(NCES). Four study components constitute the base year
design: surveys and tests of students, and surveys of
parents, school administrators, and teachers.
A two-stage stratified probability design was used to
select a nationally representative sample of schools and
students for the NELS:88 data set. The base year sample is
composed of approximately 24,600 eighth graders who were
sampled from 1,052 schools throughout the United States.
The NCES long-range plan is to monitor the transition of the
students through high school and then to college or
employment.
The NELS:88 User's Manual describes confidentiality
safeguards:
The NELS:88 base year data is released in accordance with the provisions of the General Education Provision Act (GEPA) and the Carl D. Perkins Vocational Education Act. The GEPA assures privacy by ensuring that respondents will never be individually identified.
To ensure that the confidentiality provisions contained in PL 100-297 have been fully implemented, procedures commonly applied for disclosure avoidance in other government-sponsoring surveys were used in preparing the data tape associated with this manual. These include suppressing, abridging, and recoding identifiable variables. Every effort has been made to provide the maximum research information that is consistent with reasonable confidentiality protections. (NCES, 1990. p. iv)
The NELS:88 data include student responses to a battery
of tests in four subject matter areas: reading,
mathematics, science, and social studies. The tests include
21, 40, 25, and 30 items, respectively.
The NELS:88 test contains eight common passage, or
explicit, testlets. Alternatively, implicit testlets may be
formed by grouping items according to common content area or
common process. Test specification charts in the NCES
(1991) report list content areas for each item in all four
sections, and process areas for items in three of the
sections. For example, Item 28 in the mathematics section
is part of the arithmetic content area and the problem
solving process area.
The number of testlets in each testlet classification
and in each of the sections is shown in Table 3. In the
reading section, the defined content areas are the same as
the common passage testlets. The science section contains
no explicit testlets, and no process areas are defined for
the social studies section.
Table 3
NELS:88 Testlet Types in Each Section

                        Testlet Type
Section          Common     Common     Common
                 passage    content    process
Reading             5          —          3
Mathematics         2          5          3
Science             —          4          3
Soc. stud.          1          3          —
Total               8         12          9
The number of items in each of the testlets is shown
in Table 4. Generally, the common passage testlets have
fewer items than the common content or common process
testlets.
Table 4
NELS:88 Number of Items in Testlets

                            Testlet Type
Section        Common       Common         Common      Section
               Passage      Content        Process     Total
Reading        5,3,6,4,3    —              4,14,3       21
Math.          2,2          11,4,19,2,4    17,19,4      40
Science        —            8,7,2,8        8,10,6       25
Soc. stud.     5            3,14,13        —            30
Total items                                            116
Internal consistency reliabilities based on coefficient
Alpha for each section were reported. The reliabilities
were quite acceptable in reading, mathematics, and social
studies (0.84, 0.90, and 0.83, respectively). The science
test showed less reliability with a coefficient of 0.75. A
factor analysis also indicated that the science section was
less unifactorial than the reading, mathematics, and social
studies sections.
Differential item functioning analyses for ethnic and
gender groups were performed on all items by the Educational
Testing Service (ETS), using the MH procedure. Thin
matching was used for stratification, with the total section
score used as the matching variable. Very little DIF was
evident in the analyses, with the most being found in the
social studies area (NCES, 1991).
Brookshire (1993) investigated the presence of
differential item functioning in the NELS:88 test data,
using the MH procedure and thin matching. In that study,
the demographic subgroups identified were geographic region,
socioeconomic status, and urbanicity (urban, suburban, and
rural) designations. Similar to the ETS analysis, most of
the DIF was discovered in the social studies section.
CHAPTER 4
PRESENTATION AND ANALYSIS OF DATA
The purpose of this study was to compare applications
of two statistical methods in the detection of differential
testlet functioning (DTLF). The 29 testlets were
categorized into three groups: common passage, common
content, and common process. Only the first category,
common passage, includes testlets which fit the standard
definition of testlet, one where items are grouped together
on the test to answer questions related to the same passage,
figure, or case study. The other two categories are implied
testlets.
Because all 29 testlets were analyzed with three
different sample sizes, 87 testlet/sample size combinations
were considered. In this chapter, these 87 combinations
are referred to as cases.
Naming conventions for the testlets are shown in
Appendix A. The items corresponding to each testlet (with
testlet types) are listed in Appendix B.
Over 600 chi-squares were calculated during the data
analysis stage. For each of the three sample sizes, the 29
testlets were analyzed using three different types of chi-
squares, and the 116 individual items were analyzed using
one type of chi-square. All chi-square values with
associated probabilities are listed in Appendix C.
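As a quick check on the count, assume that each of the three testlet chi-squares (Mantel, uniform logistic discriminant, and non-uniform logistic discriminant) was computed once per testlet, and one chi-square once per item, at each of the three sample sizes. The total then works out as

$$3 \times (29 \times 3 + 116 \times 1) = 3 \times 203 = 609 ,$$

which agrees with the figure of over 600.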
For each research question, appropriate tallies were
made according to the comparison addressed by the question.
The tally sheets reduced the massive amount of data into
summaries. Exceptions or inconsistencies were marked on the
tally sheets. The marked chi-squares were then analyzed to
determine the degree of inconsistency.
The value of $\alpha = .001$ was chosen as the level of
significance. With the Mantel chi-square statistic,
rejection of the null hypothesis indicated that the groups
differed significantly on testlet performance, even when
matched on underlying ability. When using the uniform
logistic discriminant chi-square statistic, rejection of H0
indicated that the probability of group membership differed
significantly with the addition of testlet score into the
equation. For the non-uniform logistic situation, the
interaction between testlet score and observed score was
added for consideration. In all three statistics, rejection
of the null hypothesis flagged the testlet as potentially
containing DTLF.
Data Analysis
Sample Data
The data analyzed were drawn from the NELS:88 data
base. Each of the 8 explicit testlets and the 21 implicit
testlets (as described in Table 3) were analyzed for uniform
differential testlet functioning, using both the Mantel and
the logistic discriminant procedures. The logistic
discriminant procedure was also used to detect non-uniform
functioning. Each of the 116 single items in the test was
analyzed for differential item functioning for comparison
purposes.
The SAMPLE command in SPSS (SPSS Inc., 1990) was used
to select samples of examinees for the reference group and
the focal group, by chosen demographic subgroups. Gender
was chosen as the demographic variable, with males as the
reference group and females as the focal group.
Logistic Discriminant Function Analysis Procedure
The logistic regression procedure from SPSS was used to
perform logistic discriminant function analysis (Hosmer &
Lemeshow, 1989; Miller & Spray, 1993; Norusis, 1990). The
SPSS program is listed in Appendix D.
The dichotomous dependent variable was demographic
subgroup (gender). The independent variables were total
section score, testlet score, and the interaction (product)
of total section score and testlet score. The total section
score was considered the best available measure of overall
ability, the ability score.
To account for the collinearity between the total
section score and the testlet score with the interaction
product, the raw scores on the two variables were centered.
Centering consists of placing variables in the deviation
score form so that their means become zero (Aiken & West,
1991).
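The effect of centering can be seen in a small Python sketch. The scores below are invented for illustration and are not NELS:88 data; the helper names are likewise hypothetical. The sketch compares the correlation between the total score and the product term before and after both variables are put in deviation-score form.

```python
import math
from statistics import mean

def center(values):
    """Return deviation scores: each value minus the sample mean."""
    m = mean(values)
    return [v - m for v in values]

def pearson_r(a, b):
    """Pearson correlation between two equally long sequences."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

# Invented total section scores (X) and testlet scores (U) for five examinees.
X = [10, 12, 14, 16, 18]
U = [1, 3, 3, 3, 5]

raw_product = [x * u for x, u in zip(X, U)]
Xc, Uc = center(X), center(U)
centered_product = [x * u for x, u in zip(Xc, Uc)]

r_raw = pearson_r(X, raw_product)             # close to 1 for these scores
r_centered = pearson_r(Xc, centered_product)  # essentially 0 for these scores
print(r_raw, r_centered)
```

For these symmetric made-up scores the correlation with the product term disappears entirely after centering; with real data the reduction is usually partial rather than complete.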
Three models were fit to the data. First, the full
model includes all three independent variables. Second, the
reduced model deletes the interaction variable. Last, the
null model includes only the constant and the total section
score. For each model the $G^2$ (-2LL) model fit statistic was
computed.
The improvement statistic $G^2_{\mathrm{diff}}$ between the full and
reduced models was estimated. If the $G^2_{\mathrm{diff}}$ statistic was
significant, the null hypothesis of no improvement was
rejected, and non-uniform DTLF was suspected.
Similarly, $G^2_{\mathrm{diff}}$ between the reduced and null models was
calculated. Uniform DTLF was suspected if the statistic
showed significance.
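The three-model sequence can be sketched in plain Python as follows. This is not the SPSS program from Appendix D: the Newton-Raphson fitting routine, the simulated examinees, and all names are assumptions made only to show how the three -2LL values arise and how they are nested. The simulation makes a five-item testlet somewhat harder for the focal group, so some uniform effect is built in.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def minus_2ll(rows, groups, iterations=25):
    """Fit a logistic model by Newton-Raphson and return -2 log-likelihood."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iterations):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, g in zip(rows, groups):
            prob = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for i in range(p):
                grad[i] += (g - prob) * x[i]
                for j in range(p):
                    hess[i][j] += prob * (1.0 - prob) * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    ll = 0.0
    for x, g in zip(rows, groups):
        prob = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
        ll += math.log(prob) if g == 1 else math.log(1.0 - prob)
    return -2.0 * ll

# Simulated examinees (hypothetical): group 0 = reference, group 1 = focal.
random.seed(1)
n = 600
groups, x_raw, u_raw = [], [], []
for i in range(n):
    g = i % 2
    theta = random.gauss(0.0, 1.0)
    p_rest = 1.0 / (1.0 + math.exp(-theta))
    p_testlet = 1.0 / (1.0 + math.exp(-(theta - 0.8 * g)))
    rest = sum(random.random() < p_rest for _ in range(15))
    testlet = sum(random.random() < p_testlet for _ in range(5))
    groups.append(g)
    x_raw.append(rest + testlet)   # total section score
    u_raw.append(testlet)          # testlet score

# Center both scores before forming the interaction product.
x_mean, u_mean = sum(x_raw) / n, sum(u_raw) / n
x = [v - x_mean for v in x_raw]
u = [v - u_mean for v in u_raw]

g2_full = minus_2ll([[1.0, a, b, a * b] for a, b in zip(x, u)], groups)
g2_reduced = minus_2ll([[1.0, a, b] for a, b in zip(x, u)], groups)
g2_null = minus_2ll([[1.0, a] for a in x], groups)
print(g2_reduced - g2_full, g2_null - g2_reduced)
```

The first printed difference corresponds to the non-uniform test and the second to the uniform test; each would be referred to a chi-square distribution with one degree of freedom.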
Mantel Score Test Procedure
For the Mantel procedure, the total section score was
considered the best measure of overall ability for the trait
being measured by the testlets and the items within that
section. That score was used to stratify the sample
according to the total percentage matching procedure, with
the combined (reference and focal) percentages used for
calibration (Donoghue & Allen, 1993).
The FREQ procedure in SAS with the CMH option (SAS
Institute, 1990) was used to calculate the Mantel chi-square
statistics and the accompanying probabilities (Zwick et al.,
1993). The SAS code is listed in Appendix E.
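For readers without access to SAS, the Mantel statistic can be sketched in Python. This is not the code from Appendix E; it follows the standard form of the Mantel score test, in which the focal group's sum of testlet scores is compared with its expectation across ability strata. The data layout (one pair of score-to-count dictionaries per stratum) and the counts in the example are hypothetical.

```python
def mantel_chi_square(strata):
    """Mantel score test for an ordered response across matched strata.

    `strata` is a list of (reference_counts, focal_counts) pairs. Each element
    of a pair maps a testlet score to the number of examinees with that score.
    """
    sum_f = sum_e = sum_v = 0.0
    for ref, foc in strata:
        n_ref = sum(ref.values())
        n_foc = sum(foc.values())
        n = n_ref + n_foc
        if n < 2 or n_ref == 0 or n_foc == 0:
            continue  # a stratum with one group missing carries no information
        scores = set(ref) | set(foc)
        col = {y: ref.get(y, 0) + foc.get(y, 0) for y in scores}
        s1 = sum(y * c for y, c in col.items())        # sum of scores
        s2 = sum(y * y * c for y, c in col.items())    # sum of squared scores
        sum_f += sum(y * c for y, c in foc.items())    # observed focal sum
        sum_e += n_foc * s1 / n                        # expected focal sum
        sum_v += n_ref * n_foc * (n * s2 - s1 * s1) / (n * n * (n - 1))
    if sum_v == 0:
        return 0.0
    return (sum_f - sum_e) ** 2 / sum_v

# One hypothetical stratum with testlet scores 0, 1, 2.
example = [({0: 5, 1: 10, 2: 10}, {0: 10, 1: 10, 2: 5})]
print(mantel_chi_square(example))
```

The result is referred to a chi-square distribution with one degree of freedom. When both groups have identical score distributions in every stratum, the statistic is zero.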
Mantel-Haenszel Procedure
For individual items, the Mantel-Haenszel procedure was
used. This procedure is almost identical to the Mantel
score test procedure, with a slight adjustment in the
calculation of the chi-square (Holland & Thayer, 1988). A
Pascal program was written to calculate the chi-square value
of the individual items. The program listing is shown in
Appendix F.
In Mantel-Haenszel calculations, variable choices were
the same as in the Mantel procedure. The total section
score was used to stratify the sample, with combined
percentages used for calibration (Donoghue & Allen, 1993).
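A Python sketch of the item-level statistic follows. It is not the Pascal program from Appendix F. It assumes that the adjustment mentioned above is the usual continuity correction of 0.5 in the Mantel-Haenszel chi-square; that reading, the tuple layout, and the counts are assumptions made for illustration.

```python
def mantel_haenszel_chi_square(tables, continuity=0.5):
    """Mantel-Haenszel chi-square over a list of 2 x 2 tables, one per stratum.

    Each table is (a, b, c, d): reference correct, reference incorrect,
    focal correct, focal incorrect.
    """
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # variance is undefined for a single examinee
        n_ref, n_foc = a + b, c + d
        m1, m0 = a + c, b + d                # total correct, total incorrect
        sum_a += a                           # observed reference correct
        sum_e += n_ref * m1 / n              # expected under no DIF
        sum_v += n_ref * n_foc * m1 * m0 / (n * n * (n - 1))
    if sum_v == 0:
        return 0.0
    return max(abs(sum_a - sum_e) - continuity, 0.0) ** 2 / sum_v

# One hypothetical stratum: 15 of 25 reference and 10 of 25 focal correct.
example = [(15, 10, 10, 15)]
print(mantel_haenszel_chi_square(example))
print(mantel_haenszel_chi_square(example, continuity=0.0))
```

Setting `continuity=0.0` gives the same value as the Mantel score statistic applied to 0/1 scores, which shows how closely the two procedures are related.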
Results
This section includes references to percentages in both
the tables and the discussions. All percentages have been
rounded to the nearest whole percentage.
Research Question 1
Do the Mantel score test of conditional independence
procedure and the logistic discriminant function analysis
procedure detect the same differential testlet functioning
in the same testlets?
For this research question, the uniform LDFA result and
the Mantel result for uniform DTLF were compared for each
testlet at each sample size. The counts of cases where the
two statistics were inconsistent with regard to rejection of
the null hypothesis are shown in Table 5. Of the 87
different testlet/sample size cases, only 3 of the compared
pairs failed to match in H0 rejection at the 0.001 level of
significance. All three marked cases fall into the common
content category and are in the sample of 1,000.
Table 5
Non-Matches in H0 Rejection for LDFA-U vs. Mantel-U (p < .001)

                          Sample size
Testlet Type       500     1000    2000    Ratio
Common passage     0:8     0:8     0:8     0:24 (0%)
Common content     0:12    3:12    0:12    3:36 (8%)
Common process     0:9     0:9     0:9     0:27 (0%)
Total              0:29    3:29    0:29    3:87 (3%)
The values of the chi-squares and associated
probabilities for the marked cases are shown in Table 6.
Although the Mantel chi-squares did not lead to rejection at the
0.001 level, the chi-square statistics are relatively high,
and the probabilities are relatively low.
Table 6
Chi-Squares and Probabilities for Non-Matches

                       LDFA-U               Mantel
Testlet    N       Chi-Sq.    Prob.     Chi-Sq.    Prob.
T22        1000    * 13.271   .0003      9.852     .0017
T28        1000    * 18.442   .0000      7.765     .0053
T29        1000    * 13.402   .0003     10.12      .0015

* p < .001
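The matching rule used for these comparisons can be written as a short Python sketch: a case is a non-match when exactly one of the two statistics rejects the null hypothesis at the chosen level. The probabilities are those reported in Table 6; the function and variable names are hypothetical.

```python
ALPHA = 0.001

# (testlet, sample size, LDFA-U probability, Mantel probability) from Table 6
table_6 = [
    ("T22", 1000, 0.0003, 0.0017),
    ("T28", 1000, 0.0000, 0.0053),
    ("T29", 1000, 0.0003, 0.0015),
]

def is_non_match(p_a, p_b, alpha=ALPHA):
    """True when exactly one of the two tests rejects H0 at the given level."""
    return (p_a < alpha) != (p_b < alpha)

non_matches = [name for name, _, p_ldfa, p_mantel in table_6
               if is_non_match(p_ldfa, p_mantel)]
print(non_matches)
```

All three testlets are reported as non-matches because only the LDFA-U probability falls below 0.001 in each row.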
Research Question 2
Do the Mantel score test of conditional independence
procedure and the logistic discriminant function analysis
procedure detect both uniform and non-uniform differential
testlet functioning to the same extent?
For this research question, two comparisons were made.
For both comparisons, the LDFA non-uniform chi-square (the
only chi-square statistic used to flag non-uniform DTLF) was
used as the basis for comparison. The non-uniform statistic
was compared to each of the uniform DTLF detection methods,
regarding whether or not the null hypothesis was rejected.
(See Appendix C.)
First, the non-uniform LDFA statistic was compared to
the uniform LDFA statistic, regarding whether or not the
null hypothesis was rejected. Counts of cases showing
inconsistency in rejection of H0 are indicated in Table 7.
A case is counted when one, but not both, of the chi-squares
for a case is significant at the 0.001 level.
Table 7
Non-Matches in H0 Rejection for LDFA-N vs. LDFA-U (p < .001)

                          Sample size
Testlet Type       500     1000    2000    Ratio
Common passage     —       —       3       3:24 (13%)
Common content     1       4       4       9:36 (25%)
Common process     —       2       5       7:27 (26%)
Total                                      19:87 (22%)
Data in Table 7 show that 19 of the 87 possible
testlet/sample size cases, or 22%, did not match in
rejection of the null hypothesis when the non-uniform
statistic was compared with the logistic uniform statistic.
The number of marked cases increased with the increase in
the size of the sample. Matches in H0 rejection were not
expected in comparing uniform with non-uniform, because a
testlet may contain differential functioning of only one
type. About 13% of the common passage cases, as well as 25%
of common content and 26% of common process cases, are
marked.
Second, the non-uniform rejections were compared to the
Mantel uniform rejections. The cases where one was rejected
but not both are counted in Table 8.
Table 8
Non-Matches in H0 Rejection for LDFA-N vs. Mantel-U (p < .001)

                          Sample size
Testlet Type       500     1000    2000    Ratio
Common passage     —       —       3       3:24 (13%)
Common content     1       1       4       6:36 (17%)
Common process     —       2       5       7:27 (26%)
Total                                      16:87 (18%)
The percentage of non-matching cases in Table 8 is
approximately 18%, which is similar to results presented in
Table 7. The total number of non-matching cases increased as sample size
increased. Again, matches were not expected. The testlet
types show variation in Table 8, with 13% of common passage
cases marked, 17% of common content, and 26% of common
process cases.
Research Question 3
To what extent does variation in sample size influence
detection of differential testlet functioning?
The occurrences of significant chi-squares are
displayed in Table 9 according to sample size. At the 0.001
level of significance, only 1 testlet out of 29 (3%) was
flagged for differential testlet functioning in the sample
of 500. As the sample size increased, more cases were
flagged, with 8 testlets out of 29 (28%) showing DTLF in the
sample of 1,000, and 17 out of 29 (59%) in the sample of
2,000. Overall, approximately 30% of the 87 possibilities
indicated one or more types of DTLF.
Table 9
Summary of Significant Chi-Squares (p < .001)

                                  Sample size
Testlet type       500          1000          2000           Ratio
Common passage     —            —             4              4:24 (17%)
Common content     1            6             8              15:36 (42%)
Common process     —            2             5              7:27 (26%)
Total ratios       1:29 (3%)    8:29 (28%)    17:29 (59%)    26:87 (30%)
Data in Table 9 show that 17% of the common passage
cases were marked, while 42% of the common content and 26%
of the common process types of cases were flagged. Common
content testlets show the most potential DTLF, and common
passage testlets show the least.
Table 10
Occurrences of Significant Chi-Squares (p < .001)

                                  Sample size
Testlet Type      Testlet    500     1000    2000
Common passage    T2         —       —       NUM
                  T3         —       —       N
                  T9         —       —       N
                  T26        —       —       UM
Common content    T12        —       —       N
                  T15        UM      UM      UM
                  T19        —       NUM     NUM
                  T20        —       —       N
                  T21        —       NUM     N
                  T22        —       U       NUM
                  T28        —       U       NUM
                  T29        —       U       NUM
Common process    T6         —       —       N
                  T8         —       —       N
                  T23        —       —       N
                  T24        —       N       N
                  T25        —       N       N

Note: Statistical methods: N = Non-Uniform LDFA-N, U = Uniform LDFA-U, M = Mantel Uniform
The statistical methods with significant chi-squares
are coded in Table 10. Only non-uniform DTLF was flagged in
common process testlets, with a variety of DTLF in other
types of testlets.
Research Question 4
How do the results of differential item functioning
differ from differential testlet functioning when the Mantel
score test of conditional independence procedure is used for
both analyses?
This is the only research question which addresses the
116 individual items which make up the testlets, and
possible occurrence of differential item functioning (DIF).
The items with significant chi-squares at the 0.001
level of significance are listed in Table 11. As the sample
size increased, so did the number of significant chi-
squares, with most items occurring in the sample of 2,000.
In comparing item significance to corresponding testlet
significance, the Mantel-Haenszel chi-square for items and
the Mantel chi-square for testlets were used. (See Appendix
C.) The situation was considered inconsistent if the
testlet and at least half of the items did not match in H0
rejection.
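One way to read this rule is sketched below in Python: a case is consistent when the testlet decision agrees with whether at least half of the testlet's items were flagged. This is an interpretation of the sentence above rather than a documented algorithm from the study, and the function name is hypothetical. The example counts are taken from Table 13 and the testlet decisions from Table 10.

```python
def is_inconsistent(testlet_significant, items_flagged, items_total):
    """Compare the testlet decision with the 'at least half of items' criterion."""
    at_least_half = 2 * items_flagged >= items_total
    return testlet_significant != at_least_half

# T15 at N = 500: testlet significant, 0 of 4 items flagged.
print(is_inconsistent(True, 0, 4))    # inconsistent
# T21 at N = 1000: testlet significant, 1 of 2 items flagged.
print(is_inconsistent(True, 1, 2))    # consistent
# T21 at N = 500: testlet not significant, 1 of 2 items flagged.
print(is_inconsistent(False, 1, 2))   # inconsistent
```

Under this reading a testlet with no flagged items that is itself not significant counts as consistent.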
Table 11
Items Showing Significance at p < .001

                        Sample size
Section            500     1000    2000
Reading            —       —       RE6
                   —       —       RE12
                   —       —       RE13
Mathematics        —       —       MA12
                   —       MA20    MA20
                   —       MA25    MA25
                   —       —       MA28
Science            SC4     SC4     SC4
                   —       —       SC12
                   —       —       SC15
Social Studies     —       —       SS11
                   —       SS12    SS12
                   SS21    SS21    SS21
The instances showing inconsistency between testlet and
item H0 rejection are counted in Table 12. Twelve out of 87
instances (about 14%) show some degree of inconsistency.
The common content cases showed more inconsistencies
than the other two types. Ten (28%) of the common
content cases were marked, compared to two and one of the
other types (8% and 4%).
Table 12
Inconsistencies Between Testlets and Associated Items

                          Sample size
Testlet Type       500     1000    2000    Ratio
Common passage     —       —       2       2:24 (8%)
Common content     2       2       6       10:36 (28%)
Common process     —       1       —       1:27 (4%)
Total ratio                                12:87 (14%)
For the inconsistent cases flagged in Table 12, it is
interesting to note how many of the items were flagged for
DIF. The number of items flagged for DIF, out of the number
of items on a testlet, are shown in Table 13. Only the
testlets in Table 12 were analyzed.
Table 13
Ratio of Significant Items to Testlet Items in Flagged Cases

                                  Sample size
Testlet type      Testlet    500       1000      2000
Common passage    T2         —         —         1:3
                  T26        —         —         0:5
Common content    T15        0:4       1:4       1:4
                  T19        —         0:8       1:8
                  T21        * 1:2     —         * 1:2
                  T22        —         —         2:8
                  T28        —         —         1:14
                  T29        —         —         0:13
Common process    T18        —         * 2:4     —

Note: In most cases, the testlet was significant and the items were not significant. The * indicates that the items were significant and the testlet was not significant.
CHAPTER 5
FINDINGS, CONCLUSIONS, AND RECOMMENDATIONS
The purpose of this study was to compare results of
distinctive statistical methodologies in searching for
differential functioning in test items and testlets. Five
factors were varied: statistical method, uniformity,
testlet type, unit of analysis, and sample size. Subjects
were randomly chosen from the NELS:88 data base of over
25,000 eighth-grade students. Scores were analyzed for
differential functioning using programs written in SPSS,
SAS, and Pascal.
The EXCEL spreadsheet was used to organize the data.
Comparisons were made on the five factors that were varied
in the design of the study.
In this chapter, the logistic discriminant procedure
which compares the full model with the reduced model and is
used to detect non-uniform functioning is termed LDFA-N.
Similarly, the procedure used to detect uniform functioning
which compares the reduced model with the null model is
termed LDFA-U.
Findings
Research Question 1
Do the Mantel score test of conditional independence
procedure and the logistic discriminant function analysis
procedure detect the same differential testlet functioning
in the same testlets?
In general, the two methods showed similar results.
Less than 4% of the cases failed to match when the Mantel
and LDFA-U results were compared. The three cases which
were rejected by the LDFA-U and not by the Mantel were close
in the chi-square values and probabilities.
All cases in the common passage category, the eight
explicit testlets, matched in all three varieties of sample
sizes. Therefore, using the most common definition of a
testlet, the results indicate that the two methodologies had
perfect consistency in detection of differential testlet
functioning.
All three cases which failed to show consistency were
in the second category, the common content testlets. These
were testlets with items which were not presented together
but contained the same type of content. In the other
implied testlet type, common process, all cases showed
consistency in comparison of the statistical methods.
Research Question 2
Do the Mantel score test of conditional independence
procedure and the logistic discriminant function analysis
procedure detect both uniform and non-uniform differential
testlet functioning to the same extent?
Of the three processes used to inspect testlets for
DTLF, only the LDFA-N was expected to reveal non-uniform
DTLF. Two procedures were presumed to reveal uniform DTLF:
the Mantel uniform and the LDFA-U.
In both situations of comparing LDFA-N results with the
two uniform procedures, the uniform and non-uniform failed
to match in 18% to 22% of the comparisons. As anticipated,
the uniform and non-uniform results were not consistent.
When the LDFA-N results were contrasted to the uniform
LDFA-U results, the findings were almost identical with
comparison of LDFA-N to the Mantel uniform results. Only 3
cases out of 87 showed inconsistency. All three cases were
in the second category of testlet types, the common content
testlets. No common passage or common process cases showed
inconsistency between the two comparisons. In all three
instances of incompatibility, the LDFA-U approach flagged
the testlet where the Mantel did not, possibly indicating that
the logistic discriminant technique is the stronger method.
Research Question 3
To what extent does variation in sample size influence
detection of differential testlet functioning?
In general, the number of testlets indicating
differential testlet functioning increased as a function of
sample size. As the sample size increased from 500 to 1,000
to 2,000, the percentages of testlets showing possible DTLF
increased from 3%, to 28%, and finally to 59%.
Most of the DTLF occurred in the implicit types of
testlets: those where items were spread throughout the test
and not grouped by a common passage. The testlets implied
by common content, the second testlet category, had the
highest rate of DTLF, with 42% of the cases flagged. Those
cases with item grouping implied by common processes had a
rate of 26%, while only 17% of the conventional common
passage cases showed possible DTLF.
Research Question 4
How do the results of differential item functioning
differ from differential testlet functioning when the Mantel
procedure is used for both analyses?
The Mantel testlet outcomes were compared with the
Mantel-Haenszel item outcomes. The greatest percentage of
inconsistencies, 28%, occurred in the common content type of
testlets, with the other two types showing only 8% and 4%.
In most instances of inconsistency, the testlets were
significant where less than half of the items were
significant. There is no clear explanation for this
phenomenon. In one of the two marked cases in the standard
common passage testlets, none of the five items had
significant chi-squares and yet the testlet was selected.
This was the only research question to address
individual items. As in testlet results, the increase in
the sample size was positively correlated with the increase
in the number of items selected. In the sample of 500, only
2 items were flagged for DIF. As the sample increased to
1,000, the same 2 items and 3 more were indicated. In the
largest sample, the Mantel-Haenszel procedure selected the
same 5 items and 8 more, making a total of 13 items with
potential differential functioning. The flagged items in
the largest sample were rather evenly spread among the four
sections of the NELS:88 test: reading, mathematics,
science, and social studies.
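The Mantel-Haenszel item statistic referred to above can be sketched as follows. This is a minimal illustration, assuming the usual form of the statistic with a continuity correction and one 2 x 2 table (group by right/wrong) per matched score level. The function name and the example counts are hypothetical and are not taken from the NELS:88 data.

```python
def mantel_haenszel_chi_square(strata):
    """Mantel-Haenszel chi-square (1 df) with continuity correction.

    strata: iterable of (a, b, c, d) counts per matched score level, where
    a = reference right, b = reference wrong,
    c = focal right,     d = focal wrong.
    """
    sum_a = sum_expected = sum_variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with fewer than two examinees carries no information
        sum_a += a
        sum_expected += (a + b) * (a + c) / n
        sum_variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if sum_variance == 0:
        return 0.0
    return max(0.0, abs(sum_a - sum_expected) - 0.5) ** 2 / sum_variance


# Hypothetical counts for a single score level.
no_dif = mantel_haenszel_chi_square([(10, 10, 10, 10)])
with_dif = mantel_haenszel_chi_square([(15, 5, 5, 15)])
```

In practice there would be one table per level of the matching score, and the resulting value would be compared with a chi-square distribution on one degree of freedom.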
Conclusions
This study contributes new information to the
literature of differential testlet functioning as well as
verifying previous research. The common capabilities of the
LDFA-U method and the Mantel method in detecting DTLF were
revealed once again, as was the lack of power of the
Mantel score test procedure to detect non-uniform
differential functioning. As the sample size increased, the
number of testlets with suspected DTLF and of items with
potential DIF also increased.
Of particular interest was the abundance of anomalies
in the second category of testlets, the common content
testlets. These implied testlets were composed of items
which were not physically grouped together on the test, but
merely contained similar content. Most previous research
involving testlets has used only common passage testlets,
which are grouped together with a reading passage, picture,
or case study. The common passage testlets generally
performed as expected in this study. But the implied
testlets showed inconsistent performance.
Perhaps the most interesting finding was the
inconsistency between some of the testlets and associated
items regarding detection of differential functioning. Only
two of the standard common passage testlets showed
inconsistency, but there is no indication of a reason for
such erratic results.
Recommendations
There is a scarcity of research available in the area
of testlets, and particularly of differential testlet
functioning. This study opens up several possibilities for
a number of research projects.
DIF and DTLF Inconsistencies
Why do some testlets show a likelihood for differential
functioning when none of the items, or only a small
percentage of the items, show such a likelihood? The need
to scrutinize possible explanations and form hypotheses
exists. Then studies based upon the hypotheses can be
planned and performed.
Post Hoc Tests
If a testlet is flagged as a potential carrier of DTLF,
there is no clear follow-up procedure for verifying the
degree of differential functioning. Post hoc procedures are
needed for test developers to further screen testlets for
inclusion on tests.
Implicit Testlets
Very few, if any, previous studies have used testlets
which are defined by common content or common process, as
opposed to the standard definition of a common passage.
More research is needed in this area.
Polytomous Models
Polytomous DIF models are appropriate for DTLF
exploration methodologies. If dichotomous scoring
procedures are used with testlet scores (for example, pass
or fail), then much statistical information is wasted.
Two polytomous DIF models, the Mantel score test method
and the logistic discriminant function analysis method, were
chosen for this study. Because the comparisons already
spanned uniform versus non-uniform functioning, sample size,
and testlet DTLF versus single-item DIF, the study was
limited to two models to keep its complexity at an
appropriate level.
The level of response allowed by the various polytomous
methods was a deciding factor in choosing the methods for
this study. The lowest appropriate response level for a
testlet based method is ordinal level, because the possible
testlet responses are ordered. Again, using a lower
measurement scale classification results in wasted
statistical information and lower precision. The Mantel and
discriminant methods both analyze ordinal level responses.
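As an illustration of how ordered testlet scores enter the Mantel procedure, the sketch below assumes the standard form of the Mantel score test: within each matched stratum the focal group's sum of testlet scores is compared with its expectation under no differential functioning, and the squared total difference is divided by the summed variance. The function name and the example scores are hypothetical.

```python
def mantel_score_chi_square(strata):
    """Mantel score test chi-square (1 df) for ordered testlet scores.

    strata: iterable of (reference_scores, focal_scores) pairs, one pair of
    lists of testlet scores per level of the matching variable.
    """
    sum_focal = sum_expected = sum_variance = 0.0
    for reference, focal in strata:
        n_ref, n_foc = len(reference), len(focal)
        n = n_ref + n_foc
        if n < 2 or n_ref == 0 or n_foc == 0:
            continue  # no comparison is possible in this stratum
        pooled = list(reference) + list(focal)
        s1 = sum(pooled)                 # sum of scores in the stratum
        s2 = sum(y * y for y in pooled)  # sum of squared scores
        sum_focal += sum(focal)
        sum_expected += n_foc * s1 / n
        sum_variance += n_ref * n_foc * (n * s2 - s1 * s1) / (n * n * (n - 1))
    if sum_variance == 0:
        return 0.0
    return (sum_focal - sum_expected) ** 2 / sum_variance


# Hypothetical three-point testlet (scores 0 to 2), one stratum each.
same_distribution = mantel_score_chi_square([([0, 1, 2], [0, 1, 2])])
shifted_distribution = mantel_score_chi_square([([2, 2, 2, 1], [0, 0, 1, 1])])
```

Because the statistic uses only the sum of scores, it keeps the ordering of the response categories without requiring a separate model for each category.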
Some of the logistic regression models consider
ordering of responses but were not chosen because they
require many separate model estimations and interpretation
is confusing (Swaminathan & Rogers, 1990). Miller and Spray
(1992) found in a simulation study that the continuation
ratio logit analysis method failed to flag nonuniform DIF
under certain conditions. Both the Mantel and discriminant
methods produce statistics that allow straightforward
interpretation of research results (Miller & Spray, 1992;
Miller & Spray, 1993).
The other polytomous methods discussed in the
literature review section, but not included in this study,
are limited to the nominal level response categories. Those
models are the generalized Cochran Mantel Haenszel (Zwick,
Donoghue, & Grima, 1993), other logistic regression
procedures (Agresti, 1990; Miller & Spray, 1993), t-test
procedures HW1 and HW3 (Welch & Hoover, 1993), and Bock's
nominal model (Wainer, Sireci, & Thissen, 1991).
The latent trait polytomous methods, mentioned in the
literature review, are not included in this study. IRT-
based procedures are "sensitive to sample size and model-
data fit and are expensive in terms of computer time"
(Swaminathan & Rogers, 1990). Future research efforts
should compare polytomous latent trait methods with observed
score methods for testlet based tests.
Testlet Design and Scoring
In the context of screening paper and pencil test
testlets for future use in testlet pools for adaptive
testing, only the linear structured testlets are considered.
In a CAT screening format it is possible to analyze testlets
with a hierarchical structure.
The number right was the testlet score used in this
study. Other testlet scoring strategies have been offered
in the literature. For example, Wainer and Kiely (1987)
discussed using the response pattern score for
hierarchically structured testlets.
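Number-right scoring of a testlet amounts to counting the correct responses among its items. A minimal sketch follows, using the item membership listed in Appendix B for two reading testlets; the examinee's responses and all names in the code are hypothetical.

```python
# Item membership copied from Appendix B for two of the reading testlets.
TESTLET_ITEMS = {
    "T2": ["RE6", "RE7", "RE8"],
    "T5": ["RE19", "RE20", "RE21"],
}


def number_right(responses, items):
    """Count the items answered correctly (1 = right, 0 = wrong)."""
    return sum(responses[item] for item in items)


# Hypothetical scored responses for one examinee.
examinee = {"RE6": 1, "RE7": 0, "RE8": 1, "RE19": 1, "RE20": 1, "RE21": 1}
scores = {name: number_right(examinee, items)
          for name, items in TESTLET_ITEMS.items()}
```

A three-item testlet scored this way has four ordered score levels (0 through 3), which is why ordinal procedures are a natural fit.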
Ability Score
Typically the total test score is used for conditioning
on the ability measure in observed score methods of
differential functioning analysis, as discussed in the
literature review. In this study, the total section score
was used as the ability measure in both procedures.
The flexibility of the logistic discriminant procedure
would allow other scores to be investigated as predictors of
group membership. For example, a separate test purported to
measure the overall ability of interest could be used as an
independent variable in the equation (Miller & Spray, 1993).
Conceivably, demographic variables could be used as
conditioning variables to predict group membership.
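The logic of predicting group membership from the ability score and the testlet score can be sketched as below. This assumes the general approach described by Miller and Spray: a logistic model for group membership is fitted with the matching score alone, then with the testlet score added (uniform DTLF), then with their product added (non-uniform DTLF), and successive models are compared by likelihood ratio chi-squares. The Newton-Raphson fitting routine, the function names, and the data are all hypothetical illustrations, not the software used in this study.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def probability(beta, row):
    """Logistic probability of focal-group membership for one examinee."""
    return 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))


def max_log_likelihood(design, groups, iterations=25):
    """Fit a logistic regression by Newton-Raphson; return the fitted log-likelihood."""
    k = len(design[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, g in zip(design, groups):
            p = probability(beta, row)
            for i in range(k):
                grad[i] += (g - p) * row[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * row[i] * row[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return sum(math.log(probability(beta, row) if g == 1
                        else 1.0 - probability(beta, row))
               for row, g in zip(design, groups))


def ldfa_chi_squares(total, testlet, groups):
    """Return (uniform, non-uniform) likelihood ratio chi-squares, 1 df each."""
    base = [[1.0, x] for x in total]
    main = [[1.0, x, y] for x, y in zip(total, testlet)]
    full = [[1.0, x, y, x * y] for x, y in zip(total, testlet)]
    ll_base, ll_main, ll_full = (max_log_likelihood(d, groups)
                                 for d in (base, main, full))
    return 2.0 * (ll_main - ll_base), 2.0 * (ll_full - ll_main)


# Hypothetical data: at every matching score the focal group (1) tends to
# earn lower testlet scores than the reference group (0).
total, testlet, group = [], [], []
for x in range(5):          # matching (ability) score
    for y in range(3):      # testlet score
        for g, count in ((0, 1 + y), (1, 3 - y)):
            total += [x] * count
            testlet += [y] * count
            group += [g] * count
uniform, non_uniform = ldfa_chi_squares(total, testlet, group)
```

Substituting a different conditioning score only changes the `total` column of the design, which is the flexibility noted above.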
Thick and Thin Matching
As noted in the literature review, a study by Donoghue
and Allen (1993) offered seven different levels of matching,
from thin through various degrees of thickness to no
matching (as an extreme by which to compare the other
methods). This study used one type of thick matching (total
percentage matching) which is reported to be the most
appropriate for Mantel chi-square procedures. More types of
matching should be compared in studies using Mantel or
Mantel-Haenszel procedures.
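The difference between thin and thick matching can be illustrated with a short sketch. It assumes that thin matching treats every total score as its own stratum and that thick matching pools adjacent scores; the equal-width pooling shown here is only one simple choice and is not necessarily the total percentage matching used in this study. The function name and scores are hypothetical.

```python
from collections import defaultdict


def match_strata(total_scores, width=1):
    """Group examinee indices into strata of `width` adjacent total scores.

    width=1 gives thin matching; larger widths give progressively thicker matching.
    """
    strata = defaultdict(list)
    for index, score in enumerate(total_scores):
        strata[score // width].append(index)
    return dict(strata)


# Hypothetical total scores for six examinees.
totals = [0, 1, 2, 3, 4, 5]
thin = match_strata(totals)            # six strata, one examinee each
thick = match_strata(totals, width=3)  # two strata of three examinees
```

Thicker strata give larger cell counts per stratum at the cost of less exact matching on ability.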
Other Issues
Many other questions arise from areas of this study or
from current literature. Just a few of the questions are
listed here.
1. In the context of adaptive testing, do various
paths differentiate between two subgroups?
2. What causes DIF or DTLF? Sometimes DIF items have
no logical reason to be flagged with differential
functioning.
3. What cognitive processes cause uniform and non-
uniform DTLF?
APPENDIX A
TESTLET NAMING CONVENTION
Common Passage Testlets (8):
  Reading Testlets: T1-T5
  Math Testlets: T9-T10
  Social Studies Testlet: T26
Common Content Testlets (12):
  Math Testlets: T11-T15
  Science Testlets: T19-T22
  Social Studies Testlets: T27-T29
Common Process Testlets (9):
  Reading Testlets: T6-T8
  Math Testlets: T16-T18
  Science Testlets: T23-T25
APPENDIX B
NELS:88 TESTLET ITEMS
Reading Testlets:
T1 = RE1 to RE5    passage
T2 = RE6 to RE8    passage
T3 = RE9 to RE14    passage
T4 = RE15 to RE18    passage
T5 = RE19 to RE21    passage
T6 = RE1 to RE3, RE6    process:repro-detail
T7 = RE4, RE5, RE7, RE10 to RE14,
     RE16 to RE21    process:inference/eval
T8 = RE8, RE9, RE15    process:comprehension

Mathematics Testlets:
T9 = MA2, MA3    passage
T10 = MA6, MA7    passage
T11 = MA1, MA4, MA7, MA14, MA15, MA26,
      MA27, MA29, MA34, MA39, MA40    content:algebra
T12 = MA2, MA3, MA21, MA24    content:data/prob
T13 = MA5, MA8, MA9, MA10, MA12, MA13,
      MA16 to MA20, MA22, MA23, MA28,
      MA30 to MA33, MA36    content:arithmetic
T14 = MA6, MA35    content:adv topics
T15 = MA11, MA25, MA37, MA38    content:geometry
T16 = MA1, MA3, MA5, MA6, MA8, MA9,
      MA12, MA13, MA15 to MA19, MA22,
      MA25, MA34, MA40    process:skill/know
T17 = MA2, MA4, MA7, MA10, MA11, MA14,
      MA20, MA21, MA24, MA26, MA27, MA29,
      MA31, MA32, MA33, MA36 to MA39    process:und/comp
T18 = MA23, MA28, MA30, MA35    process:prob solv

Science Testlets:
T19 = SC1, SC2, SC5, SC7, SC8, SC12, SC18,
      SC21    content:earth sci
T20 = SC3, SC10, SC11, SC14, SC19, SC20,
      SC23    content:chemistry
T21 = SC4, SC24    content:sci method
T22 = SC6, SC9, SC13, SC15 to SC17,
      SC22, SC25    content:life sci
T23 = SC1, SC4, SC13, SC14, SC20, SC22,
      SC23, SC25
T24 = SC2, SC5, SC6, SC8 to SC10, SC12,
      SC15, SC18, SC19
T25 = SC3, SC7, SC16, SC17, SC21, SC24
process:prob solv
process:decl know
Social Studies Testlets:
T26 = SS5 to SS9    passage
T27 = SS1, SS12, SS26    content:geography
T28 = SS2, SS4, SS10, SS11, SS13, SS14,
      SS17, SS18, SS20, SS21, SS25, SS27,
      SS28, SS29    content:history
T29 = SS3, SS5 to SS9, SS15, SS16, SS19,
      SS22 to SS24, SS30    content:citizenship
APPENDIX C
CHI-SQUARE AND PROBABILITY VALUES FOR TESTLETS AND ITEMS
SUMMARY (N=500)
TESTLET 1
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.137   0.7113
LDFA-DTLF (Uniform)      0.020   0.8875
Mantel-DTLF (Uniform)    0.073   0.7870
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE1    0.153   0.6958
RE2    0.760   0.3834
RE3    0.013   0.9110
RE4    0.001   0.9789
RE5    0.283   0.5948

TESTLET 2
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  5.217   0.0224
LDFA-DTLF (Uniform)      0.308   0.5789
Mantel-DTLF (Uniform)    0.244   0.6213
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE6    0.606   0.4365
RE7    0.301   0.5835
RE8    0.312   0.5766

TESTLET 3
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.255   0.2626
LDFA-DTLF (Uniform)      0.000   1.0000
Mantel-DTLF (Uniform)    0.002   0.9643
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE9    0.796   0.3724
RE10   0.574   0.4489
RE11   0.001   0.9761
RE12   2.518   0.1125
RE13   0.375   0.5405
RE14   0.818   0.3659

TESTLET 4
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.702   0.1920
LDFA-DTLF (Uniform)      0.001   0.9748
Mantel-DTLF (Uniform)    0.004   0.9496
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE15   0.093   0.7601
RE16   0.034   0.8539
RE17   0.005   0.9453
RE18   0.204   0.6514

TESTLET 5
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.083   0.7733
LDFA-DTLF (Uniform)      0.659   0.4169
Mantel-DTLF (Uniform)    0.478   0.4893
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE19   0.226   0.6346
RE20   0.011   0.9153
RE21   0.249   0.6177
TESTLET 6
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  2.516   0.1127
LDFA-DTLF (Uniform)      1.374   0.2411
Mantel-DTLF (Uniform)    1.691   0.1935
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE1    0.153   0.6957
RE2    0.760   0.3833
RE3    0.013   0.9092
RE6    0.606   0.4363

TESTLET 7
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.817   0.3661
LDFA-DTLF (Uniform)      0.329   0.5662
Mantel-DTLF (Uniform)    0.182   0.6697
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE4    0.001   0.9748
RE5    0.283   0.5947
RE7    0.301   0.5833
RE10   0.574   0.4487
RE11   0.001   0.9748
RE12   2.518   0.1126
RE13   0.375   0.5403
RE14   0.818   0.3658
RE16   0.034   0.8537
RE17   0.005   0.9436
RE18   0.204   0.6515
RE19   0.226   0.6345
RE20   0.011   0.9165
RE21   0.249   0.6178

TESTLET 8
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  2.313   0.1283
LDFA-DTLF (Uniform)      0.168   0.6819
Mantel-DTLF (Uniform)    0.209   0.6476
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE8    0.312   0.5765
RE9    0.796   0.3723
RE15   0.093   0.7604

TESTLET 9
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.148   0.2840
LDFA-DTLF (Uniform)      0.023   0.8795
Mantel-DTLF (Uniform)    0.023   0.8795
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA2    6.468   0.0110
MA3    4.204   0.0403
TESTLET 10
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.145   0.7034
LDFA-DTLF (Uniform)      0.516   0.4726
Mantel-DTLF (Uniform)    0.279   0.5974
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA6    0.582   0.4455
MA7    0.002   0.9643

TESTLET 11
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.171   0.6792
LDFA-DTLF (Uniform)      5.262   0.0218
Mantel-DTLF (Uniform)    5.085   0.0241
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA1    0.006   0.9383
MA4    0.000   1.0000
MA7    0.002   0.9643
MA14   0.158   0.6910
MA15   0.156   0.6929
MA26   0.188   0.6646
MA27   0.055   0.8146
MA29   2.713   0.0995
MA34   2.646   0.1038
MA39   0.876   0.3493
MA40   3.599   0.0578

TESTLET 12
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.771   0.1833
LDFA-DTLF (Uniform)      1.305   0.2533
Mantel-DTLF (Uniform)    1.453   0.2280
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA2    6.468   0.0110
MA3    4.204   0.0403
MA21   3.257   0.0711
MA24   0.053   0.8179
TESTLET 13
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.420   0.5169
LDFA-DTLF (Uniform)      1.961   0.1614
Mantel-DTLF (Uniform)    1.023   0.3118
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA5    5.106   0.0238
MA8    0.011   0.9165
MA9    2.850   0.0914
MA10   1.605   0.2052
MA12   2.709   0.0998
MA13   0.295   0.5870
MA16   6.268   0.0123
MA17   0.099   0.7530
MA18   0.431   0.5115
MA19   0.328   0.5668
MA20   6.314   0.0120
MA22   0.371   0.5425
MA23   2.560   0.1096
MA28   5.008   0.0252
MA30   5.139   0.0234
MA31   0.011   0.9165
MA32   0.009   0.9244
MA33   0.081   0.7759
MA36   0.315   0.5746

TESTLET 14
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.465   0.2261
LDFA-DTLF (Uniform)      0.494   0.4821
Mantel-DTLF (Uniform)    0.346   0.5564
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA6    0.582   0.4455
MA35   0.001   0.9748

TESTLET 15
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.200   0.6547
LDFA-DTLF (Uniform)      15.346  0.0001  *
Mantel-DTLF (Uniform)    13.454  0.0002  *
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA11   2.521   0.1123
MA25   5.799   0.0160
MA37   3.794   0.0514
MA38   1.545   0.2139
TESTLET 16
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.095   0.7579
LDFA-DTLF (Uniform)      0.022   0.8821
Mantel-DTLF (Uniform)    0.016   0.8993
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA1    0.006   0.9383
MA3    4.204   0.0403
MA5    5.106   0.0238
MA6    0.582   0.4455
MA8    0.011   0.9165
MA9    2.850   0.0914
MA12   2.709   0.0998
MA13   0.295   0.5870
MA15   0.156   0.6929
MA16   6.268   0.0123
MA17   0.099   0.7530
MA18   0.431   0.5115
MA19   0.328   0.5668
MA22   0.371   0.5425
MA25   5.799   0.0160
MA34   2.646   0.1038
MA40   3.599   0.058

TESTLET 17
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.010   0.3149
LDFA-DTLF (Uniform)      0.772   0.3796
Mantel-DTLF (Uniform)    0.438   0.5081
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA2    6.468   0.0110
MA4    0.000   1.0000
MA7    0.002   0.9643
MA10   1.605   0.2052
MA11   2.521   0.1123
MA14   0.158   0.6910
MA20   6.314   0.0120
MA21   3.257   0.0711
MA24   0.053   0.8179
MA26   0.188   0.6646
MA27   0.055   0.8146
MA29   2.713   0.0995
MA31   0.011   0.9165
MA32   0.009   0.9244
MA33   0.081   0.7759
MA36   0.315   0.5746
MA37   3.794   0.0514
MA38   1.545   0.2139
MA39   0.876   0.3493
TESTLET 18
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.413   0.5205
LDFA-DTLF (Uniform)      2.488   0.1147
Mantel-DTLF (Uniform)    2.328   0.1271
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA23   2.560   0.1096
MA28   5.008   0.0252
MA30   5.139   0.0234
MA35   0.001   0.9748

TESTLET 19
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  6.277   0.0122
LDFA-DTLF (Uniform)      8.570   0.0034
Mantel-DTLF (Uniform)    10.398  0.0013
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
SC1    0.257   0.6122
SC2    1.060   0.3032
SC5    0.065   0.7988
SC7    2.043   0.1529
SC8    3.866   0.0493
SC12   1.756   0.1851
SC18   0.198   0.6563
SC21   0.703   0.4018

TESTLET 20
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.459   0.2271
LDFA-DTLF (Uniform)      0.565   0.4523
Mantel-DTLF (Uniform)    0.871   0.3507
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
SC3    1.065   0.3021
SC10   0.002   0.9643
SC11   0.029   0.8648
SC14   0.121   0.7280
SC19   1.641   0.2002
SC20   0.162   0.6873
SC23   0.157   0.6919

TESTLET 21
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  8.013   0.0046
LDFA-DTLF (Uniform)      7.566   0.0059
Mantel-DTLF (Uniform)    6.821   0.0090
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
SC4    12.167  0.0005  *
SC24   0.004   0.9496
TESTLET 22
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.281   0.2577
LDFA-DTLF (Uniform)      4.989   0.0255
Mantel-DTLF (Uniform)    2.673   0.1021
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
SC6    0.009   0.9244
SC9    0.628   0.4281
SC13   0.000   1.0000
SC15   0.279   0.5974
SC16   1.514   0.2185
SC17   0.004   0.9496
SC22   1.980   0.1594
SC25   0.050   0.8231

TESTLET 23
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.995   0.1578
LDFA-DTLF (Uniform)      2.015   0.1558
Mantel-DTLF (Uniform)    1.341   0.2469
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
SC1    0.257   0.6122
SC4    12.167  0.0005  *
SC13   0.000   1.0000
SC14   0.121   0.7280
SC20   0.162   0.6873
SC22   1.980   0.1594
SC23   0.157   0.6919
SC25   0.050   0.8231

TESTLET 24
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  3.332   0.0679
LDFA-DTLF (Uniform)      2.199   0.1381
Mantel-DTLF (Uniform)    3.848   0.0498
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
SC2    1.060   0.3032
SC5    0.065   0.7988
SC6    0.009   0.9244
SC8    3.866   0.0493
SC9    0.628   0.4281
SC10   0.002   0.9643
SC12   1.756   0.1851
SC15   0.279   0.5974
SC18   0.198   0.6563
SC19   1.641   0.2002
TESTLET 25
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  5.612   0.0178
LDFA-DTLF (Uniform)      0.046   0.8302
Mantel-DTLF (Uniform)    0.036   0.8495
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
SC3    1.065   0.3021
SC7    2.043   0.1529
SC16   1.514   0.2185
SC17   0.004   0.9496
SC21   0.703   0.4018
SC24   0.004   0.950

TESTLET 26
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.179   0.6722
LDFA-DTLF (Uniform)      1.537   0.2151
Mantel-DTLF (Uniform)    0.409   0.5225
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
SS5    1.123   0.2893
SS6    0.430   0.5120
SS7    0.052   0.8196
SS8    0.007   0.9333
SS9    0.063   0.8018

TESTLET 27
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.055   0.3044
LDFA-DTLF (Uniform)      0.354   0.5519
Mantel-DTLF (Uniform)    0.325   0.5686
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
SS1    1.742   0.1869
SS12   3.241   0.0718
SS26   4.003   0.0454

TESTLET 28
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.376   0.2408
LDFA-DTLF (Uniform)      5.384   0.0203
Mantel-DTLF (Uniform)    3.871   0.0491
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
SS2    3.221   0.0727
SS4    0.755   0.3849
SS10   2.629   0.1049
SS11   0.763   0.3824
SS13   0.994   0.3188
SS14   1.987   0.1587
SS17   0.214   0.6437
SS18   0.029   0.8648
SS20   0.119   0.7301
SS21   26.908  0.0000  *
SS25   1.120   0.2899
SS27   0.237   0.6264
SS28   4.326   0.0375
SS29   0.311   0.5771
TESTLET 29
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  2.856   0.0910
LDFA-DTLF (Uniform)      3.837   0.0501
Mantel-DTLF (Uniform)    1.811   0.1784
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
SS3    0.026   0.8719
SS5    1.123   0.2893
SS6    0.430   0.5120
SS7    0.052   0.8196
SS8    0.007   0.9333
SS9    0.063   0.8018
SS15   0.153   0.6957
SS16   0.007   0.9333
SS19   0.031   0.8602
SS22   0.480   0.4884
SS23   0.336   0.5621
SS24   3.834   0.0502
SS30   1.268   0.2601
* p < .001
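The Prob column can be reproduced from the Chi-Sq column with a short calculation. The sketch below assumes that each statistic is referred to a chi-square distribution with one degree of freedom, an assumption that agrees with the tabled pairs (for example, 6.468 with 0.0110 and 3.866 with 0.0493). The function names are hypothetical.

```python
import math


def chi_square_p_value_1df(chi_sq):
    """Upper-tail probability of a chi-square statistic on one degree of freedom."""
    return math.erfc(math.sqrt(chi_sq / 2.0))


def is_flagged(chi_sq, alpha=0.001):
    """Apply the p < .001 criterion marked by the asterisk in the Sig column."""
    return chi_square_p_value_1df(chi_sq) < alpha


p_ma2 = chi_square_p_value_1df(6.468)   # item MA2, N = 500
p_sc8 = chi_square_p_value_1df(3.866)   # item SC8, N = 500
sc4_flagged = is_flagged(12.167)        # item SC4, N = 500
ma2_flagged = is_flagged(6.468)
```

The identity used here holds only for one degree of freedom; a general chi-square tail probability would need an incomplete gamma function.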
SUMMARY (N=1000)
TESTLET 1
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.027   0.8695
LDFA-DTLF (Uniform)      0.108   0.742
Mantel-DTLF (Uniform)    0.034   0.8537
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE1    0.993   0.3191
RE2    0.414   0.5201
RE3    0.484   0.4867
RE4    0.196   0.6582
RE5    0.508   0.4760

TESTLET 2
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  6.741   0.0094
LDFA-DTLF (Uniform)      7.064   0.0079
Mantel-DTLF (Uniform)    5.864   0.0155
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE6    7.9010  0.0049
RE7    0.0000  1.0000
RE8    1.8898  0.1692

TESTLET 3
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  6.243   0.0125
LDFA-DTLF (Uniform)      0.075   0.7842
Mantel-DTLF (Uniform)    0.131   0.7174
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE9    1.360   0.2435
RE10   0.195   0.6585
RE11   0.760   0.3832
RE12   4.938   0.0263
RE13   4.398   0.0360
RE14   0.165   0.6842

TESTLET 4
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  3.342   0.0675
LDFA-DTLF (Uniform)      0.090   0.7642
Mantel-DTLF (Uniform)    0.236   0.6269
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE15   0.009   0.9236
RE16   0.313   0.5758
RE17   0.164   0.6858
RE18   0.319   0.5723

TESTLET 5
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.003   0.9563
LDFA-DTLF (Uniform)      3.199   0.0737
Mantel-DTLF (Uniform)    3.041   0.0812
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE19   0.386   0.5346
RE20   3.764   0.0524
RE21   0.069   0.7924
TESTLET 6
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  4.662   0.0308
LDFA-DTLF (Uniform)      1.631   0.2016
Mantel-DTLF (Uniform)    1.900   0.1681
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE1    0.993   0.3191
RE2    0.414   0.0049
RE3    0.484   0.4866
RE6    7.901   0.0049

TESTLET 7
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  4.092   0.0431
LDFA-DTLF (Uniform)      1.208   0.2717
Mantel-DTLF (Uniform)    1.190   0.2754
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE4    0.196   0.6580
RE5    0.508   0.4760
RE7    0.000   1.0000
RE10   0.195   0.6585
RE11   0.760   0.3833
RE12   4.938   0.0263
RE13   4.398   0.0360
RE14   0.165   0.6846
RE16   0.313   0.5758
RE17   0.164   0.6855
RE18   0.319   0.5722
RE19   0.386   0.5344
RE20   3.764   0.0524
RE21   0.069   0.7928

TESTLET 8
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  2.494   0.1143
LDFA-DTLF (Uniform)      0.030   0.8625
Mantel-DTLF (Uniform)    0.000   0.9862
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
RE8    1.890   0.1692
RE9    1.360   0.2435
RE15   0.009   0.9244

TESTLET 9
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  3.972   0.0463
LDFA-DTLF (Uniform)      0.193   0.6604
Mantel-DTLF (Uniform)    0.209   0.6476
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA2    0.735   0.3914
MA3    1.806   0.1790
TESTLET 10
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  0.125   0.7237
LDFA-DTLF (Uniform)      0.248   0.6185
Mantel-DTLF (Uniform)    0.152   0.6966
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA6    1.635   0.2010
MA7    0.679   0.4099

TESTLET 11
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  1.187   0.2759
LDFA-DTLF (Uniform)      1.994   0.1579
Mantel-DTLF (Uniform)    1.418   0.2337
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA1    1.420   0.2334
MA4    0.331   0.5652
MA7    0.679   0.4099
MA14   0.393   0.5308
MA15   3.266   0.0707
MA26   0.047   0.8293
MA27   3.428   0.0641
MA29   0.871   0.3507
MA34   1.867   0.1718
MA39   1.372   0.2415
MA40   0.059   0.8084

TESTLET 12
Procedure                Chi-Sq  Prob    Sig
LDFA-DTLF (Non-Uniform)  4.084   0.0433
LDFA-DTLF (Uniform)      0.044   0.8339
Mantel-DTLF (Uniform)    0.099   0.7530
Mantel-Haenszel-DIF (Uniform):
Item   Chi-Sq  Prob    Sig
MA2    0.735   0.3914
MA3    1.806   0.1790
MA21   2.983   0.0841
MA24   0.117   0.7329
TESTLET 13 88
LDFA-DTLF (Non-Uniform)
LDFA-DTLF (Uniform)
Mantel-DTLF (Uniform)
Mantel-Haenszel-DIF (Uniform)
Chi-Sq Prob Sig Chi-Sq Prob Sig Chi-Sq Prob |Sig Item Chi-Sq Prob Sig 1.508 0.2194 4.067 0.0437 1.881 0.1702 MAS 5.022 0.0250
MA8 0.145 0.7038 MA9 1.767 0.1837 MA10 2.236 0.1348 MA12 6.136 0.0132 MA13 0.066 0.7967 MA16 1.481 0.2236 MA17 0.155 0.6941 MA18 3.532 0.0602 MA19 4.046 0.0443 MA20 13.339 0.0003 • *
MA22 0.388 0.5335 MA23 3.179 0.0746 MA28 8.339 0.0039 MA30 9.404 0.0022 MA31 0.120 0.7287 MA32 0.079 0.7782 MA33 0.001 0.9748 MA36 1.530 0.2161
TESTLET 14

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 1.866 | 0.1719 | |
| LDFA-DTLF (Uniform) | 2.329 | 0.1270 | |
| Mantel-DTLF (Uniform) | 2.166 | 0.1411 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA6 | 1.635 | 0.2010 | |
| MA35 | 0.387 | 0.5339 | |

TESTLET 15

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 0.031 | 0.8602 | |
| LDFA-DTLF (Uniform) | 16.442 | [illegible] | |
| Mantel-DTLF (Uniform) | 15.496 | 0.0001 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA11 | 2.658 | 0.1030 | |
| MA25 | 13.298 | 0.0003 | * |
| MA37 | 3.044 | 0.0810 | |
| MA38 | 0.742 | 0.3891 | |

TESTLET 16

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 1.098 | 0.2947 | |
| LDFA-DTLF (Uniform) | 0.110 | 0.7401 | |
| Mantel-DTLF (Uniform) | 0.211 | 0.6457 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA1 | 1.420 | 0.2334 | |
| MA3 | 1.806 | 0.1790 | |
| MA5 | 5.022 | 0.0250 | |
| MA6 | 1.635 | 0.2010 | |
| MA8 | 0.145 | 0.7034 | |
| MA9 | 1.767 | 0.1837 | |
| MA12 | 6.136 | 0.0132 | |
| MA13 | 0.006 | 0.9362 | |
| MA15 | 3.266 | 0.0707 | |
| MA16 | 1.481 | 0.2236 | |
| MA17 | 0.155 | 0.6941 | |
| MA18 | 3.532 | 0.0602 | |
| MA19 | 4.046 | 0.0443 | |
| MA22 | 0.388 | 0.5335 | |
| MA25 | 13.298 | 0.0003 | * |
| MA34 | 1.867 | 0.1718 | |
| MA40 | 0.059 | 0.8084 | |

TESTLET 17

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 2.114 | 0.1460 | |
| LDFA-DTLF (Uniform) | 0.889 | 0.3457 | |
| Mantel-DTLF (Uniform) | 0.893 | 0.3447 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA2 | 0.735 | 0.3914 | |
| MA4 | 0.331 | 0.5652 | |
| MA7 | 0.679 | 0.4099 | |
| MA10 | 2.236 | 0.1348 | |
| MA11 | 2.658 | 0.1030 | |
| MA14 | 0.393 | 0.5307 | |
| MA20 | 13.339 | 0.0003 | * |
| MA21 | 2.983 | 0.0841 | |
| MA24 | 0.117 | 0.7329 | |
| MA26 | 0.047 | 0.8293 | |
| MA27 | 3.428 | 0.0641 | |
| MA29 | 0.871 | 0.3507 | |
| MA31 | 0.120 | 0.7287 | |
| MA32 | 0.079 | 0.7782 | |
| MA33 | 0.001 | 0.9748 | |
| MA36 | 1.530 | 0.2161 | |
| MA37 | 3.044 | 0.0810 | |
| MA38 | 0.742 | 0.3890 | |
| MA39 | 1.372 | 0.2415 | |

TESTLET 18

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 2.262 | 0.1326 | |
| LDFA-DTLF (Uniform) | 3.890 | 0.0486 | |
| Mantel-DTLF (Uniform) | 3.183 | 0.0744 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA23 | 3.179 | 0.0746 | |
| MA28 | 8.339 | 0.0039 | |
| MA30 | 9.404 | 0.0022 | |
| MA35 | 0.387 | 0.5339 | |

TESTLET 19

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 18.912 | 0.0000 | * |
| LDFA-DTLF (Uniform) | 16.044 | 0.0001 | * |
| Mantel-DTLF (Uniform) | 16.048 | 0.0001 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC1 | 1.251 | 0.2634 | |
| SC2 | 1.425 | 0.2326 | |
| SC5 | 0.001 | 0.9748 | |
| SC7 | 4.245 | 0.0394 | |
| SC8 | 2.681 | 0.1016 | |
| SC12 | 3.357 | 0.0669 | |
| SC18 | 0.694 | 0.4048 | |
| SC21 | 1.359 | 0.2437 | |

TESTLET 20

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 3.900 | 0.0483 | |
| LDFA-DTLF (Uniform) | 1.801 | 0.1796 | |
| Mantel-DTLF (Uniform) | 1.094 | 0.2956 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC3 | 5.699 | 0.0170 | |
| SC10 | 0.003 | 0.9563 | |
| SC11 | 0.681 | 0.4092 | |
| SC14 | 0.004 | 0.9496 | |
| SC19 | 5.372 | 0.0205 | |
| SC20 | 0.056 | 0.8136 | |
| SC23 | 1.308 | 0.2528 | |

TESTLET 21

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 11.756 | 0.0006 | |
| LDFA-DTLF (Uniform) | 11.025 | 0.0009 | |
| Mantel-DTLF (Uniform) | 11.505 | 0.0007 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC4 | 17.640 | 0.0000 | |
| SC24 | 0.005 | 0.9436 | |

TESTLET 22

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 4.293 | 0.0383 | |
| LDFA-DTLF (Uniform) | 13.271 | 0.0003 | * |
| Mantel-DTLF (Uniform) | 9.852 | 0.0017 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC6 | 0.861 | 0.3535 | |
| SC9 | 0.001 | 0.9748 | |
| SC13 | 2.593 | 0.1073 | |
| SC15 | 10.161 | 0.0014 | |
| SC16 | 1.693 | 0.1932 | |
| SC17 | 0.001 | 0.9748 | |
| SC22 | 2.728 | 0.0986 | |
| SC25 | 0.806 | 0.3693 | |

TESTLET 23

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 4.268 | [illegible] | |
| LDFA-DTLF (Uniform) | 2.436 | 0.1186 | |
| Mantel-DTLF (Uniform) | 2.673 | 0.1020 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC1 | 1.251 | 0.2634 | |
| SC4 | 17.640 | 0.0000 | * |
| SC13 | 2.593 | 0.1073 | |
| SC14 | 0.004 | 0.9496 | |
| SC20 | 0.056 | 0.8129 | |
| SC22 | 2.728 | 0.0986 | |
| SC23 | 1.308 | 0.2528 | |
| SC25 | 0.806 | 0.3693 | |

TESTLET 24

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 11.809 | 0.0006 | |
| LDFA-DTLF (Uniform) | 1.719 | 0.1898 | |
| Mantel-DTLF (Uniform) | 2.061 | 0.1511 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC2 | 1.425 | 0.2326 | |
| SC5 | 0.001 | 0.9748 | |
| SC6 | 0.861 | 0.3535 | |
| SC8 | 2.681 | 0.1016 | |
| SC9 | 0.001 | 0.9748 | |
| SC10 | 0.003 | 0.9563 | |
| SC12 | 3.357 | 0.0669 | |
| SC15 | 10.161 | 0.0014 | |
| SC18 | 0.694 | 0.4048 | |
| SC19 | 5.372 | 0.0205 | |

TESTLET 25

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 11.478 | 0.0007 | |
| LDFA-DTLF (Uniform) | 0.044 | 0.8339 | |
| Mantel-DTLF (Uniform) | 0.026 | 0.8721 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC3 | 5.699 | 0.0170 | |
| SC7 | 4.245 | 0.0394 | |
| SC16 | 1.693 | 0.1932 | |
| SC17 | 0.001 | 0.9748 | |
| SC21 | 0.005 | 0.9436 | |
| SC24 | 0.005 | 0.9436 | |

TESTLET 26

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 4.460 | 0.0347 | |
| LDFA-DTLF (Uniform) | 10.277 | 0.0013 | |
| Mantel-DTLF (Uniform) | 7.032 | 0.0080 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SS5 | 5.372 | 0.0205 | |
| SS6 | 0.360 | 0.5485 | |
| SS7 | 6.295 | 0.0121 | |
| SS8 | 0.385 | 0.5347 | |
| SS9 | 5.178 | 0.0229 | |

TESTLET 27

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 5.052 | 0.0246 | |
| LDFA-DTLF (Uniform) | 1.122 | 0.2895 | |
| Mantel-DTLF (Uniform) | 1.984 | 0.1590 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SS1 | 1.072 | [illegible] | |
| SS12 | 12.486 | 0.0004 | |
| SS26 | 4.630 | 0.0314 | |

TESTLET 28

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 5.044 | 0.0247 | |
| LDFA-DTLF (Uniform) | 18.442 | 0.0000 | * |
| Mantel-DTLF (Uniform) | 7.765 | 0.0053 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SS2 | 1.399 | 0.2369 | |
| SS4 | [illegible] | [illegible] | |
| SS10 | 0.041 | 0.8395 | |
| SS11 | 9.806 | 0.0017 | |
| SS13 | 1.446 | 0.2292 | |
| SS14 | 0.171 | 0.6792 | |
| SS17 | 0.025 | 0.8744 | |
| SS18 | 0.476 | 0.4902 | |
| SS20 | 1.556 | 0.2123 | |
| SS21 | 30.096 | 0.0000 | |
| SS25 | 4.742 | 0.0294 | |
| SS27 | 0.714 | 0.3982 | |
| SS28 | 1.892 | 0.1690 | |
| SS29 | 0.976 | 0.3233 | |

TESTLET 29

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 10.565 | 0.0012 | |
| LDFA-DTLF (Uniform) | 13.402 | 0.0003 | |
| Mantel-DTLF (Uniform) | 10.120 | 0.0015 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SS3 | 0.490 | 0.4841 | |
| SS5 | 5.372 | 0.0205 | |
| SS6 | 0.360 | 0.5485 | |
| SS7 | 6.295 | 0.0121 | |
| SS8 | 0.385 | 0.5349 | |
| SS9 | 5.178 | 0.0229 | |
| SS15 | 1.540 | 0.2146 | |
| SS16 | 0.397 | 0.5287 | |
| SS19 | 0.059 | 0.8086 | |
| SS22 | 2.852 | 0.0913 | |
| SS23 | 1.906 | 0.1674 | |
| SS24 | 10.662 | 0.0011 | |
| SS30 | 2.401 | 0.1213 | |

* p < .001
SUMMARY (N=2000)
TESTLET 1

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 1.168 | 0.2798 | |
| LDFA-DTLF (Uniform) | 1.344 | 0.2463 | |
| Mantel-DTLF (Uniform) | 0.995 | 0.3184 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| RE1 | 3.638 | 0.0565 | |
| RE2 | 0.001 | 0.9748 | |
| RE3 | 1.279 | 0.2580 | |
| RE4 | 3.937 | 0.0472 | |
| RE5 | 3.489 | 0.0618 | |

TESTLET 2

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 12.647 | 0.0004 | * |
| LDFA-DTLF (Uniform) | 26.546 | 0.0000 | * |
| Mantel-DTLF (Uniform) | 22.254 | 0.0000 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| RE6 | 14.995 | 0.0001 | * |
| RE7 | 1.630 | 0.2017 | |
| RE8 | 10.171 | 0.0014 | |

TESTLET 3

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 16.432 | 0.0001 | |
| LDFA-DTLF (Uniform) | 0.112 | 0.7379 | |
| Mantel-DTLF (Uniform) | 0.350 | 0.5541 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| RE9 | 0.868 | 0.3516 | |
| RE10 | 0.524 | 0.4692 | |
| RE11 | 5.551 | 0.0185 | |
| RE12 | 14.701 | 0.0001 | |
| RE13 | 12.278 | 0.0005 | |
| RE14 | 0.831 | 0.3619 | |

TESTLET 4

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 7.836 | 0.0051 | |
| LDFA-DTLF (Uniform) | 2.614 | 0.1059 | |
| Mantel-DTLF (Uniform) | 3.454 | 0.0631 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| RE15 | 3.323 | 0.0683 | |
| RE16 | 4.439 | 0.0351 | |
| RE17 | 0.147 | 0.7011 | |
| RE18 | 0.054 | 0.8159 | |

TESTLET 5

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 0.001 | 0.9748 | |
| LDFA-DTLF (Uniform) | 3.445 | 0.0634 | |
| Mantel-DTLF (Uniform) | 3.986 | 0.0459 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| RE19 | 0.163 | 0.6866 | |
| RE20 | 7.800 | 0.0052 | |
| RE21 | 0.055 | 0.8146 | |

TESTLET 6

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 15.633 | 0.0001 | |
| LDFA-DTLF (Uniform) | 0.962 | 0.3267 | |
| Mantel-DTLF (Uniform) | 1.568 | 0.2105 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| RE1 | 3.638 | 0.0565 | |
| RE2 | 0.001 | 0.9748 | |
| RE3 | 1.279 | 0.2580 | |
| RE6 | 14.995 | 0.0001 | |

TESTLET 7

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 9.8 | 0.0017 | |
| LDFA-DTLF (Uniform) | 1.415 | 0.2342 | |
| Mantel-DTLF (Uniform) | 2.490 | 0.1146 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| RE4 | 3.937 | 0.0472 | |
| RE5 | 3.489 | 0.0618 | |
| RE7 | 1.630 | 0.2017 | |
| RE10 | 0.524 | 0.4691 | |
| RE11 | 5.551 | 0.0185 | |
| RE12 | 14.701 | 0.0001 | |
| RE13 | 12.278 | 0.0005 | |
| RE14 | 0.831 | 0.3620 | |
| RE16 | 4.439 | 0.0351 | |
| RE17 | 0.147 | 0.7014 | |
| RE18 | 0.054 | 0.8162 | |
| RE19 | 0.163 | 0.6864 | |
| RE20 | 7.8 | 0.0052 | |
| RE21 | 0.055 | 0.8146 | |

TESTLET 8

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 7.854 | 0.0051 | |
| LDFA-DTLF (Uniform) | 0.322 | 0.5704 | |
| Mantel-DTLF (Uniform) | 0.054 | 0.8159 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| RE8 | 10.171 | 0.0014 | |
| RE9 | 0.868 | 0.3516 | |
| RE15 | 3.323 | 0.0683 | |

TESTLET 9

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 11.815 | 0.0006 | |
| LDFA-DTLF (Uniform) | 0.274 | 0.6007 | |
| Mantel-DTLF (Uniform) | 0.513 | 0.4738 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA2 | 1.1247 | 0.2889 | |
| MA3 | 3.392 | 0.0655 | |

TESTLET 10

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 1.841 | 0.1748 | |
| LDFA-DTLF (Uniform) | 0.61 | 0.4348 | |
| Mantel-DTLF (Uniform) | 1.295 | 0.2551 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA6 | 0.0089 | 0.9248 | |
| MA7 | 2.171 | 0.1407 | |

TESTLET 11

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 7.532 | 0.0061 | |
| LDFA-DTLF (Uniform) | 3.292 | 0.0696 | |
| Mantel-DTLF (Uniform) | 3.799 | 0.0513 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA1 | 3.390 | 0.0656 | |
| MA4 | 0.092 | 0.7619 | |
| MA7 | 2.171 | 0.1406 | |
| MA14 | 6.116 | 0.0134 | |
| MA15 | 9.990 | 0.0016 | |
| MA26 | 0.674 | 0.4118 | |
| MA27 | 3.686 | 0.0549 | |
| MA29 | 0.036 | 0.8499 | |
| MA34 | 8.290 | 0.0040 | |
| MA39 | 2.181 | 0.1397 | |
| MA40 | 0.003 | 0.9571 | |

TESTLET 12

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 14.158 | 0.0002 | * |
| LDFA-DTLF (Uniform) | 0.146 | 0.7024 | |
| Mantel-DTLF (Uniform) | 0.1371 | 0.7112 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA2 | 1.125 | 0.2889 | |
| MA3 | 3.392 | 0.0655 | |
| MA21 | 7.080 | 0.0078 | |
| MA24 | 0.614 | 0.4335 | |

TESTLET 13

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 8.426 | 0.0037 | |
| LDFA-DTLF (Uniform) | 9.024 | 0.0027 | |
| Mantel-DTLF (Uniform) | 5.5076 | 0.0189 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA5 | 6.746 | 0.0094 | |
| MA8 | 0.111 | 0.7391 | |
| MA9 | 6.881 | 0.0087 | |
| MA10 | 2.955 | 0.0856 | |
| MA12 | 12.417 | 0.0004 | * |
| MA13 | 2.433 | 0.1188 | |
| MA16 | 2.511 | 0.1131 | |
| MA17 | 0.191 | 0.6624 | |
| MA18 | 1.191 | 0.2752 | |
| MA19 | 10.372 | 0.0013 | |
| MA20 | 16.525 | 0.0000 | * |
| MA22 | 1.016 | 0.3135 | |
| MA23 | 1.800 | 0.1797 | |
| MA28 | 21.591 | 0.0000 | * |
| MA30 | 9.772 | 0.0018 | |
| MA31 | 0.099 | 0.7528 | |
| MA32 | 0.001 | 0.9712 | |
| MA33 | 0.232 | 0.6304 | |
| MA36 | 0.119 | 0.7297 | |

TESTLET 14

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 6.611 | 0.0101 | |
| LDFA-DTLF (Uniform) | 3.698 | 0.0545 | |
| Mantel-DTLF (Uniform) | 2.875 | 0.0900 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA6 | 0.0089 | 0.9248 | |
| MA35 | 6.156 | 0.0131 | |

TESTLET 15

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 2.132 | 0.1443 | |
| LDFA-DTLF (Uniform) | 33.237 | 0.0000 | * |
| Mantel-DTLF (Uniform) | 29.412 | 0.0000 | * |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA11 | 5.360 | 0.0206 | |
| MA25 | 15.162 | 0.0001 | * |
| MA37 | 8.462 | 0.0036 | |
| MA38 | 3.199 | 0.0737 | |

TESTLET 16

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 7.635 | 0.0057 | |
| LDFA-DTLF (Uniform) | 1.514 | 0.2185 | |
| Mantel-DTLF (Uniform) | 1.437 | 0.2306 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA1 | 3.390 | 0.0656 | |
| MA3 | 3.392 | 0.0655 | |
| MA5 | 6.746 | 0.0094 | |
| MA6 | 0.009 | 0.9248 | |
| MA8 | 0.111 | 0.7391 | |
| MA9 | 6.881 | 0.0087 | |
| MA12 | 12.417 | 0.0004 | * |
| MA13 | 2.433 | 0.1188 | |
| MA15 | 9.990 | 0.0016 | |
| MA16 | 2.511 | 0.1131 | |
| MA17 | 0.191 | 0.6624 | |
| MA18 | 1.191 | 0.2752 | |
| MA19 | 10.372 | 0.0013 | |
| MA22 | 1.016 | 0.3135 | |
| MA25 | 15.162 | 0.0001 | |
| MA34 | 8.290 | 0.0040 | |
| MA40 | 0.003 | 0.9571 | |

TESTLET 17

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 11.225 | 0.0008 | |
| LDFA-DTLF (Uniform) | 6.767 | 0.0093 | |
| Mantel-DTLF (Uniform) | 3.699 | 0.0544 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA2 | 1.125 | 0.2889 | |
| MA4 | 0.092 | 0.7619 | |
| MA7 | 2.171 | 0.1407 | |
| MA10 | 2.955 | 0.0856 | |
| MA11 | 5.360 | 0.0206 | |
| MA14 | 6.116 | 0.0134 | |
| MA20 | 16.525 | 0.0000 | * |
| MA21 | 7.080 | 0.0078 | |
| MA24 | 0.614 | 0.4335 | |
| MA26 | 0.674 | 0.4118 | |
| MA27 | 3.686 | 0.0549 | |
| MA29 | 0.036 | 0.8499 | |
| MA31 | 0.099 | 0.7528 | |
| MA32 | 0.001 | 0.9712 | |
| MA33 | 0.232 | 0.6304 | |
| MA36 | 0.119 | 0.7297 | |
| MA37 | 8.462 | 0.0036 | |
| MA38 | 3.199 | 0.0737 | |
| MA39 | 2.181 | 0.1397 | |

TESTLET 18

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 9.236 | 0.0024 | |
| LDFA-DTLF (Uniform) | 4.456 | 0.0348 | |
| Mantel-DTLF (Uniform) | 3.771 | 0.0521 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| MA23 | 1.800 | 0.1797 | |
| MA28 | 21.591 | 0.0000 | |
| MA30 | 9.772 | 0.0018 | |
| MA35 | 0.099 | 0.7528 | |

TESTLET 19

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 45.656 | 0.0000 | * |
| LDFA-DTLF (Uniform) | 29.532 | 0.0000 | * |
| Mantel-DTLF (Uniform) | 26.839 | 0.0000 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC1 | 0.092 | 0.7616 | |
| SC2 | 1.039 | 0.3081 | |
| SC5 | 0.263 | 0.6081 | |
| SC7 | 8.857 | 0.0029 | |
| SC8 | 6.878 | 0.0087 | |
| SC12 | 10.901 | 0.0010 | |
| SC18 | 4.127 | 0.0422 | |
| SC21 | 1.385 | 0.2393 | |

TESTLET 20

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 13.756 | 0.0002 | |
| LDFA-DTLF (Uniform) | 2.8 | 0.0943 | |
| Mantel-DTLF (Uniform) | 1.213 | 0.2707 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC3 | 5.343 | 0.0208 | |
| SC10 | 0.014 | 0.9058 | |
| SC11 | 3.489 | 0.0618 | |
| SC14 | 0.465 | 0.4953 | |
| SC19 | 4.318 | 0.0377 | |
| SC20 | 0.075 | 0.7842 | |
| SC23 | 0.002 | 0.9643 | |

TESTLET 21

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 15.045 | 0.0001 | |
| LDFA-DTLF (Uniform) | 7.665 | 0.0056 | |
| Mantel-DTLF (Uniform) | 8.617 | 0.0033 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC4 | 16.941 | 0.0000 | |
| SC24 | 0.049 | 0.8248 | |

TESTLET 22

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 14.272 | 0.0002 | * |
| LDFA-DTLF (Uniform) | 32.702 | 0.0000 | |
| Mantel-DTLF (Uniform) | 27.437 | 0.0000 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC6 | 3.544 | 0.0598 | |
| SC9 | 0.003 | 0.9563 | |
| SC13 | 12.118 | 0.0005 | * |
| SC15 | 14.477 | 0.0001 | * |
| SC16 | 6.666 | 0.0098 | |
| SC17 | 0.174 | 0.6766 | |
| SC22 | 6.258 | 0.0124 | |
| SC25 | 0.216 | 0.6421 | |

TESTLET 23

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 16.097 | 0.0001 | * |
| LDFA-DTLF (Uniform) | 9.688 | 0.0019 | |
| Mantel-DTLF (Uniform) | 10.159 | 0.0014 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC1 | 0.092 | 0.7616 | |
| SC4 | 16.941 | 0.0000 | |
| SC13 | 12.118 | 0.0005 | * |
| SC14 | 0.465 | 0.4953 | |
| SC20 | 0.075 | 0.7842 | |
| SC22 | 6.258 | 0.0124 | |
| SC23 | 0.002 | 0.9643 | |
| SC25 | 0.216 | 0.6421 | |

TESTLET 24

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 26.643 | 0.0000 | |
| LDFA-DTLF (Uniform) | 3.58 | 0.0585 | |
| Mantel-DTLF (Uniform) | 3.287 | 0.0698 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC2 | 1.039 | 0.3081 | |
| SC5 | 0.263 | 0.6081 | |
| SC6 | 3.544 | 0.0598 | |
| SC8 | 6.878 | 0.0087 | |
| SC9 | 0.003 | 0.9563 | |
| SC10 | 0.014 | 0.9058 | |
| SC12 | 10.901 | 0.0010 | * |
| SC15 | 14.477 | 0.0001 | |
| SC18 | 4.127 | 0.0422 | |
| SC19 | 4.318 | 0.0377 | |

TESTLET 25

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 26.963 | 0.0000 | |
| LDFA-DTLF (Uniform) | 0.092 | 0.7616 | |
| Mantel-DTLF (Uniform) | 0 | 1.0000 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SC3 | 5.343 | 0.0208 | |
| SC7 | 8.857 | 0.0029 | |
| SC16 | 6.666 | 0.0098 | |
| SC17 | 0.174 | 0.6766 | |
| SC21 | 1.385 | 0.2393 | |
| SC24 | 0.049 | 0.8248 | |

TESTLET 26

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 4.608 | 0.0318 | |
| LDFA-DTLF (Uniform) | 18.385 | 0.0000 | |
| Mantel-DTLF (Uniform) | 11.787 | 0.0006 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SS5 | 4.966 | 0.0259 | |
| SS6 | 3.613 | 0.0573 | |
| SS7 | 7.163 | 0.0074 | |
| SS8 | 1.877 | 0.1707 | |
| SS9 | 5.8246 | 0.0158 | |

TESTLET 27

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 10.573 | 0.0011 | |
| LDFA-DTLF (Uniform) | 3.293 | 0.0696 | |
| Mantel-DTLF (Uniform) | 4.655 | 0.0310 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SS1 | 2.621 | 0.1055 | |
| SS12 | 20.498 | 0.0000 | |
| SS26 | 5.296 | 0.0214 | |

TESTLET 28

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 12.141 | 0.0005 | |
| LDFA-DTLF (Uniform) | 35.608 | 0.0000 | |
| Mantel-DTLF (Uniform) | 15.842 | 0.0001 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SS2 | 0.181 | 0.6704 | |
| SS4 | 2.667 | 0.1025 | |
| SS10 | 0.213 | 0.6444 | |
| SS11 | 12.893 | 0.0003 | |
| SS13 | 6.154 | 0.0131 | |
| SS14 | 0 | 1.0000 | |
| SS17 | 1.78 | 0.1821 | |
| SS18 | 0.156 | 0.6929 | |
| SS20 | 0.111 | 0.7390 | |
| SS21 | 54.491 | 0.0000 | |
| SS25 | 3.486 | 0.0619 | |
| SS27 | 2.468 | 0.1162 | |
| SS28 | 10.486 | 0.0012 | |
| SS29 | 0.016 | 0.8993 | |

TESTLET 29

| Procedure / Item | Chi-Sq | Prob | Sig |
|---|---|---|---|
| LDFA-DTLF (Non-Uniform) | 17.793 | 0.0000 | |
| LDFA-DTLF (Uniform) | 23.936 | 0.0000 | |
| Mantel-DTLF (Uniform) | 18.161 | 0.0000 | |
| Mantel-Haenszel-DIF (Uniform), by item | | | |
| SS3 | 2.922 | 0.0874 | |
| SS5 | 4.966 | 0.0259 | |
| SS6 | 3.613 | 0.0573 | |
| SS7 | 7.163 | 0.0074 | |
| SS8 | 1.877 | 0.1707 | |
| SS9 | 5.8246 | 0.0158 | |
| SS15 | 0.7521 | 0.3858 | |
| SS16 | 0.3 | 0.5839 | |
| SS19 | 4.263 | 0.0390 | |
| SS22 | 10.158 | 0.0014 | |
| SS23 | 0.172 | 0.6783 | |
| SS24 | 10.703 | 0.0011 | |
| SS30 | 5.456 | 0.0195 | |

* p < .001
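
The Prob entries in these summaries appear consistent with upper-tail probabilities of a chi-square distribution with 1 degree of freedom (for example, a Chi-Sq of 1.635 is listed with Prob 0.2010). The following Python sketch shows that conversion and the p < .001 flag used in the Sig column. It is an illustration only: the function names are hypothetical, and the 1-degree-of-freedom assumption is inferred from the tabled values rather than stated in the tables.

```python
import math


def chi_square_1df_p_value(chi_sq):
    """Upper-tail probability of a chi-square statistic, assuming 1 df.

    For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_sq / 2.0))


def sig_flag(p_value, alpha=0.001):
    """Return '*' when p < alpha (the table footnote uses p < .001)."""
    return "*" if p_value < alpha else ""


if __name__ == "__main__":
    # Chi-Sq values taken from the summary tables.
    for chi_sq in (1.635, 0.679, 13.339):
        p = chi_square_1df_p_value(chi_sq)
        print(f"{chi_sq:8.3f} {p:7.4f} {sig_flag(p)}")
```

Illegible Prob cells in the scanned tables could be checked against this conversion, although any such value would be a recomputation and not a reading of the original page.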
APPENDIX D
SPSS SAMPLE PROGRAM FOR LOGISTIC DISCRIMINANT FUNCTION ANALYSIS
Sample SPSS Program for LDFA

    TITLE 'LDFA, SAMPLE=500, GROUP=GENDER'
    GET FILE='SAMP500 SYS A'
    COMPUTE OBSCR=RESCORE
    IF GENDER=1 GROUP=1
    IF GENDER=2 GROUP=0
    ADD VALUE LABELS GROUP 0 'FEMALE' 1 'MALE'
    ****** FIRST TESTLET
    COMPUTE TLTSCR=T1
    COMPUTE INTERACT=OBSCR*TLTSCR
    *** CALCULATE MEANS ****
    COMPUTE TEMPVAR=1
    AGGREGATE OUTFILE='TEMP MEAN A'
      /BREAK=TEMPVAR
      /MEANOBS=MEAN(OBSCR)
      /MEANTLT=MEAN(TLTSCR)
    MATCH FILES FILE=*
      /TABLE='TEMP MEAN A'
      /BY TEMPVAR
    ******* CENTER THE VARIABLES TO REDUCE COLLINEARITY
    ******* BY SUBTRACTING THE MEAN
    COMPUTE CENOBS=OBSCR-MEANOBS
    COMPUTE CENTLT=TLTSCR-MEANTLT
    COMPUTE CENINT=CENOBS * CENTLT
    *** LOGISTIC DISCRIMINANT FUNCTION ANALYSIS
    LOGISTIC REGRESSION GROUP WITH CENOBS CENTLT CENINT
      /METHOD=ENTER CENOBS
      /METHOD=ENTER CENTLT
      /METHOD=ENTER CENINT
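
As a rough cross-check of what the three METHOD=ENTER steps compute, the following Python sketch fits the same nested logistic models for group membership (centered matching score; plus centered testlet score; plus their product) and reports the change in -2 log likelihood at each step. It assumes that these step-wise likelihood-ratio chi-squares, each with 1 degree of freedom, correspond to the LDFA-DTLF (Uniform) and (Non-Uniform) statistics; the function names and the small Newton-Raphson fitter are illustrative and are not part of the original program.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def log_likelihood(rows, y, max_iter=50):
    """Maximized log likelihood of a logistic regression (Newton-Raphson)."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        grad = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(rows, y):
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for i in range(p):
                grad[i] += (yi - mu) * xi[i]
                for j in range(p):
                    info[i][j] += mu * (1.0 - mu) * xi[i] * xi[j]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    ll = 0.0
    for xi, yi in zip(rows, y):
        mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
        ll += math.log(mu) if yi == 1 else math.log(1.0 - mu)
    return ll


def ldfa_chi_squares(group, obscr, tltscr):
    """Return (uniform, non_uniform) likelihood-ratio chi-squares."""
    n = len(group)
    mean_obs = sum(obscr) / n
    mean_tlt = sum(tltscr) / n
    cenobs = [o - mean_obs for o in obscr]      # centered matching score
    centlt = [t - mean_tlt for t in tltscr]     # centered testlet score
    cenint = [o * t for o, t in zip(cenobs, centlt)]
    step1 = [[1.0, o] for o in cenobs]
    step2 = [[1.0, o, t] for o, t in zip(cenobs, centlt)]
    step3 = [[1.0, o, t, i] for o, t, i in zip(cenobs, centlt, cenint)]
    ll1, ll2, ll3 = (log_likelihood(s, group) for s in (step1, step2, step3))
    return 2.0 * (ll2 - ll1), 2.0 * (ll3 - ll2)
```

Called with three equal-length lists (group coded 0/1, matching score, testlet score), the function returns the two chi-squares in the order uniform, non-uniform. The fitter has no safeguards for separation or extreme scores, so it is suitable only for small illustrative data sets.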
APPENDIX E
SAS SAMPLE PROGRAM FOR MANTEL SCORE TEST PROCEDURE
Sample SAS Program for Mantel Procedure

    DATA;
      INFILE 'TLT02000 ASC A';
      INPUT GROUP RESTRATA T1-T8 MASTRATA T9-T18
            SCSTRATA T19-T25 SSSTRATA T26-T29;
    PROC FREQ;
      TABLES SSSTRATA * GROUP * T26 / NOPRINT CMH;
      TABLES SSSTRATA * GROUP * T27 / NOPRINT CMH;
      TABLES SSSTRATA * GROUP * T28 / NOPRINT CMH;
      TABLES SSSTRATA * GROUP * T29 / NOPRINT CMH;
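
For readers without SAS, the statistic requested by the CMH option for a stratified group-by-score table can be sketched in Python as follows. The sketch assumes the quantity of interest is the 1-degree-of-freedom Mantel score statistic: the focal group's summed testlet scores are compared with their expectation within each matching stratum, and the squared difference is divided by the summed hypergeometric variances. The function name, the argument layout, and the choice of which group is treated as focal are illustrative assumptions.

```python
def mantel_score_chi_square(strata, scores):
    """Mantel score chi-square (1 df) for stratified group-by-score tables.

    strata: list of (ref_counts, focal_counts); each is a list giving the
            number of examinees at each testlet score within one stratum.
    scores: the numeric score attached to each column, e.g. [0, 1, 2].
    """
    observed = expected = variance = 0.0
    for ref_counts, focal_counts in strata:
        n_ref, n_foc = sum(ref_counts), sum(focal_counts)
        n = n_ref + n_foc
        if n < 2:
            continue  # a stratum with fewer than two examinees adds nothing
        totals = [r + f for r, f in zip(ref_counts, focal_counts)]
        sum_y = sum(y * m for y, m in zip(scores, totals))
        sum_y2 = sum(y * y * m for y, m in zip(scores, totals))
        observed += sum(y * f for y, f in zip(scores, focal_counts))
        expected += n_foc * sum_y / n
        variance += n_ref * n_foc * (n * sum_y2 - sum_y ** 2) / (n * n * (n - 1))
    return (observed - expected) ** 2 / variance


if __name__ == "__main__":
    # Two hypothetical strata for a testlet scored 0, 1, or 2.
    example = [([12, 20, 8], [15, 18, 7]), ([5, 14, 21], [9, 16, 15])]
    print(round(mantel_score_chi_square(example, [0, 1, 2]), 4))
```

The function raises ZeroDivisionError when every stratum has zero variance (for example, when all examinees obtain the same score), which a production version would need to handle.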
APPENDIX F
PASCAL SAMPLE PROGRAM FOR MANTEL-HAENSZEL PROCEDURE
Sample Pascal Program for Mantel-Haenszel Procedure

    Program MantelHaenszel (datafile,outfile,output);
    {
      Calculates Mantel-Haenszel chi-square statistic for
      single-item reading DIF.
    }
    Uses Crt;
    const
      SampSize=500;                    {EDIT CONSTANTS AS NEEDED}
      Dat='\procomm\[email protected]';
      Out='manre.dat';
      Subject='re';
      MaxStrata=5;
      MaxItem=21;
    Type
      FreqArrayType=Array [0..1,0..1,1..MaxStrata,1..MaxItem] of integer;
    Var
      Datafile,Outfile : Text;
      Freq: FreqArrayType;
      i,j,k,s: Integer;                {i=group; j=item score;
                                        k=strata or total score; s=studied item}
      Mantel, Denominator : Real;
    (*******************)
    Procedure GetData (Var Datafile,Outfile:Text; Var Freq:FreqArrayType);
    Var dummy, i, j, k : Integer;
    Begin
      For i:=0 to 1 do                 {Initialize Array}
        For j:=0 to 1 do
          For k:=1 to MaxStrata do
            For s:=1 to MaxItem do
              Freq[i,j,k,s]:=0;
      Assign (Datafile,dat);           {Prepare Files}
      Assign (Outfile,out);
      Reset(Datafile);
      Rewrite(Outfile);
      While not eof (Datafile) do      {Fetch Data}
        Begin
          While not eoln (Datafile) do
            Begin
              Read (Datafile,i,k);
              For s:=1 to MaxItem do
                Begin
                  Read (Datafile, j);
                  Inc(Freq[i,j,k,s]);
                End;
            End;
          Readln (Datafile);
        End;
      { ADD THIS SECTION FOR DEBUGGING
      For s:=1 to MaxItem do
        Begin
          For k:=1 to MaxStrata do
            Begin
              Writeln ('Strata',k:2);
              For i:=0 to 1 do
                Begin
                  For j:=0 to 1 do
                    Write (Output,Freq[i,j,k,s]:4);
                  Writeln;
                End;
              Writeln;
            End;
        End;
      }
    End;
    (*******************)
    Function WtdFocalSum (k,s:Integer):Real;
    Var WFS: Real;
        i,j: Integer;
    Begin
      WFS:=0;
      i:=0;                            {focal group is coded 0}
      For j:=0 to 1 do
        WFS:=WFS + ((j+1) * Freq[i,j,k,s]);
      WtdFocalSum:=WFS;
    End;
    (*******************)
    Function ExpSum (k,s:Integer):Real;
    Var TS, FS, WMS: Real;
        i,j: Integer;
    Begin
      TS:=0; FS:=0; WMS:=0;
      For j:=0 to 1 do
        FS:=FS + Freq[0,j,k,s];
      For i:=0 to 1 do
        For j:=0 to 1 do
          TS:=TS + Freq[i,j,k,s];
      For j:=0 to 1 do
        WMS:=WMS + ((j+1) * (Freq[0,j,k,s] + Freq[1,j,k,s]));
      ExpSum:=FS/TS*WMS;
    End;
    (*******************)
    Function VarSum (k,s:Integer):Real;
    Var RS,FS,TS,SWMS,WMS,VS: Real;
        i,j: Integer;
    Begin
      RS:=0; FS:=0; TS:=0; SWMS:=0; WMS:=0; VS:=0;
      For j:=0 to 1 do
        Begin
          RS:=RS + Freq[0,j,k,s];
          FS:=FS + Freq[1,j,k,s];
          WMS:=WMS + ((j+1) * (Freq[0,j,k,s] + Freq[1,j,k,s]));
          SWMS:=SWMS + ((j+1) * (j+1) * (Freq[0,j,k,s] + Freq[1,j,k,s]));
        End;
      TS:=FS + RS;
      VS:=(TS * SWMS) - (WMS * WMS);
      VarSum:= RS * FS / (TS * TS * (TS - 1)) * VS;
    End;
    (*******************)
    Begin {main}
      ClrScr;
      Writeln ('Input File: ',Dat);
      Writeln ('Output File: ',Out);
      GetData (Datafile,Outfile,Freq);
      For s:=1 to MaxItem do
        Begin
          Mantel:=0.0;
          For k:=1 to MaxStrata do
            Mantel:=Mantel + WtdFocalSum(k,s) - ExpSum(k,s);
          Mantel:=Abs(Mantel) - 0.5;   {correction for single item}
          Mantel:=Mantel * Mantel;
          Denominator:=0.0;
          For k:=1 to MaxStrata do
            Denominator:=Denominator + VarSum(k,s);
          Mantel:=Mantel / Denominator;
          Writeln;
          Write ('For Item':10,s:3);   {Write Mantel to Screen}
          Write (': Mantel-Haenszel Chi Square: ':10);
          Writeln (Mantel:8:4);
          Write (OutFile,Subject,s:1,',');   {Write Mantel to File}
          Writeln (OutFile,Mantel:8:4);
        End;
      Close (OutFile);
    End. {main}
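
The computation in the Pascal listing can be restated compactly. The Python sketch below accumulates, over matching strata, the focal group's observed number of correct responses, its expectation, and the hypergeometric variance, then applies the 0.5 continuity correction used in the listing. The listing scores responses as 1 and 2 (j+1) where this sketch uses 0 and 1; for a dichotomous item that shift leaves the statistic unchanged. The function name and the tuple layout are illustrative assumptions.

```python
def mantel_haenszel_chi_square(strata, correction=0.5):
    """Continuity-corrected Mantel-Haenszel chi-square for one item.

    strata: list of (ref_wrong, ref_right, focal_wrong, focal_right)
            counts, one tuple per matching stratum.
    """
    observed = expected = variance = 0.0
    for ref_wrong, ref_right, foc_wrong, foc_right in strata:
        n_ref = ref_wrong + ref_right
        n_foc = foc_wrong + foc_right
        n = n_ref + n_foc
        if n < 2:
            continue  # a stratum with fewer than two examinees adds nothing
        right = ref_right + foc_right
        wrong = n - right
        observed += foc_right
        expected += n_foc * right / n
        variance += n_ref * n_foc * right * wrong / (n * n * (n - 1))
    return (abs(observed - expected) - correction) ** 2 / variance


if __name__ == "__main__":
    # Hypothetical counts for five score strata of a single item.
    example = [
        (30, 10, 35, 5),
        (25, 20, 30, 15),
        (18, 30, 22, 26),
        (10, 38, 14, 34),
        (4, 45, 6, 43),
    ]
    print(round(mantel_haenszel_chi_square(example), 4))
```

As in the listing, no guard is included for a zero denominator, which arises when every stratum has all responses correct or all incorrect, or contains only one group.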
BIBLIOGRAPHY
Adema, J. J. (1991). The construction of customized two-stage tests. Journal of Educational Measurement, 27, 241-253.
Agresti, A. (1990). Categorical data analysis. New York: Wiley.
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: SAGE.
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Belmont, CA: Wadsworth.
Angoff, W. H. (1972, September). A technique for the investigation of cultural differences. Paper presented at the annual meeting of the American Psychological Association, Honolulu. (ERIC Document Reproduction Service No. ED 069686)
Angoff, W. H. (1982). Use of difficulty and discrimination indices for detecting item bias. In R. A. Berk (Ed.), Handbook of methods for detecting test bias (pp. 96-116). Baltimore: Johns Hopkins University Press.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-23). Hillsdale, NJ: Lawrence Erlbaum Associates.
Angoff, W. H., & Ford, S. F. (1973). Item-race interaction on a test of scholastic aptitude. Journal of Educational Measurement, 10, 95-106.
Berk, R. A. (Ed.), (1982). Handbook of methods for detecting test bias. Baltimore: Johns Hopkins University Press.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Brookshire, W. K. (1993). Differential item functioning in the National Education Longitudinal Study of 1988 test battery. Paper presented at the Annual Meeting of the Southwest Educational Research Association, Austin, TX.
Camilli, G., & Shepard, L. A. (1987). The inadequacy of ANOVA for detecting test bias. Journal of Educational Statistics, 12(1), 87-99.
Cleary, T. A., & Hilton, T. L. (1968). An investigation of item bias. Educational and Psychological Measurement, 28, 61-75.
Cole, N. S. (1993). History and development of DIF. In P. W. Holland & H. Wainer (Eds.), Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
Crehan, K. D., Sireci, S. G., Haladyna, T. M., & Henderson, P. A. (1993). A comparison of testlet reliability for polytomous scoring methods. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA.
Crocker, L. M., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Holt, Rinehart & Winston.
Dillon, G. F., Henzel, T. R., Klass, D. J., LaDuca, A., & Peskin, E. (1993). Presenting test items clustered around patient cases: Psychometric concerns and practical implications for a medical licensure program. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA.
Donoghue, J. R., & Allen, N. L. (1993). Thin versus thick matching in the Mantel-Haenszel procedure for detecting DIF. Journal of Educational Statistics, 18(2), 131-154.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
Dorans, N. J., & Kulik, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23, 355-368.
Embretson, S. E. (Ed.), (1985). Test design: Developments in psychology and psychometrics. Orlando, FL: Academic Press.
Green, B. F. (1983). Adaptive testing by computer. In R.B. Ekstrom (Ed.), Measurement, technology, and individuality in education: New directions for testing and measurement No. 17. San Francisco: Jossey-Bass.
Green, B. F. (1988). Construct validity of computer-based tests. In H. Wainer & H. I. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum Associates.
Haladyna, T. M. (1992). The effectiveness of several multiple-choice formats. Applied Measurement in Education, 5, 73-88.
Hambleton, R. K., Zaal, J. N., & Pieters, J. P. M. (1991). Computerized adaptive testing: Theory, applications, and standards. In R. K. Hambleton & J. N. Zaal (Eds.), Advances in educational and psychological testing: Theory and applications. Boston: Kluwer Academic Publishers.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp.129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.
Holland, P. W., & Wainer, H. (Eds.) (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.
Kim, H., & Plake, B. S. (1993). Monte Carlo simulation comparison of two-stage testing and computerized adaptive testing. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA.
Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education. 2, 359-375.
Kingsbury, G. G., Zara, A. R., & Houser, R. L. (1993). Procedures for using response latencies to identify unusual test performance in computerized adaptive tests. Paper presented at the Annual Meeting of AERA, Atlanta, GA.
Lam, T. L., & Foong, Y. Y. (1991). Development and evaluation of hierarchical testlets in two-stage tests using integer linear programming. Paper presented at the Annual Meeting of AERA, Chicago, IL.
Lewis, C., & Sheehan, K. (1990). Using Bayesian decision theory to design a computerized mastery test. Applied Psychological Measurement. 14. 367-386.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lunz, M. E., & Stahl, J. A. (1993). Test targeting and precision before and after review on computerized adaptive tests. Paper presented at the Annual Meeting of AERA, Atlanta, GA.
Mantel, N. (1963). Chi-square tests with one degree of freedom, extensions of the Mantel-Haenszel procedure. American Statistical Association Journal. 58, 690-700.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute. 22. 719-748.
Marascuilo, L. A., & Slaughter, R. E. (1981). Statistical procedures for identifying possible sources of item bias based on chi-square statistics. Journal of Educational Measurement, 18. 229-248.
McArthur, D. L. (Ed.), (1989). Alternative approaches to the assessment of achievement. Boston: Kluwer Academic.
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics. 7, 105-118.
Miller, T. R., & Spray, J. A. (1992). A comparison of three methods for identifying nonuniform DIF in polytomously scored test items. Paper presented at the Psychometric Society Meeting, Columbus, OH.
Miller, T. R., & Spray, J. A. (1993). Logistic discriminant function analysis for DIF identification of polytomously scored items. Journal of Educational Measurement. 30. 107-122.
National Center for Education Statistics (1990). User's manual: National education longitudinal study of 1988. Publication of U.S. Department of Education: Office of Educational Research and Improvement. (NCES 90-464)
National Center for Education Statistics (1991). Technical Report: Psychometric report for the NELS:88 base year test battery. Publication of U.S. Department of Education: Office of Educational Research and Improvement. (NCES 91-468)
Norusis, M. J. (1990). SPSS advanced statistics user's guide. Chicago: SPSS.
Osterlind, S. J. (1983). Test item bias. Beverly Hills: Sage.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests: Expanded edition. Chicago: University of Chicago Press.
Reshetar, R. A., Norcini, J. J., & Shea, J. A. (1993). A simulated comparison of two content balancing and maximum information item selection procedures for an adaptive certification examination. Paper presented at the Annual Meeting of AERA, Atlanta, GA.
Rosenbaum, P. R. (1988). Item bundles. Psychometrika. 53. 349-359.
SAS Institute (1990). SAS procedures guide, version 6 (3rd ed.). Cary, NC: Author.
Scheuneman, J. (1979). A method of assessing bias in test items. Journal of Educational Measurement. 16(3), 143-152.
Sheehan, K., & Lewis, C. (1992). Computerized mastery testing with nonequivalent testlets. Applied Psychological Measurement. 16. 65-76.
Shepard, L., & Camilli, G. (1981). Comparison of procedures for detecting test-item bias with both internal and external ability criteria. Journal of Educational Statistics. 6(4), 317-375.
Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement. 28. 237-247.
Somes, G. W. (1986). The generalized Mantel-Haenszel statistic. The American Statistician. 40. 106-108.
SPSS Inc. (1990). SPSS reference guide. Chicago: Author.
Stahl, J. A., & Lunz, M. E. (1993). Assessing the extent of overlap of items among computerized adaptive tests. Paper presented at the Annual Meeting of AERA, Atlanta, GA.
Steinberg, L., Thissen, D., & Wainer, H. (1990). Validity. In Wainer, H. (Ed.) Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement. 27, 361-370.
Thissen, D., & Steinberg, L. (1984). A response model for multiple choice items. Psychometrika. 49. 501-519.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika. 51. 567-577.
Thissen, D., Steinberg, L., & Mooney, J. (1989). Trace lines for testlets: A use of multiple-categorical response models. Journal of Educational Measurement. 26, 247-260.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. Braun (Eds.), Test validity (pp. 147-169). Hillsdale, NJ: Lawrence Erlbaum Associates.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P.W. Holland & H. Wainer (Eds.), Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
Urry, V. W. (1977). Tailored testing: A successful application of latent trait theory. Journal of Educational Measurement. 14, 182-196.
Wainer, H. (1989). The future of item analysis. Journal of Educational Measurement. 26, 191-208.
Wainer, H. (1993). Some practical considerations when converting a linearly administered test to an adaptive format. Educational Measurement: Issues and Practices. 15-20.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement. 24, 185-201.
Wainer, H., & Lewis, C. (1990). Toward a psychometrics for testlets. Journal of Educational Measurement. 27, 1-14.
Wainer, H., Kaplan, B., & Lewis, C. (1992). A comparison of the performance of simulated hierarchical and linear testlets. Journal of Educational Measurement. 29, 243-251.
Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection. Journal of Educational Measurement. 28. 197-219.
Wainer, H., Lewis, C., Kaplan, B., & Braswell, J. (1991). Building algebra testlets: A comparison of hierarchical and linear structures. Journal of Educational Measurement. 28, 311-324.
Wainer, H., Dorans, N., Flaugher, R., Green, B., Mislevy, R., Steinberg, L., & Thissen, D. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.
Walpole, R. E., & Myers, R. H. (1989). Probability and statistics for engineers and scientists: Fourth edition. New York: Macmillan.
Weiss, D. J. (Ed.) (1983). New horizons in testing. New York: Academic Press.
Weiss, D. J., & Yoes, M. E. (1991). Item response theory. In R.K. Hambleton & J.N. Zaal (Eds.), Advances in educational and psychological testing: Theory and applications. Boston: Kluwer Academic.
Welch, C., & Hoover, H. D. (1993). Procedures for extending item bias detection techniques to polytomously scored items. Applied Measurement in Education. 6(1), 1-19.
Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: Mesa Press.
Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessment of differential item functioning for performance tasks. Journal of Educational Measurement. 30, 233-251.