Development and evaluation of a computer adaptive test of personality: the basic traits inventory
DEVELOPMENT AND EVALUATION OF A COMPUTER ADAPTIVE
TEST OF PERSONALITY: THE BASIC TRAITS INVENTORY
by
PAUL P. VORSTER
200603252
Thesis submitted in fulfilment
of the requirements for the degree
DOCTOR OF PHILOSOPHY in Industrial Psychology
in the
FACULTY OF MANAGEMENT
at the
Department of Industrial Psychology and People Management
Supervisor: Professor Gideon P. de Bruin
ABSTRACT
Background: Recent developments in technology have made the creation of computer adaptive
tests of personality a possibility. Despite the advances promised by computer adaptive testing,
personality testing has lagged behind ability testing in computer adaptive test development and adaptation. A principal reason that personality tests have not enjoyed computer adaptive adaptation is that few working computer adaptive tests are available for study or comparison with their original fixed form counterparts. In addition, personality tests tend to be
predominantly based on classical test theory, whereas item response theory is required for the
development of a computer adaptive test. Despite these impediments, numerous attitudinal
measures have been adapted to function as computer adaptive tests and have demonstrated good
psychometric properties and equivalence to their fixed form counterparts. As computer adaptive
testing holds numerous advantages both psychometrically and practically, the development of a
computer adaptive personality test may further advance psychometric testing of personality.
Research Purpose: This study aimed to address the lack of progress made in the field of computer
adaptive personality testing through the evaluation and simulated testing of a hierarchical
personality inventory, namely the Basic Traits Inventory (BTI), within a computer adaptive test
framework. The research aimed to demonstrate the process of computer adaptive test preparation
and evaluation (study 1 and study 2), as well as the simulation of the scales of the BTI as computer
adaptive tests (study 3). This was conducted to determine whether the BTI scales could be used as
computer adaptive tests, and to determine how the BTI computer adaptive scales compare to their
fixed form counterparts.
Research Design: A sample of 1962 South African adults completed the BTI for selection,
development, and career counselling purposes. The instrument was investigated on a scale-by-scale basis, with specific emphasis placed on scale dimensionality (study 1) and scale fit to the one-
dimensional Rasch item response theory model (study 2). These factor analytic and item response
theory evaluations were necessary to determine the suitability of the BTI scales for computer
adaptive testing as well as prepare the BTI for computer adaptive test simulation. Poor performing
items were removed and a set of ‘core’ items selected for computer adaptive testing. Finally, the
efficiency, precision, and equivalence of the person parameters generated by the computer
adaptive core scales, as simulated in a computer adaptive framework, were compared to their non-
adaptive fixed form counterparts to determine their metric equivalence and functioning (study 3).
Main Findings:
Study 1: The initial evaluation of dimensionality of the BTI scales indicated that the orthogonal
bifactor model was the best-fitting dimensional model for the BTI scales. The scales of the BTI were therefore not strictly unidimensional, but were instead composed of a dominant general factor with group factors (the facets) accounting for unique variance beyond the general factor.
Except for Extraversion, all other scales of the BTI evidenced general factor dominance, which
indicated that a total score could be interpreted for at least four of the five BTI scales. This total
score interpretation at the scale level allows the BTI to be used computer adaptively on the scale
(general factor) level. Although Excitement Seeking accounted for unique variance beyond the
general factor, the facet was still included when the scale was fit to the Rasch model.
Study 2: A total of 59 items were flagged for removal following fit to the one-dimensional Rasch
model. These items were flagged because they did not fit the Rasch rating scale model effectively or because they demonstrated uniform or non-uniform differential item functioning (DIF) by ethnicity and/or gender. Item parameters were also generated for the shortened and optimised BTI scales (core scales) for computer adaptive adaptation. In general, all the scales of the BTI fit the Rasch model well after the flagged items were removed, which justified the inclusion of the Excitement Seeking facet in the Extraversion scale for computer adaptive testing.
Study 3: The optimised computer adaptive ‘core’ BTI scales used on average 50–67% fewer items than their fixed form non-computer adaptive counterparts during computer adaptive test simulation. Person parameters were estimated at or below the standard error criterion of .33, which indicates rigorous measurement precision. The BTI scales also demonstrated strong correlations with their non-adaptive full form counterparts, ranging from .89 (Extraversion) to .94 (Neuroticism).
Summary and Implications: It is possible for a standard non-computer adaptive test of
personality to be converted into a computer adaptive test without compromising the psychometric
properties of the instrument. Studies 1 and 2 were evaluative and helped to prepare the BTI scales
for computer adaptive test application. The final study indicated that good scale preparation results
in better equivalence between the computer adaptive and fixed form non-computer adaptive tests.
Additionally, the ‘prepared’ item banks attained a lower standard error of person parameter estimation as well as greater item administration efficiency. Although future research should take test-mode differences and content balancing of subscales into consideration, the research demonstrates that a computer adaptive test of personality can be as precise, reliable, and accurate as, and more efficient than, its fixed form non-computer adaptive counterpart.
TABLE OF CONTENTS
ABSTRACT ............................................................................................................................... 2
ACKNOWLEDGEMENTS ....................................................................................................... 9
LIST OF TABLES ................................................................................................................... 10
LIST OF FIGURES ................................................................................................................. 12
CHAPTER 1: INTRODUCTION AND ORIENTATION TO THE STUDY ......................... 14
1.1. Introduction ....................................................................................................................14
1.2. The progress made in the computer adaptive testing of personality ..............................15
1.3. Overview of the present study ........................................................................................18
CHAPTER 2: BACKGROUND OF COMPUTER ADAPTIVE TESTING .......................... 20
2.1. Key terms in computer adaptive testing .........................................................................20
2.2. Equivalence of computer adaptive tests to non-computer adaptive tests .......................23
2.3. Classical test theory and item response theory ...............................................................27
2.3.1. The problem of mean error estimation in classical test theory .......................................30
2.3.2. The impact of the number of test items administered ....................................................32
2.3.3. Local independence, person-free items estimation, and item-free person estimation ....33
2.3.4. Measurement invariance .................................................................................................38
2.4. The advantages of computerized adaptive testing ..........................................................39
2.4.1. Increased relevance for test-takers .................................................................................40
2.4.2. Reduction of testing time................................................................................................41
2.4.3. Reducing the burden of testing .......................................................................................42
2.4.4. Immediate feedback after testing....................................................................................42
2.4.5. Testing is not limited to one setting ...............................................................................43
2.4.6. Greater test security ........................................................................................................43
2.4.7. Invariant measurement ...................................................................................................44
2.4.8. Error estimates for each item ..........................................................................................44
2.4.9. Advancement of psychometric testing in general ..........................................................44
2.5. The requirements for the development of computer adaptive tests ................................45
2.6. Preview of the contents of the following chapters .........................................................46
CHAPTER 3: THE DIMENSIONALITY OF THE BTI SCALES ......................................... 48
3.1. Introduction ....................................................................................................................48
3.1.1. Testing the dimensionality of hierarchical personality scales ........................................49
3.1.2. Evaluation of the dimensionality of hierarchical personality scales ..............................52
3.1.3. The Basic Traits Inventory (BTI) ...................................................................................54
3.2. Method ............................................................................................................................55
3.2.1. Participants…………………………………………………………………………….55
3.2.2. Instrument……………………………………………………………………………...56
3.2.3. Data Analysis..................................................................................................................56
3.2.4. Ethical Considerations ....................................................................................................58
3.3. Results ............................................................................................................................58
3.3.1. Fit indices for the bifactor model (Model 3) ...................................................................60
3.3.2. Reliability of the BTI scales ...........................................................................................61
3.3.3. Reliability of the BTI subscales .....................................................................................62
3.3.4. The bifactor pattern matrix .............................................................................................64
3.4. Discussion.......................................................................................................................66
3.4.1. The fit of the bifactor model ...........................................................................................66
3.4.2. The dimensionality of the BTI scales .............................................................................67
3.4.3. Implications for fit to one-dimensional item response theory models ...........................68
3.5. Overview of Chapter 3 and a preview of Chapter 4 ..................................................69
CHAPTER 4: FITTING THE BTI SCALES TO THE RASCH MODEL ............................ 70
4.1. Introduction ....................................................................................................................70
4.1.1. The use of the Rasch model for computer adaptive test development ...........................71
4.1.2. The application of Rasch diagnostic criteria for psychometric evaluation ....................73
4.2. Method ............................................................................................................................75
4.2.1. Participants …………………………………………………………………………….75
4.2.2. Instrument……………………………………………………………………………...75
4.2.3. Data Analysis..................................................................................................................76
4.2.4. Ethical Considerations ....................................................................................................78
4.3. Results ............................................................................................................................82
4.3.1. BTI scale infit and outfit statistics ..................................................................................82
4.3.2. Person separation and reliability indices ........................................................................86
4.3.3. Rating scale performance ...............................................................................................87
4.3.4. Differential item functioning ..........................................................................................92
4.3.5. Criteria for item exclusion from the core item bank ......................................................98
4.3.6. Functioning of the ‘core’ BTI scales ..............................................................................99
4.3.7. Cross-plotting person parameters of the full-test and the reduced test scales ..............113
4.4. Discussion.....................................................................................................................118
4.4.1. Rasch rating scale model fit .........................................................................................119
4.4.2. Item spread and reliability ............................................................................................119
4.4.3. Rating scale performance .............................................................................................120
4.4.4. DIF by ethnicity and gender .........................................................................................121
4.4.5. Conclusion…………………………………………………………………………….121
4.5. Overview of the current chapter and preview of the forthcoming chapter ...................123
CHAPTER 5: AN EVALUATION OF THE COMPUTER ADAPTIVE BTI ..................... 125
5.1. Introduction ..................................................................................................................125
5.1.1. Computer adaptive test simulation ...............................................................................126
5.1.2. Item banks used in computer adaptive testing ..............................................................128
5.1.3. Computer adaptive testing ............................................................................................129
5.2. Method ..........................................................................................................................142
5.2.1. Participants……………………………………………………………………………142
5.2.2. Instrument…………………………………………………………………………….142
5.2.3. Data Analysis................................................................................................................143
5.2.4. Ethical Considerations ..................................................................................................150
5.3. Results ..........................................................................................................................150
5.3.1. Comparing person parameter estimates of the different BTI scales .............................151
5.3.2. Computer adaptive core test performance indices........................................................167
5.4. Discussion.....................................................................................................................179
5.4.1. Correlations between person parameter estimates of the various adaptive and non-adaptive
test forms……………………………………………………………………………...180
5.4.2. Adaptive core and adaptive full performance indices ..................................................183
5.4.3. Item usage statistics ......................................................................................................185
5.4.4. Implications for computer adaptive testing of personality ...........................................185
5.4.5. Recommendations for future research ..........................................................................186
5.4.6. Conclusion and final comments ...................................................................................189
CHAPTER 6: DISCUSSION AND CONCLUSION ............................................................ 191
6.1. Introduction ..................................................................................................................191
6.1.1. Aims and objectives of the three studies ......................................................................191
6.1.2. Study 1 objectives: The dimensionality of the BTI scales ...........................................192
6.1.3. Study 2 objectives: Fitting the BTI scales to the Rasch model: Evaluation and selection of a
core item bank for computer adaptive testing ..............................................................193
6.1.4. Study 3 objectives: An evaluation of the simulated Basic Traits Inventory computer adaptive
test…………………………………………………………………………………….193
6.2. Discussion of Results for the Three Studies .................................................................194
6.2.1. Study 1 results: The dimensionality of the BTI scales .................................................194
6.2.2. Study 2 results: Fitting the BTI scales to the Rasch model ..........................................197
6.2.3. Study 3 results: An evaluation of the computer adaptive BTI .....................................198
6.3. Limitations and suggestions for future research ...........................................................199
6.4. Implications for practice ...............................................................................................201
6.5. Conclusion ....................................................................................................................202
REFERENCES ...................................................................................................................... 203
APPENDIX A: ITEM USAGE STATISTICS FOR THE ADAPTIVE FULL AND
ADAPTIVE CORE TEST VERSIONS ................................................................................. 227
APPENDIX B: MAXIMUM ATTAINABLE INFORMATION WITH SUCCESSIVE ITEM
ADMINISTRATION ............................................................................................................. 232
APPENDIX C: NUMBER OF ITEMS ADMINISTERED ACROSS THE TRAIT
CONTINUUM ....................................................................................................................... 237
ACKNOWLEDGEMENTS
I am deeply indebted to the following people and institutions:
Professor G. P. de Bruin, thank you Professor for all the kind, and hard, words and for
all the support you have given me. I am greatly indebted to you for making this possible. I have
learned so much from you and I hope that I can add to our field and make you proud.
I dedicate this doctoral thesis to two people who have walked the hard miles with me. Firstly,
to my mother Maxine Vorster, without your help, guidance, support and love I would never
have found the courage to complete this endeavour. Thank you for standing by me through
this adventure called ‘life’.
Secondly, but by no means second, thank you Marié Minnaar for standing by me and
giving up our quality time so that I could complete this work. You are truly the love of my life
and without that love I would have been lost. Thank you from the deepest part of my heart.
You are my touchstone.
I would also like to specially thank Professor Freddie Crous. Thank you Professor for
the words of encouragement and the hope you have instilled in me to be the best I can be.
Thank you for always being interested and encouraging. It has meant more to me than you will
ever know.
Finally, I would like to thank Dr. Nicola Taylor and Dr. Brandon Morgan for their
constant assistance and support. Without the two of you I would not have had the motivation
to embark and complete this endeavour. Thank you both for being not only fantastic colleagues,
but good friends.
I would also like to give a final thanks to the Centre for Work Performance and the
Department of Industrial Psychology and People Management. Thank you for your support,
both academically and financially.
LIST OF TABLES
Table 3.1 Three confirmatory factor models of the structure of the BTI scales 59
Table 3.2 Chi-square difference test of the three factor models 60
Table 3.3 Proportion of specific and total variance explained by factors and facets of the BTI 62
Table 3.4 Standardised factor loadings for Neuroticism (Model 3) 64
Table 4.1 Item and person mean summary infit and outfit statistics 82
Table 4.2 Item infit mean squares for the BTI scales 83
Table 4.3 Item outfit mean squares for the BTI scales 85
Table 4.4 Person separation and reliability indices 87
Table 4.5 Rating scale performance indices 88
Table 4.6 Practically significant DIF by ethnicity 93
Table 4.7 Practically significant DIF by gender 96
Table 4.8 Item and person mean summary infit and outfit statistics after item removal 100
Table 4.9 Item infit statistics for the scales of the BTI after flagged items were removed 101
Table 4.10 Item outfit statistics for the scales of the BTI after flagged items were removed 103
Table 4.11 Person separation and reliability indices after item removal 105
Table 4.12 Rating scale performance indices after item removal 106
Table 4.13 Practically significant DIF by ethnicity 110
Table 4.14 Practically significant DIF by gender 112
Table 5.1 Correlations between test-form person parameter estimates for the BTI Extraversion scale 151
Table 5.2 Correlations between test-form person parameter estimates for the BTI Neuroticism scale 155
Table 5.3 Correlations between test-form person parameter estimates for the BTI Conscientiousness scale 158
Table 5.4 Correlations between test-form person parameter estimates for the BTI Openness scale 161
Table 5.5 Correlations between test-form person parameter estimates for the BTI Agreeableness scale 164
Table 5.6 Performance indices of the adaptive core and full item banks 168
Table 5.7 Percentage of items not administered by the adaptive test versions 178
LIST OF FIGURES
Figure 4.3.3a Person/item distribution for the Extraversion scale 90
Figure 4.3.3b Person/item distribution for the Neuroticism scale 90
Figure 4.3.3c Person/item distribution for the Conscientiousness scale 91
Figure 4.3.3d Person/item distribution for the Openness scale 91
Figure 4.3.3e Person/item distribution for the Agreeableness scale 92
Figure 4.3.5.3a Person/item distribution for the core Extraversion scale 107
Figure 4.3.5.3b Person/item distribution for the core Neuroticism scale 108
Figure 4.3.5.3c Person/item distribution for the core Conscientiousness scale 108
Figure 4.3.5.3d Person/item distribution for the core Openness scale 109
Figure 4.3.5.3e Person/item distribution for the core Agreeableness scale 109
Figure 4.3.6a Cross plot of person measures for the full and core Extraversion scales 114
Figure 4.3.6b Cross plot of person measures for the full and core Neuroticism scales 115
Figure 4.3.6c Cross plot of person measures for the full and core Conscientiousness scales 115
Figure 4.3.6d Cross plot of person measures for the full and core Openness scales 116
Figure 4.3.6e Cross plot of person measures for the full and core Agreeableness scales 116
Figure 5.3.1.1 Cross plot of person measures for the adaptive core and non-adaptive full scales of Extraversion 153
Figure 5.3.1.2 Cross plot of person measures for the adaptive core and non-adaptive full scales of Neuroticism 156
Figure 5.3.1.3 Cross plot of person measures for the adaptive core and non-adaptive full scales of Conscientiousness 160
Figure 5.3.1.4 Cross plot of person measures for the adaptive core and non-adaptive full scales of Openness 163
Figure 5.3.1.5 Cross plot of person measures for the adaptive core and non-adaptive full scales of Agreeableness 165
Figure 5.6a Maximum attainable information contributed with each item administered for the Extraversion full item bank 172
Figure 5.6b Maximum attainable information contributed with each item administered for the Extraversion core item bank 173
Figure 5.11a Number of items administered across the trait continuum for the Extraversion full item bank 175
Figure 5.11b Number of items administered across the trait continuum for the Extraversion core item bank 176
CHAPTER 1: INTRODUCTION AND ORIENTATION TO THE STUDY
“Computer adaptive testing…a methodology whose time has come?” – Michael Linacre
(2000, p.1)
1.1. Introduction
Personality measurement has to move forward. Psychologists and psychometricians
have for too long relied on non-adaptive non-computerised tests based on classical test
theory to measure personality (Crocker & Algina, 1986; McDonald, 1999; Weiss, 2004;
Zickar & Broadfoot, 2009). Paraphrasing Linacre (2000), computer adaptive testing is a
methodology that is ready to be applied to attitudinal inventories (such as personality
inventories). Although computerised adaptive testing has enjoyed widespread attention
in some psychometric testing domains, such as ability testing, personality testing has not
shared the same progress (Forbey & Ben-Porath, 2007; Forbey, Ben-Porath, & Arbisi,
2012; Hol, Vorst, & Mellenbergh, 2008; Hsu, Zhao, & Wang, 2013).
To illustrate the slow progress of computer adaptive personality testing, a
comparison must be made with computer adaptive ability testing. The first adaptive test
to make use of computer technology was the Armed Services Vocational Aptitude Battery
or ASVAB (de Ayala, 2009). The ASVAB was ready for computer adaptive testing in
1979 (Gershon, 2004). Although computer technology still needed to progress for the test
to be widely implemented, it was, for all intents and purposes, ready for practical testing
(Gershon, 2004). In contrast, computer adaptive personality testing has not yet entered
the practical testing arena (Hsu et al., 2013): only a limited number of simulated computer
adaptive personality test versions – computer adaptive tests that use non-adaptive
responses to simulate adaptive testing – are available, and none is used in praxis (Stark,
Chernyshenko, Drasgow, & White, 2012).
Some of the reasons for the slow development of computer adaptive tests of
personality include the expense of computer-based testing; the limited spread of the
internet; the relative novelty of the form of testing; the lack of an integrated and universal
model of personality; and the complexity of measuring personality in a computer adaptive
manner (McCrae, 2002; Ortner, 2008; Stark et al., 2012). However, advances have been
made in these areas, and many of these challenges have been overcome, making computer
adaptive personality testing attainable (Linacre, 2000; Ortner, 2008; Rothstein & Goffin,
2006).
Despite the challenges faced in the development of computer adaptive tests of
personality, some limited progress has been made. The following section in this chapter
will briefly outline and report on this progress.
1.2. The progress made in the computer adaptive testing of personality
Although computer adaptive personality testing has made some progress, this
progress has been compromised by a general lack of application in praxis. For example, the
first personality test to become computer adaptive was the California Psychological
Inventory (CPI) in 1977 (Ben-Porath & Butcher, 1986). Unfortunately, the costs of
computer technology, at the time, made the widespread and un-simulated use of this test
unfeasible, and thus the CPI computer adaptive version fell into relative obscurity (Ben-
Porath & Butcher, 1986). Thankfully, the computer adaptive adaptation of the CPI
garnered the attention of researchers, which stimulated interest in the field of computer
adaptive personality testing.
Only a year after the computer adaptive CPI was developed, Kreitzberg, Stocking and
Swanson (1978) published an early article entitled Computerized Adaptive Testing:
Principles and Directions. This article proposed that the future of ability, clinical and
personality testing would reside in the computer adaptive domain (Kreitzberg et al.,
1978). The authors suggested that the – at the time – newly applied item response theory
model could be used to develop item banks from which items would be administered to
test-takers in an adaptive manner using computer technology (Kreitzberg et al., 1978).
These authors argued that computer adaptive tests were more efficient – using fewer
items – while still being capable of rigorously estimating the test-taker’s standing on the
latent construct being measured in a fair and reliable manner (Kreitzberg et al., 1978).
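The adaptive logic described here – drawing the most informative item from an item bank, re-estimating the test-taker's trait level, and stopping once precision is adequate – can be sketched in a few lines. The following Python sketch is illustrative only: the item bank values, the Newton-Raphson safeguards, and the function names are assumptions for demonstration, not part of the BTI or of Kreitzberg et al.'s work; only the .33 standard error stopping rule echoes the criterion used later in this thesis.

```python
import math

# Hypothetical item bank: Rasch difficulty parameters for dichotomous items.
# These values are illustrative and are not taken from the BTI.
ITEM_BANK = [-1.5, -0.8, -0.2, 0.3, 0.9, 1.6, 2.1]

def rasch_information(theta, b):
    """Fisher information of a dichotomous Rasch item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def simulate_cat(responses, se_stop=0.33):
    """Replay pre-recorded 0/1 responses adaptively, as in a simulated CAT.

    Items are selected by maximum information at the current trait estimate,
    and testing stops once the standard error criterion is met or the bank
    is exhausted. Returns the final trait estimate and the items used.
    """
    theta, administered = 0.0, []
    while len(administered) < len(ITEM_BANK):
        # 1. Administer the unused item that is most informative at theta.
        remaining = [i for i in range(len(ITEM_BANK)) if i not in administered]
        administered.append(
            max(remaining, key=lambda i: rasch_information(theta, ITEM_BANK[i])))
        # 2. Re-estimate theta with damped Newton-Raphson steps; the step cap
        #    and the [-4, 4] clip are simple safeguards against divergence.
        for _ in range(20):
            grad = sum(responses[i] - 1.0 / (1.0 + math.exp(-(theta - ITEM_BANK[i])))
                       for i in administered)
            info = sum(rasch_information(theta, ITEM_BANK[i]) for i in administered)
            theta += max(-1.0, min(1.0, grad / info))
            theta = max(-4.0, min(4.0, theta))
        # 3. Stop once the standard error (1 / sqrt(information)) is small enough.
        info = sum(rasch_information(theta, ITEM_BANK[i]) for i in administered)
        if 1.0 / math.sqrt(info) <= se_stop:
            break
    return theta, administered
```

With this toy seven-item bank the total information can never reach the level required for a standard error of .33, so every item ends up being administered; operational item banks are far larger, which is what makes the item savings claimed for adaptive testing possible.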
Although Kreitzberg et al. (1978) created some theoretical impetus for computer adaptive
testing, little progress was made until the Minnesota Multiphasic Personality Inventory
(MMPI) was analysed using a one-parameter item response theory model for computer
adaptive testing applications (Ben-Porath & Butcher, 1986; Carter & Wilkinson, 1984).
Unfortunately, the technology to make the computer adaptive version of the MMPI
feasible was not yet readily available at the time (Ben-Porath & Butcher, 1986) and this
computer adaptive instrument, like the computer adaptive CPI, fell into obscurity.
With the publication of Computers in Personality Assessment: A Brief Past, an
Ebullient Present, and an Expanding Future by Ben-Porath and Butcher (1986), the case
was once again made for the computer adaptive adaptation and development of
personality inventories. Unfortunately, this article reported little practical progress in the
domain, although theoretical progress was slowly being made. Luckily, advances in
computer technology and the development of the five factor model of personality greatly
facilitated the computerisation of personality testing in industry in the 1990s, albeit in a
non-adaptive manner (Digman, 1989; Goldberg, 1992; Stark et al., 2012).
The constraints imposed on the development of computer adaptive tests of
personality were thus starting to lift (Rothstein & Goffin, 2006). However, by 2006
the literature still reported a lack of progress in the computer adaptive personality
domain (Rothstein & Goffin, 2006) with only the MMPI and NEO-PI-R making any
progress in the field (Forbey, Handel, & Ben-Porath, 2000; Reise & Henson, 2000).
Currently, on the international level, the Graduate Record Examination (GRE),
the Graduate Management Admission Test (GMAT), and the Test of English as a Foreign
Language (TOEFL) are the most widely used computer adaptive tests with an estimated
11 million users every year (Kaplan & Saccuzzo, 2013). The GRE, GMAT and TOEFL
are ability based and illustrate how far personality testing is lagging behind ability testing
in the computer adaptive domain.
A very similar trend can be seen in South Africa, where only computer adaptive
tests of ability, such as the General Scholastic Aptitude Test (GSAT) and the Learning
Potential Computerised Adaptive Test (LPCAT), have been developed for use in praxis
(Claassen, Meyer, & Van Tonder, 1992; de Beer, 2005). Unfortunately, no computer
adaptive personality tests have been developed or investigated for use in South Africa to
date. However, Hobson (2015) recently developed a computer adaptive test of the Self-
Control subscale of the Trait Emotional Intelligence Questionnaire (TEIQue) in South
Africa. As this inventory is trait-based, it shares a similar item structure to standard
personality inventories.
With the continued growth of personality testing (Stark et al., 2012), the need to
make assessments shorter and more relevant for test-takers (Haley, Ni, Hambleton,
Slavin, & Jette, 2006), and the exponential growth of computer technology, the time for
computer adaptive personality testing has come. It is therefore surprising that a recent
article by Simms, Goldberg, Roberts, Watson, Welte and Rotterman (2011) still refers to
a lack of progress in the computer adaptive personality test domain.
One of the possible reasons for the lack of development and research on computer
adaptive personality tests may be because no such inventories are readily available for
investigation and scrutiny (Forbey & Ben-Porath, 2007). Although numerous studies have
investigated whether personality tests can effectively fit item response theory models,
which are an essential requirement for computer adaptive testing, not many have
investigated the functioning of personality inventories using an actual computer adaptive
framework (Forbey & Ben-Porath, 2007). Therefore, most studies focusing on the
computer adaptive testing of personality are feasibility oriented and refrain from
evaluating the psychometric properties of personality tests in their computer adaptive
format (Forbey & Ben-Porath, 2007). The lack of validity evidence from
computer adaptive simulation and testing of personality inventories has contributed
substantially to the slow progress in this domain.
The lack of validity evidence for computer adaptive tests of personality is a core
area addressed by this study. The next section will give an overview of the objectives of
the present study and how a computer adaptive test of personality will be evaluated for
use in the South African context.
1.3. Overview of the present study
This study aimed to address the lack of progress made in the field of computer
adaptive personality testing through the evaluation and testing of a hierarchical
personality inventory, namely the Basic Traits Inventory or BTI, within a computer
adaptive framework. This was accomplished through completion of three independent
studies which are discussed in Chapter 3, 4, and 5 respectively. The psychometric
properties of the BTI were systematically evaluated so that the instrument could be
prepared for computer adaptive testing applications in Chapters 3 and 4 respectively.
After initial psychometric evaluation and preparation, the revised version of the BTI was
simulated as a ‘running’ computer adaptive test within a computer adaptive testing
framework in Chapter 5. Additionally, in Chapter 5 the psychometric properties and
efficiency of the computer adaptive BTI were compared to those of its non-computer
adaptive counterpart to determine whether the two versions were psychometrically equivalent.
In the next chapter (Chapter 2) key terms in computer adaptive testing are defined
and the statistical measurement models on which such testing depends are discussed. The
equivalence of computer adaptive tests when compared to their non-computer adaptive
counterparts is also explored in Chapter 2. Finally, a process for the development and
evaluation of a computer adaptive test of personality is presented.
CHAPTER 2: BACKGROUND OF COMPUTER ADAPTIVE TESTING
“Administer an item that is much too hard, and the candidate may immediately fall into
despair, and not even attempt to do well” – Michael Linacre (2000, p.5)
2.1. Key terms in computer adaptive testing
A distinction has to be made between computer based, adaptive, and computer
adaptive testing because these terms are often confused (Triantafillou, Georgiadou, &
Economides, 2008). Computer based testing refers to the mode of testing whereas
adaptive and non-adaptive testing refers to the testing strategy employed (Triantafillou et
al., 2008).
A computer based test is usually completed by a test-taker via computer
(Triantafillou et al., 2008) and may be either adaptive, or non-adaptive in nature
(Thompson & Weiss, 2011). This makes computer based testing distinct from computer
adaptive testing which is both computer based and adaptive (Thompson & Weiss, 2011).
In an adaptive test the test-taker is given items that closely approximate his/her
ability or trait level (Linacre, 2000; Thompson & Weiss, 2011). This does not mean that
adaptive tests need to always be computer-based. For example, Alfred Binet as early as
1905 adaptively administered items on the Binet-Simon intelligence test to test-takers of
varying ability levels by rank ordering the difficulty of items and then subjectively
deciding, based on the performance of the test-taker, which items were most appropriate
to administer (Linacre, 2000). In computer adaptive testing the computer, through the
use of item selection algorithms, administers the most appropriate item for the currently
estimated ability/trait level of the test-taker (Lai, Cella, Chang, Bode, & Heinemann,
2003). Therefore, in computer adaptive testing, the computer, not the examiner, selects
the most appropriate items for administration. Conversely, non-adaptive tests present the
test-taker with all the items in the item-bank, in the same order, regardless of the test-
taker's currently estimated ability or trait level (Walker, Böhnke, Cerny & Strasser, 2010).
Therefore, non-adaptive tests may or may not be computer-based, depending on
the nature of their administration (Thompson & Weiss, 2011). A test is therefore only
considered computer adaptive if it is (1) computer based, (2) adaptive, and (3) when the
computer selects the items deemed most relevant for the test-taker’s ability or trait level
through the use of a pre-specified computer algorithm (Dodd, de Ayala, & Koch, 1995;
Gershon, 2004; Simms et al., 2011).
Another aspect of computer adaptive testing that is not well understood is item
difficulty, in the case of ability testing, and item endorsability, in the case of self-report
items. Item difficulty refers to the probability that a test-taker will respond
correctly to an item, whereas item endorsability refers to the probability that a test-taker
will respond affirmatively (to a greater or lesser degree in the case of Likert-type
scales) to the statement of a self-report item (de Ayala, 2009; Thompson & Weiss, 2011).
This is why item response theory models, which express the probability of answering
an item correctly (or of endorsing it) as a logistic function of the test-taker's trait level,
are so important in the development of computer adaptive tests (Linacre, 2000). Most
important to self-report measures is the
use of partial credit models where test-takers are given partial credit for responses on
Likert-type scales (Masters, 1982; Verhelst & Verstralen, 2008).
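The logic of these models can be illustrated with a brief sketch. Under the dichotomous Rasch model, the probability that a test-taker endorses an item is a logistic function of the difference between the test-taker's trait level and the item's location (endorsability), both expressed in logits; the partial credit model extends the same idea to Likert-type response categories. The sketch below is a minimal illustration, and all numeric values are hypothetical:

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of endorsing an item under the dichotomous Rasch
    model: a logistic function of the gap between the test-taker's
    trait level (theta) and the item's location (b), in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A test-taker whose trait level equals the item's location has a
# 50% chance of endorsing the item.
p_matched = rasch_probability(theta=0.0, b=0.0)   # 0.5

# A test-taker one logit above the item location is more likely
# to endorse it (probability approximately .73).
p_above = rasch_probability(theta=1.0, b=0.0)
```

The partial credit model replaces this single endorsement probability with one threshold per step between adjacent Likert categories, but the underlying logistic form is the same.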
In this way, test-takers with a high ability/trait level should have a higher
probability of endorsing items than test-takers with a lower ability/trait level (Wauters,
Desmet, & Noortgate, 2010). In computer adaptive testing the computer determines
which items, with a relative difficulty/endorsability, will be administered to test-takers of
a particular estimated ability/trait level (Weiss, 2013) and vice versa. Therefore, test-
takers of a low ability/trait level will be given more items that approximate this
ability/trait level, and fewer items that are of a greater difficulty or lower endorsability
(Weiss, 2013).
Consequently, no two test-takers will be given exactly the same test – with the
same items – because the estimated ability/trait level of the test-taker and the
difficulty/endorsability of the items are approximately matched for each unique
individual (Thompson & Weiss, 2011).
Consequently, item difficulty/endorsability is matched adaptively to the estimated
ability/trait level on the construct of the test-taker (Meijer & Nering, 1999) and only
enough items are administered to determine his/her ability or trait level with sufficient
precision (de Ayala, 2009). Therefore, potentially numerous items, which provide
relatively little information about the test-taker’s ability or trait level, are left out,
shortening the test (Thompson & Weiss, 2011). This also makes the test optimally
relevant for the test-takers as they are only exposed to the items that match, or
approximate, their estimated ability/trait level throughout the adaptive testing process
(Haley et al., 2006). This is why computer adaptive tests tend to be more efficient and
relevant for test-takers than their non-adaptive counterparts and are thus favourably
presented in the literature (Wang & Kolen, 2001).
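The adaptive matching described above can be sketched as a simple maximum-information item-selection rule. This is a minimal illustration under the Rasch model, not a description of any particular operational system; the item bank, trait estimate, and administered-item set are hypothetical:

```python
import math

def item_information(theta: float, b: float) -> float:
    """Rasch item information at trait level theta: p * (1 - p),
    which is largest when the item location b equals theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def select_next_item(theta_hat, bank, administered):
    """Pick the unused item whose location yields the most
    information at the current provisional trait estimate."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, bank[i]))

# Hypothetical item bank: item locations (endorsability) in logits.
bank = [-2.0, -1.0, 0.0, 1.0, 2.0]
theta_hat = 0.4          # current provisional trait estimate
administered = {2}       # the item at 0.0 logits was already given

next_item = select_next_item(theta_hat, bank, administered)
# The remaining item closest to theta_hat (1.0 logits) is selected.
```

In a full adaptive administration this selection step alternates with re-estimation of the trait level after each response, and testing stops once the standard error of the estimate falls below a pre-specified threshold.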
However, the drastic movement away from classical test theory to item response
theory, which is a required measurement model for the development and use of computer
adaptive tests, has raised some doubt about the equivalence of computer adaptive tests
when compared to their well-researched and evaluated non-computer adaptive
counterparts (Ortner, 2008). These doubts have impeded the progress of computer
adaptive personality testing (Ortner, 2008). It is therefore logical to assume that computer
adaptive tests of personality would only be widely accepted and implemented in praxis if
they are shown to be psychometrically equivalent to their non-computer adaptive
counterparts. More importantly, computer adaptive personality tests should also exceed
their non-computer adaptive counterparts by increasing test efficiency and overcoming
obstacles encountered in ‘classical’ personality testing. Thankfully, numerous studies
have reported on the psychometric properties of computer adaptive tests in the non-ability
testing domain (Betz & Turner, 2011; Chien, Wu, Wang, Castillo, & Chou, 2009; Forbey
& Ben-Porath, 2007; Gibbons et al., 2008; Hol et al., 2008; Pitkin & Vispoel, 2001; Reise
& Henson, 2000). Unfortunately, only a limited number of these studies report on the
psychometric properties of computer adaptive versions of personality inventories.
Nevertheless, these studies are useful additions to the literature of the current study
because they report on attitudinal measures, which, in the technical sense, are
indistinguishable from personality testing. Consequently, these studies are reviewed in
the next section.
2.2. Equivalence of computer adaptive tests to non-computer adaptive tests
A meta-analysis by Pitkin and Vispoel (2001) reported on fifteen peer-reviewed
articles that pertain to the properties of computer adaptive tests for self-report (attitudinal)
inventories. This meta-analysis indicated that these computer adaptive tests administered
approximately half the items that non-computer adaptive tests administered while still
reliably, validly and precisely measuring the constructs under consideration (Pitkin &
Vispoel, 2001). In addition, an average internal consistency reliability coefficient of .87
for the computer adaptive tests reviewed was reported indicating a high consistency of
measurement for computer adaptive versions of non-adaptive tests (Pitkin & Vispoel,
2001).
Additionally, Gibbons et al. (2008) reported a 96% reduction in the number of items
used to estimate the mood and anxiety levels of outpatient groups using a computer
adaptive version of the Mood and Anxiety Spectrum Scales (MASS). This study reported
a correlation > .90 between the shortened computer adaptive scale of the MASS and the
non-adaptive full scale (Gibbons et al., 2008). These results indicated that the computer
adaptive MASS measured the same constructs using fewer items than its non-adaptive
counterpart.
In another computer adaptive test evaluation of an attitudinal measure, Hol et al.
(2008) reported using only 78% of the original items of the Dominance scale of the
Adjective Checklist, or ACL, while achieving a rigorous standard error of person
parameter estimation of <.30 logits. Even though fewer items were administered, the ACL
computer adaptive version still correlated .99 with the latent trait estimates of the full
scale (Hol et al., 2008).
Similarly, Betz and Turner (2011) reported using 25% of the original 100 items of
the Career Confidence Inventory (CCI) within a standard error of person parameter
estimation of <.40, which is considered rigorous. These authors also reported a strong
correlation (.93) between the ability or trait estimates of the computer adaptive and non-
computer adaptive versions of the CCI, thus demonstrating that the computer adaptive
CCI could reliably, precisely and accurately measure the same constructs as its non-
computer adaptive counterpart.
Reise and Henson (2000) in an attempt to computerise the NEO-PI-R found that the
computer adaptive version of the test used only 50% of the original number of items to
estimate personality traits effectively. Out of the thirty facets of the NEO-PI-R, which are
composed of eight items per facet (with a total number of 240 items) the computer
adaptive version used on average four items per facet, for all the facets, to estimate test-
takers’ standing on the latent personality traits (Reise & Henson, 2000). This amounted
to using half the items of the total test. Despite using fewer items, the correlation between
trait scores of the computer adaptive and non-computer adaptive versions of the NEO-PI-
R was >.91 across the facets (Reise & Henson, 2000). This study is of particular
interest because it demonstrates that a computer adaptive version of a personality scale
can be as precise, accurate, and reliable as (and more efficient than) its non-computer
adaptive counterpart.
Computer adaptive tests have also been used in clinical settings as there is a greater
need for instruments to be shorter and more relevant to ease the burden on patients (Chien
et al., 2009). For example, Chien et al. (2009) compared the simulated computer adaptive
version of the Activities of Daily Living (ADL) inventory, which measures how easily
patients are able to complete simple day-to-day activities, with its non-computer adaptive
counterpart. The computer adaptive version of the ADL administered 13.42 items to test-
takers on average whereas the non-computer adaptive version of the ADL administered
all 23 items to test-takers. Additionally, the study found no significant mean differences
between responses on the computer adaptive and non-computer adaptive versions of the
ADL, suggesting measurement equivalence between the
two instruments (Chien et al., 2009).
One of the most famous and widely used clinical tools, the MMPI-2, is another
instrument that has garnered interest in the computer adaptive testing domain (Forbey &
Ben-Porath, 2007). Both the computer adaptive version of the MMPI-2 and the non-
computer adaptive version were compared for length and equivalence of measurement
(Forbey & Ben-Porath, 2007). On average, the computer adaptive version of the MMPI-
2 used between 17.5% and 21.6% fewer items to successfully and accurately estimate the
standing of test-takers on its clinical personality scales. Additionally, the computer
adaptive version of the MMPI-2 was determined to be demonstrably valid and reliable in
spite of its reduced length (Forbey & Ben-Porath, 2007).
Personal computers are not the only platform on which computer adaptive tests have been
employed. Computer adaptive tests have also been used on smartphone mobile devices
(Triantafillou et al., 2008). Using an educational assessment converted to a computer
adaptive mobile test (CAT-MD), the authors reported a 22.7% increased test efficiency
while maintaining a robust error of measurement < .33 (Triantafillou et al., 2008). This
study did not even take into account the possible advantages of this testing format for
test distribution and reach.
The Rasch partial credit model has also been used to develop a computer adaptive
test of the Centre of Epidemiological Studies – Depression (CES-D) scale (Smits,
Cuijpers, & van Straten, 2011). The findings of this study indicated a 33% decrease in the
number of items used with the computer adaptive version of the CES-D with a rigorous
maximum error of measurement of .40 (Smits et al., 2011).
In South Africa, Hobson (2015) found that the computer adaptive version of the
TEIQue Self-Control subscale correlated highly with the non-adaptive full-form version
of the scale (.97) while using only about 10 of the 16 items to estimate person trait levels.
As the TEIQue measures trait-emotional intelligence its item structure is relatively
identical to trait-based personality measures and thus indicates that computer adaptive
personality tests may be used effectively in the South African environment.
In summary, these studies indicate that a computer adaptive test of non-ability
constructs (or attitudinal measures) can be considered equivalent to non-computer
adaptive tests while using fewer items, which are more relevant for individual test-takers.
Psychometric properties of the computer adaptive versions of these tests also appear not
to be compromised, with sufficient reliability and low measurement error reported.
With these studies in mind, researchers need to demonstrate that personality tests
can make use of a computer adaptive format while remaining rigorously equivalent to
their non-computer adaptive counterparts. Such studies will generate the necessary impetus
required for the use of computer adaptive personality tests in praxis. Another motivation
for the development of computer adaptive tests of personality is that these tests are
incredibly efficient and rigorously accurate thus improving measurement in a general and
practical sense.
In order to understand why computer adaptive tests tend to be more efficient and
accurate than their non-computer adaptive counterparts, the statistical theory used to
construct these tests must be examined. In the following section a brief
overview is given of classical test theory, used for the construction of non-computer
adaptive tests, and of item response theory, used for the construction of computer adaptive
tests, so that the advantages of item response theory for test construction and computer
adaptive testing applications can be explained.
2.3. Classical test theory and item response theory
There has been much criticism about the use of classical test theory especially in
the wake of the development of item response theory (Zickar & Broadfoot, 2009). Where
computer adaptive tests make use of item response theory, most non-computer adaptive
tests make use of classical test theory (Embretson & Reise, 2000; Gershon, 2004;
Macdonald & Paunonen, 2002; Weiss, 2004; Zickar & Broadfoot, 2009).
Item response theory’s genesis can be traced to Frederic Lord and Melvin
Novick’s Statistical Theories of Mental Test Scores (1968), and Georg Rasch’s seminal
work titled Probabilistic Models for Some Intelligence and Attainment Tests (1960).
These works established a new psychometric model for the development and
evaluation of tests (Embretson & Reise, 2000; Fisher, 2008b; Traub, 1997). Item response
theory and the family of Rasch Partial Credit models were the result of these seminal
contributions (Traub, 1997). These ‘new’ item response theory models challenged the
classical test theory canon and were referred to by Embretson and Hershberger (1999) as
the ‘new rules of measurement’.
These new rules of measurement hold numerous advantages for testing namely that
items can be matched appropriately to the ability/trait levels of test-takers; that fewer
items can be used while maintaining the inventory’s validity and reliability; that items
can be used independently from other items in the test; and that such items can be
independently implemented across groups of varying characteristics (Gershon, 2004).
Ultimately, these new rules of measurement also allow for adaptive testing, which the
rules of classical test theory cannot facilitate.
The reason why classical test theory cannot facilitate adaptive testing is because the
theory fundamentally differs from item response theory in both the evaluations and the
assumptions it holds. In particular, the way that classical test theory deals with
measurement error when compared to the item response theory is of special importance.
Traub (1997) explains that “Classical Test Theory is founded on the proposition that
measurement error, a random latent variable, is a component of the observed score
random variable” (p. 8). In other words, classical test theory is based on true-score theory
where the observed score of any number of persons, garnered on any number of items on
a scale, is equal to the true score with the addition of measurement error. Refer to the
equation below adapted from Osborne (2008, p. 3).
𝑋 = 𝑇 + 𝐸
Where:
X = the observed score for a scale;
T = the true score; and
E = the error associated with observation.
Put another way, the true score is equal to the observed score minus the error associated
with measurement (see below).
𝑇 = 𝑋 − 𝐸
True score theory has numerous consequences for the measurement and estimation
of constructs, as it assumes that measurement error shares no variance with the true
score, and that the error of a particular measurement instance is independent of the error
of any other measurement instance (Kline, 2005; Zickar & Broadfoot, 2009).
In other words, each test or scale has its own unique systematic error which is
different from the error of any other test/scale, even parallel forms of the same test or
scale (Fisher, 2008a). This error is calculated from the total test statistics and is considered
an average for all the items in a scale. This is unlike item response theory where the error
of measurement for each item is calculated and scrutinized individually (Kline, 2005).
The disadvantages of mean error estimation, and its limiting effect on the flexibility of
instruments based on classical test theory, are discussed next. In contrast, it is also shown
how item response theory can overcome some of the limitations imposed by classical test
theory assumptions.
2.3.1. The problem of mean error estimation in classical test theory
The way the error of measurement is estimated in classical test theory has
disadvantages for measurement because no test-scale measuring a particular construct can
be truly and objectively compared to any other test-scale measuring the same construct
with a different sample with alternate items, or alternate item-ordering (Gershon, 2004;
Traub, 1997). This is because each test-scale based on classical test theory has unique
systematic error, which makes true test-scale equivalence across different test-scales
impossible (de Ayala, 2009). Explained in another way, the error in one test-scale
influences its test-scores to some degree, including the overall mean measurement error,
and is not the same as the measurement error encountered in another test-scale, even
though both may be measuring the same construct (Osborne, 2008). This error of
measurement also differs for the same test-scale applied to different samples of test-takers
as each sample has its own unique test-characteristics (Kline, 2005). Therefore, if only a
cluster of items of a full test-scale is used, instead of the whole test-scale, then the mean
standard error changes as it is approximated for the specific test-scale with its unique set
of items as a whole (Osborne, 2008). Consequently, certain items in the test-scale, which
may vary regarding their individual error of measurement depending on the sample tested,
may affect the mean standard error in an overall manner. In this way no two forms of a
test-scale based on classical test theory can really be considered equivalent because the
mean standard error is different and unique for each test-scale employed with a particular
sample (Fisher, 2008a; Kersten & Kayes, 2011).
Where a mean error is generated across many items in classical test theory, item
response theory calculates the error of measurement associated with each item of a test-
scale individually (Kersten & Kayes, 2011). This is advantageous as there is usually a
greater error of measurement at the extremes of a trait distribution than near its center
(Harvey & Hammer, 1999; Wauters et al., 2010).
Unfortunately, it is assumed in classical test theory that the test scores for test-takers
who fall at the extremes of the person distribution, with respect to the latent construct
under investigation, have the same mean error of measurement as the results of test-takers
who fall in the central area of the distribution (Gershon, 2004). This is because the mean
error of measurement is an average that applies across all items of the test-scale in
classical test theory (Kline, 2005). This assumption is imprudent as each item in a test-
scale has a particular endorsability/difficulty that is more, or less, suited to test-takers
with different trait/ability levels. Matching items with a particular endorsability/difficulty
to the trait/ability level of test-takers reduces the error of measurement whereas the
opposite is true of items that poorly target trait/ability level (Sim & Rasiah, 2006).
Consequently, items with higher difficulty (or lower endorsability) are better suited
to test-takers at the top end of the ability/trait continuum, and easier items (or items with
a higher endorsability) are better suited to test-takers at the bottom end of the ability/trait
continuum (Sim & Rasiah, 2006). The better the match of the item difficulty/endorsability
to test-takers at a particular ability/trait level, the lower the error of measurement,
because the items approximate the test-taker’s relative ability/trait level (Sim & Rasiah,
2006).
Put another way, the more closely the difficulty/endorsability of the item is matched
to the relative ability/trait level of test-takers, the more precisely and accurately the
ability/trait is estimated for a particular test-taker (Hambleton & Jones, 1993). Traub
(1997) refers to the historical work of Eisenhart (1986) when he states that “…persons of
considerable note maintained that one single observation, taken with due care, was as
much to be relied on as the mean of a great number.” (p.8).
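The link between targeting and precision can be sketched under the Rasch model, where an item yields maximal Fisher information when its location equals the test-taker's trait level, and the standard error of the trait estimate is the inverse square root of the total information. The item locations below are hypothetical:

```python
import math

def rasch_info(theta: float, b: float) -> float:
    """Fisher information of a Rasch item at trait level theta:
    p * (1 - p), maximal when b equals theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def standard_error(theta: float, item_locations) -> float:
    """Standard error of the trait estimate:
    1 / sqrt(total information)."""
    total = sum(rasch_info(theta, b) for b in item_locations)
    return 1.0 / math.sqrt(total)

theta = 2.0  # a test-taker near the top of the trait continuum

# Five items targeted at the trait level versus five items
# centred on the population mean.
se_matched = standard_error(theta, [1.5, 1.75, 2.0, 2.25, 2.5])
se_central = standard_error(theta, [-0.5, -0.25, 0.0, 0.25, 0.5])
# se_matched < se_central: targeting items to the trait level
# yields a more precise estimate with the same number of items.
```

This is the mechanism behind the larger measurement error at the extremes of a trait distribution: fixed-form tests concentrate their items near the centre, so test-takers at the extremes receive poorly targeted items and, consequently, less information.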
Tests that are dependent on classical test theory introduce greater measurement
error by employing items that are not necessarily approximated to test-takers with
different ability/trait levels across the ability or trait continuum (Hambleton & Jones,
1993). To deal with such error more items are usually administered so that measurement
error and consequently scale reliabilities can be improved despite the increased burden
these extra items place on test-takers.
2.3.2. The impact of the number of test items administered
In classical test theory the more items are included in a test the more accurately and
reliably the test usually measures the latent construct under investigation (Harvey &
Hammer, 1999). This has resulted in instruments with hundreds of items in order to
improve the mean error of measurement and thus reliability (Pallant & Tennant, 2007).
Consequently, most instruments with good psychometric properties have many items to
boost reliability which increases the burden on test-takers and increases the time taken
for tests to be administered. As item response theory determines each item’s
difficulty/endorsability individually, only those items that are most relevant for
each test-taker are used (Hambleton & Jones, 1993). This greatly shortens the instrument
and results in a lower error of measurement and consequently acceptable reliability
without the need to have a large number of items administered (Georgiadou, Triantafillou,
& Economides, 2006). It is important to understand however, that items can only be
administered in a targeted manner (adaptive manner) if the items demonstrate local
independence and are invariant across diverse groups.
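The classical trade-off between test length and reliability described above is captured by the Spearman-Brown prophecy formula, which projects the reliability of a lengthened or shortened test composed of comparable items. The sketch below uses hypothetical values:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when a test is lengthened (or shortened)
    by `length_factor` with items comparable to the originals
    (Spearman-Brown prophecy formula)."""
    return (length_factor * reliability) / (
        1 + (length_factor - 1) * reliability
    )

# A hypothetical 20-item scale with reliability .70:
r_doubled = spearman_brown(0.70, 2.0)   # 40 items: rises to ~.82
r_halved = spearman_brown(0.70, 0.5)    # 10 items: falls to ~.54
```

The formula makes plain why classical instruments accumulate items: reliability can always be pushed up by lengthening the test, at the cost of an ever greater burden on test-takers, whereas item response theory improves precision by targeting items rather than multiplying them.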
2.3.3. Local independence, person-free items estimation, and item-free person
estimation
The way that the standard error of measurement is dealt with in classical test theory
has further repercussions beyond individual item error and reliability. A major
disadvantage of classical test theory is that test-scales have total scores which are
dependent on the test-scale itself (de Klerk, 2008). With reference to section 2.3.1, if an
error is made on one or more items in a test-scale this error is applicable to all the test-
takers in the norm group (de Klerk, 2008). Because the error of measurement is assumed
to be constant across test-scale items such items are not investigated for their
multicollinearity; therefore, certain test-scale items may be dependent on other test-scale
items for their measurement properties especially regarding their measurement error (de
Klerk, 2008). Test-scale items may therefore be locally dependent as opposed to being
locally independent. Local independence of items is required for adaptive testing to take
place and is discussed in the following section.
2.3.3.1. Local independence
Local independence is a fundamental requirement for test-scales based on item
response theory such as computer adaptive tests (Bond & Fox, 2007; Weiss, 2004). To
determine whether local independence is attained items are tested for their association
with one another (Monseur, Baye, Lafontaine, & Quittre, 2011). Only those items that are
related to one another due to a common latent construct, and not some other construct,
can be considered locally independent (Gershon, 2004). Items that are also dependent on
other items beyond the association they share with the latent construct under investigation
are considered confounding (Monseur et al., 2011). Therefore, each item in a test-scale
should measure the latent construct the test-scale claims to measure, and do so
independently from other items in the test-scale (Engelhard, 2013). This is done by
inspecting the correlations of the residuals of items (Dodd et al., 1995) or determining the
dimensionality of the test-scale (Engelhard, 2013).
If the correlations between the residuals – the error component of the test items –
are high, the items are not locally independent but are also dependent on other items in
the test-scale that measure some joint error component (Dodd et al., 1995). This
effectively makes items dependent on other items in the test-scale and also detracts from
measurement of the latent construct under investigation (Engelhard, 2013).
The Rasch rating scale model reports dimensionality and test-dependence through
fit statistics and the intercorrelation of item residuals (Weiss, 2013). The Rasch model
therefore specifies that items in a test-scale need to measure only the construct under
consideration (Engelhard, 2013). Therefore, the test-scale items must measure a single
dimension which can be determined through a principal components analysis of residuals,
or by hierarchically testing single-factor and multifactor models of hierarchical personality
scales, as is done in the present study (Engelhard, 2013). If local independence holds – that is, the
correlations between residuals are small and the items of a scale measure only one
underlying construct (a single factor) – then each item can be used independently from
the other items in the test-scale to approximate the latent trait (Kersten & Kayes, 2011).
This allows test-scales based on item response theory to be item-independent and
unidimensional, which greatly increases the flexibility of testing because individual items can
be used free of the constraints of other items in the test-scale (Kersten & Kayes, 2011).
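The principal components analysis of residuals mentioned above can be sketched along similar lines. In this hypothetical Python illustration, dichotomous responses are simulated from a unidimensional Rasch model, standardised residuals are computed from the known parameters (in practice, from the estimated parameters), and the largest eigenvalue of the residual correlation matrix is inspected; in Rasch analysis a first contrast well below roughly 2.0 eigenvalue units is commonly read as supporting unidimensionality:

```python
import numpy as np

rng = np.random.default_rng(2)
n_persons, n_items = 2000, 12
theta = rng.normal(size=n_persons)
beta = np.linspace(-2, 2, n_items)    # item difficulties/endorsabilities

# Simulate dichotomous responses from a unidimensional Rasch model
p = 1 / (1 + np.exp(-(theta[:, None] - beta[None, :])))
x = (rng.random((n_persons, n_items)) < p).astype(float)

# Standardised residuals: (observed - expected) / sqrt(binomial variance)
z = (x - p) / np.sqrt(p * (1 - p))

# Largest eigenvalue of the residual correlations (the 'first contrast')
first_contrast = np.linalg.eigvalsh(np.corrcoef(z.T))[-1]
print(round(first_contrast, 2))
```

Because the simulated data are unidimensional by construction, the first contrast stays close to 1, well below the conventional threshold.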
Although local independence frees test-scale items from one another, so that each item
can be used independently of the other items in a test-scale (Monseur et al., 2011), these
items are still sample dependent. The next section will
look more closely at how item response theory models can effectively free test-scale items
from the sample used to estimate their properties, and conversely how person trait
estimates can be freed from the sample of items employed.
2.3.3.2. Item-free person calibration and person-free item calibration
In classical test theory, items are inherently dependent on the sample used to
estimate their difficulty/endorsability and on the sample of items used to measure a particular
trait (de Klerk, 2008). Items based on classical test theory may therefore be test-dependent
and sample-dependent, which means that the properties of the construct
measured (the true score) depend on the nature of the items used and the nature of
the sample used to estimate item properties (Gershon, 2004). As Schmidt and Embretson
(2003) explain, test-dependence is when the trait level of persons is biased by the
characteristics of the items and sample dependence is when the nature of the items is
biased by the sample used to estimate the items’ properties.
A major repercussion of test-dependence and sample-dependence is that the same
instrument, administered in the same manner, must be given to test-takers who
approximate the sample on which the test items’ properties are based (Embretson &
Hershberger, 1999; Schmidt & Embretson, 2003). Using only certain items of an
instrument and comparing results from these items across test-takers with the full
instrument or scale (which is based on a particular norm group) is considered poor
practice, as the psychometric properties of each instrument or scale differ and cannot
strictly be compared (Smith & Smith, 2004).
Macdonald and Paunonen (2002) give an example with ability testing where the
relative ability of test-takers is dependent on whether the items are easy or difficult, and
the relative difficulty of the items is dependent on the ability of the test-takers. The authors
compare this to tests based on item response theory where each item is calibrated for
relative difficulty with a particular sample and should maintain this difficulty with
persons of varying abilities. Similarly, person ability should remain the same no matter
the sample of items used. Additionally, the ability/trait level measured by the items should
remain invariant even when the items are not used in conjunction with the other items in the test.
In the Rasch family of item response theory models item difficulties/endorsabilities
are estimated free from person ability and vice versa (Kersten & Kayes, 2011). Although
each item’s relative difficulty/endorsability is determined through calibration with a
particular group of test-takers, and person ability/trait level is calibrated through the
application of a certain item-bank, the Rasch model estimates these parameters (item
difficulty/endorsability and person ability/trait level) without being dependent on the
former or the latter (Bond & Fox, 2007). Item-free person calibration is represented
mathematically in the following equations, adapted from Schmidt and Embretson (2003):
\[
\ln\frac{P(X_{i1})}{1 - P(X_{i1})} = \theta_1 - \beta_i
\qquad\text{and}\qquad
\ln\frac{P(X_{i2})}{1 - P(X_{i2})} = \theta_2 - \beta_i
\]
Where θ1 and θ2 are the ability/trait level scores of test-taker 1 and test-taker 2
respectively; βi is the difficulty/endorsability of the item; and the left-hand sides of the
equations represent the natural log-odds of the item responses of test-taker 1 (Xi1) and
test-taker 2 (Xi2) respectively. Taking the difference between the two equations, θ1 − θ2,
effectively drops the item from the comparison, resulting in item-free person estimation.
Similarly, person-free item estimation can be explained by the following equations
(adapted from Schmidt & Embretson, 2003):
\[
\ln\frac{P(X_{1s})}{1 - P(X_{1s})} = \theta_s - \beta_1
\qquad\text{and}\qquad
\ln\frac{P(X_{2s})}{1 - P(X_{2s})} = \theta_s - \beta_2
\]
Where θs is the ability/trait level of person s; β1 is the difficulty/endorsability of item 1;
β2 is the difficulty/endorsability of item 2; and the left-hand sides of the equations
represent the natural log-odds of the item responses of test-taker s on item 1 (X1s) and
item 2 (X2s) respectively. Taking the difference between the two equations, β1 − β2,
effectively drops the person from the comparison, resulting in person-free item estimation.
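These two algebraic results can be verified numerically. The brief Python sketch below (with arbitrary, hypothetical parameter values) confirms that the log-odds difference between two test-takers is constant across any set of items, and that the difference between two items is constant across any set of test-takers:

```python
import numpy as np

def log_odds(theta, beta):
    """Natural log-odds of endorsement under the Rasch model."""
    return theta - beta  # ln[P/(1 - P)] = theta - beta

# Item-free person comparison: the item drops out of the difference
theta1, theta2 = 1.3, -0.4
betas = np.linspace(-3, 3, 7)         # any set of item difficulties
person_diffs = log_odds(theta1, betas) - log_odds(theta2, betas)

# Person-free item comparison: the person drops out of the difference
beta1, beta2 = -0.8, 1.1
thetas = np.linspace(-3, 3, 7)        # any set of person trait levels
item_diffs = log_odds(thetas, beta2) - log_odds(thetas, beta1)

print(np.allclose(person_diffs, theta1 - theta2),
      np.allclose(item_diffs, beta1 - beta2))
```

Both differences are constant regardless of which items or persons are involved, which is precisely the separability property the equations above express.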
Item difficulties/endorsabilities are thus estimated free from the
distribution of persons on whom the items have been calibrated and vice versa, which is
a necessary requirement for measurement invariance (Bond & Fox, 2007).
Thus, the assumption is made that if an item is similar in its difficulty/endorsability
for a number of test-takers, it will most likely maintain this difficulty/endorsability level
with a different group of test-takers; conversely, a group of test-takers will have the
same relative difference in ability/trait level no matter the difficulty/endorsability of the
test items provided (Kersten & Kayes, 2011). This effectively frees the test from its
dependency on a norm group and allows singular items to be used to discriminate between
different groups of persons with varying abilities.
Item response theory thus determines the item and person properties individually
and in the case of the Rasch family of measurement models does so by ordering the
relative difficulties/endorsabilities of items in conjunction with one parameter: the
ability/trait levels of persons (Bond, 2003).
It is this characteristic of Rasch models (person-free item estimation and item-free
person estimation) that allows alternate forms of the test to be generated and compared
and thus for items to be used independently from the whole test, or the persons used to
estimate the properties of the test items (Embretson & Hershberger, 1999). What is
important in item response theory is that the location of an item, and person, on the latent
trait (either more able, or less able, or more difficult, or less difficult) remains the same
(invariant) no matter the sample of persons or items used (Engelhard, 2013). What is also
evaluated in item response theory is whether the individual items function independently
from all the other items in the given instrument or scale (de Klerk, 2008).
2.3.4. Measurement invariance
Local independence, person-free item estimation, and item-free person estimation
allow for measurement invariance. Rasch (1961) defined measurement invariance as a
fundamental requirement of testing where the:
“…comparison between two stimuli should be independent of which particular
individuals were instrumental for the comparison; and it should also be independent of
which other stimuli within the considered class were or might also have been compared.
Symmetrically, a comparison between two individuals should be independent of which
particular stimuli within the class considered were instrumental for the comparison; and
it should also be independent of which other individuals were also compared, on the same
or some other occasion.” (p. 332).
In other words, measurement invariance allows for test-items (stimuli as Rasch
refers to them) to be used free from other test-items; their relative difficulty to be
independent of the particular sample of persons used to estimate the item
difficulty/endorsability and vice versa (Engelhard, 2013). This greatly improves the
flexibility of item and test administration and allows items to be used independently and
thus adaptively.
If measurement invariance is established, it frees tests from the constraints imposed
by classical test theory. With these constraints lifted, (1) tests can be given to test-takers
without constant referral to a particular norm group; (2) test items can be administered
independently from the whole test; (3) items can be targeted to persons with specific
estimated ability/trait levels; (4) fewer items can be used, because items irrelevant to
specific test-takers are excluded; and (5) because items are targeted to test-takers,
measurement error is lower, reducing the number of items administered without
negatively affecting reliability. Freedom from these constraints allows tests based on item
response theory models to be used adaptively and thus allows for exploitation of the
advantages of computer adaptive testing, which we expand upon in the next section.
2.4. The advantages of computer adaptive testing
The foremost advantage of using computer adaptive tests in psychometric testing
is their reliance on item response theory, which simultaneously improves the utility and
the rigour with which tests are developed and evaluated (Thompson & Weiss, 2011). In
the previous section some of these advantages were discussed and the drawbacks of
classical test theory presented. A key advantage of item response theory is that it
overcomes many of the psychometric weaknesses, constraints on item administration, and
shortcomings of classical test theory (Embretson & Hershberger, 1999; Embretson &
Reise, 2000; Linacre, 2000; Thompson & Weiss, 2011; Wang & Kolen, 2001). However,
item response theory also allows for computer adaptive testing, which in itself holds many
practical advantages for testing.
Chien et al. (2009), de Ayala (2009), Forbey and Ben-Porath (2007), Linacre
(2000), and Weiss (2004) discuss some of these practical advantages which include (1)
increased relevance for test-takers; (2) reduction of testing time; (3) reduction of the
burden of assessment; (4) immediate test feedback; (5) assessment in multiple settings;
(6) greater test security; (7) invariant measurement; (8) error estimates on the item-level;
and (9) the general advancement of testing. These practical advantages are elaborated
upon in the forthcoming sections.
2.4.1. Increased relevance for test-takers
Firstly, items of a particular ability (or trait level) are matched to the ability or trait
level of test-takers in computer adaptive testing (Linacre, 2000). Matching an item’s
difficulty or trait level to those of the test-taker allows only the most relevant items, for
particular test takers, to be used in the testing process (Hol et al., 2005). This motivates
and engages test-takers and avoids exposing them to items that are irrelevant to their
ability/trait level. In full-form non-adaptive tests items of varying difficulty/endorsability
are given to every test-taker even if these test-takers have varying ability/trait levels that
do not match such items. Linacre (2002a) describes the administration of items of varying
difficulty/endorsabilities to test-takers with different abilities/trait levels as akin to ‘flow’
which is a hyper-engaged state.
Linacre (2002a) explains that challenges and skills have to be in balance for test-
takers to experience flow in the testing process. For this to happen, the relative skills of
test-takers (ability/trait levels) must be matched to the challenges presented (item
difficulty/endorsability), which sustains engagement for the test-taker. In computer adaptive tests
each item’s relative difficulty/endorsability is iteratively matched to approximate the
ability/trait level of the test-taker, which increases the relevance and engagement of the
test for the test-taker (Eggen & Verschoor, 2006). This also reduces the likelihood of
random responding by disengaged test-takers who may feel that items are irrelevant to
them.
2.4.2. Reduction of testing time
Testing time is shortened dramatically in computer adaptive testing, with numerous
authors reporting at least a 50% reduction in the number of items used (Eggen &
Verschoor, 2006; Forbey, Handel, & Ben-Porath, 2000; Frey & Seitz, 2009; Stark et al.,
2012).
2012). This is because only enough items are administered to measure the particular
ability or trait level in question within the parameters of an acceptable level of precision
(Weiss, 2004).
The number of items administered to a test-taker is dependent on the standard error
with which the test-taker’s ability/trait level is estimated (Weiss, 2004). In most cases a
standard error of .25 or less is considered acceptable, but this depends on the context of
the test (Weiss, 2004; Thompson & Weiss, 2011).
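This stopping rule can be sketched as a simple simulation. The Python illustration below is hypothetical throughout (the item bank, the normal prior, and the grid-based expected a posteriori estimator are illustrative choices, not the method of any particular operational system): it repeatedly administers the available item closest to the current trait estimate and stops once the standard error falls to .25 or below:

```python
import numpy as np

rng = np.random.default_rng(3)
bank = np.linspace(-3, 3, 121)        # hypothetical calibrated item bank
true_theta = 0.7
grid = np.linspace(-4, 4, 161)        # quadrature grid for trait estimation
prior = np.exp(-grid**2 / 2)          # standard normal prior

def prob(theta, beta):
    """Rasch probability of endorsing an item with difficulty beta."""
    return 1 / (1 + np.exp(-(theta - beta)))

posterior = prior.copy()
available = list(range(len(bank)))
administered, se = 0, np.inf
while se > 0.25 and available:        # stop once the standard error <= .25
    mean = np.sum(grid * posterior) / np.sum(posterior)
    # Administer the available item closest to the current estimate
    i = min(available, key=lambda j: abs(bank[j] - mean))
    available.remove(i)
    endorsed = rng.random() < prob(true_theta, bank[i])
    posterior = posterior * (prob(grid, bank[i]) if endorsed
                             else 1 - prob(grid, bank[i]))
    administered += 1
    mean = np.sum(grid * posterior) / np.sum(posterior)
    se = np.sqrt(np.sum((grid - mean)**2 * posterior) / np.sum(posterior))

print(administered, round(mean, 2), round(se, 2))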
Of course, more items are administered to test-takers at the extremes of the
ability/trait in question, because fewer items exist that closely approximate the test-taker’s
ability/trait level at the upper and lower extremes, which increases the standard error
with which the test-taker’s ability/trait level is estimated (Weiss, 2011). More items are
used in such extreme cases to estimate the ability/trait level more precisely, thus helping
to offset larger measurement error (Stark et al., 2012; Weiss, 2011). Fortunately, most
test-takers approximate the average levels of a normal distribution on the latent construct
under investigation, and most items will
therefore be matched appropriately to most individual test-takers thus reducing the
number of items required to estimate ability/trait level with an acceptable amount of
measurement error.
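The larger standard errors at the extremes follow directly from the Rasch test information function. In the hypothetical Python sketch below (the bank and trait values are invented), a bank whose difficulties cluster around the average yields a much larger standard error for an extreme trait level than for an average one:

```python
import numpy as np

bank = np.linspace(-2, 2, 30)         # bank concentrated around average levels

def rasch_se(theta):
    """Standard error of measurement from the Rasch test information."""
    p = 1 / (1 + np.exp(-(theta - bank)))
    information = np.sum(p * (1 - p))
    return 1 / np.sqrt(information)

se_centre, se_extreme = rasch_se(0.0), rasch_se(3.5)
print(round(se_centre, 2), round(se_extreme, 2))
```

Because each item contributes its maximum information when its difficulty matches the trait level, a test-taker far from the bulk of the bank receives little information per item, and more items are needed to reach the same precision.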
Shortening the test substantially has major practical advantages for clinical testing
where test-takers are often distressed and barely capable of completing an inventory.
Personality testing also stands to benefit from increased efficiency, especially in
industrial/organisational settings where test administration and prolonged testing time
are costly.
2.4.3. Reducing the burden of testing
The burden of testing is also reduced in computerised adaptive testing (Chien et
al., 2009). Most personality tests have hundreds of items, which puts strain on the test-
taker (Chien et al., 2009). Shorter, more relevant instruments reduce the strain imposed
on test-takers. This is particularly relevant for test-takers who are exposed to multiple
tests for selection, placement, development, or clinical evaluation, or for whom the
test is not in their first language (Stark et al., 2012).
2.4.4. Immediate feedback after testing
Test-takers and test-administrators are given immediate feedback on results
directly after computer adaptive testing has taken place (Forbey & Ben-Porath, 2007).
With computer adaptive testing, the test-taker’s ability/trait level is estimated
continuously during the computer adaptive testing process. In this regard test-feedback
can be given directly after test-administration which holds advantages for test-takers such
as reduced anxiety (DiBattista & Gosse, 2006), improved learning (Attali & Powers,
2008), and an improved test experience (Betz & Weiss, 1976). However, computer-based
tests based on classical test theory now also offer immediate feedback after testing,
which means this advantage is not unique to computer adaptive tests.
2.4.5. Testing is not limited to one setting
Computer-based testing is not limited to a single test-administration period and can be
done over the internet (Linacre, 2000). Therefore, anyone with internet connectivity can
complete the test at any time and receive feedback (de Ayala, 2009). This greatly
increases the ‘reach’ and convenience of testing. Again, however, this advantage is not
unique to computer adaptive testing; it is also exploited in some computer-based tests
based on classical test theory.
2.4.6. Greater test security
As the test is computer-based and the items used are tailored to the individual test-
taker, the test is not as easily invalidated by numerous testing instances (Weiss, 2004).
This is because computer adaptive tests provide numerous so-called ‘parallel forms’ of
the test for each test-taker (Ortner, 2008). As each test-taker answers test items in a
different manner, the items presented differ for each test-taker: the pattern of
correct/incorrect or endorsed responses determines which items, at which
difficulty/endorsability levels, are presented next. It is therefore very
difficult for test-takers to anticipate test items as each test administration is unique. The
test-items are also not freely available to the test-taker as they are electronically stored in
an item-bank and not in paper/pencil format. These factors greatly increase the security
of the test and prevent the items of an instrument from falling into the hands of test-takers.
2.4.7. Invariant measurement
All the items in a pre-calibrated item bank based on item response theory are calibrated on
the same scale of measurement (Weiss, 2004). Therefore, these items can be used
independently from the entire item-bank to formulate multiple tests that are not test-
dependent (de Ayala, 2009). Also, these items are estimated free of the test-takers on
which the test has been calibrated (Bond & Fox, 2007). Therefore, the test is applicable
to more than just the test norm group and only requires further calibration with test-takers
in the future to remain psychometrically viable.
2.4.8. Error estimates for each item
A major advantage of item response theory is that each item’s location estimate –
on the latent trait – can be established with a certain standard error (Weiss, 2004).
Consequently, items can be identified that evidence the lowest possible standard error of
location estimates and these can be preferentially administered to test-takers within the
computer adaptive testing framework.
Administering items that have precise location estimates is advantageous as these
items also improve the precision with which person locations are estimated (Thompson
& Weiss, 2011).
2.4.9. Advancement of psychometric testing in general
As computer-based testing grows quickly and more advanced test theory is
applied to testing, developing instruments based on item response theory advances
the field of psychometric testing. There is also a strong need for psychological tests to
move into the digital realm. With the many advantages of computer-based testing (Chien
et al., 2009) and a desire for shorter, more relevant instruments (Gershon, 2004; Haley et
al., 2006), computer adaptive testing has become a necessary addition to psychometric
testing.
With the practical and methodological advantages of computer adaptive testing, the
development of computer adaptive tests of personality may hold much promise. This is
especially true for personality testing, which has become prolific in job-selection, job-
placement, and career development in the last 23 years (Higgins, Peterson, Lee, & Pihl,
2007), and yet has not shared the same progress as other constructs in the computer
adaptive arena (Simms et al., 2011). However, before computer adaptive tests of
personality can be developed, tests have to pass strict evaluative requirements. We discuss
some of these requirements in the next section.
2.5. The requirements for the development of computer adaptive tests
Weiss (2004) and Hol et al. (2008) list the evaluative requirements an inventory of
items must undergo when being prepared for computer adaptive testing. These
requirements include (a) establishing the dimensionality of a set of items for particular
scales (i.e., ensuring that each scale is unidimensional); (b) calibrating items for each
scale using an item response theory framework or model; and (c) simulating
computer adaptive testing to establish equivalence with the full-form non-adaptive test.
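The simulation step in (c) is typically conducted post hoc: test-takers’ recorded responses to the full form are re-used, items are selected adaptively from those responses, and the resulting trait estimates are compared with the full-form estimates. The Python sketch below illustrates the idea under wholly hypothetical conditions (a Rasch item bank, expected a posteriori estimation on a grid, and a fixed-length ten-item adaptive stage):

```python
import numpy as np

rng = np.random.default_rng(4)
bank = np.linspace(-3, 3, 60)         # hypothetical full-form item bank
grid = np.linspace(-4, 4, 81)
prior = np.exp(-grid**2 / 2)

def prob(theta, beta):
    return 1 / (1 + np.exp(-(np.asarray(theta)[..., None] - beta)))

def eap(responses, betas):
    """EAP trait estimate from responses to items with difficulties betas."""
    like = np.prod(np.where(responses, prob(grid, betas),
                            1 - prob(grid, betas)), axis=-1)
    post = prior * like
    return np.sum(grid * post) / np.sum(post)

thetas = rng.normal(size=300)
x = rng.random((300, 60)) < prob(thetas, bank)   # full-form responses

full, cat = [], []
for person in range(300):
    full.append(eap(x[person], bank))
    # Post-hoc CAT: adaptively re-use 10 of the recorded responses
    chosen, est = [], 0.0
    for _ in range(10):
        i = min(set(range(60)) - set(chosen),
                key=lambda j: abs(bank[j] - est))
        chosen.append(i)
        est = eap(x[person, chosen], bank[chosen])
    cat.append(est)

r = np.corrcoef(full, cat)[0, 1]
print(round(r, 2))
```

A high correlation between the adaptive and full-form estimates, together with acceptable standard errors, is the kind of evidence such simulations provide for equivalence.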
If items of a scale are unidimensional, fit an item response theory model, and are
shown to be equivalent to the non-computer adaptive test when simulated in a computer
adaptive framework, the test-scale items can be considered ready for real-time computer
adaptive testing (Weiss, 2004). The more practical steps of writing computer adaptive
algorithms; initiating real-time testing with test-takers; and then evaluating the real-time
computer adaptive test can then be undertaken.
However, the evaluation of the test’s scales for unidimensionality, item response
theory model fit, and equivalence through computer adaptive simulation remains the most
important element in the evaluation and development of a computer adaptive test,
because without these fundamental evaluations the test may not function with the
necessary accuracy, precision, efficiency, or reliability in its computer adaptive format.
Also, as mentioned earlier, most studies have focused on the feasibility of developing
computer adaptive tests of personality without fundamentally comparing the validity
of the computer adaptive versions of the tests with their non-computer adaptive
counterparts. This study aims to address this shortcoming in the research.
The development of a computer adaptive test of personality therefore needs to be
justified through extensive analysis, evaluation and testing (de Ayala, 2009; Meijer &
Nering, 1999; Thompson & Weiss, 2011).
2.6. Preview of the contents of the following chapters
The present study has three overarching objectives. Firstly, the dimensionality of
the BTI scales was investigated, because unidimensionality of test scales is important for
fit to item response theory models and consequently for computer adaptive applications
(refer to Chapter 3). Therefore, the scales of the BTI were fitted to single-factor,
multifactor, and bifactor models to determine to what extent a general factor underlies
the various facets of the BTI.
The second objective was to fit the BTI scales to the Rasch rating scale model,
which is an item response theory model (refer to Chapter 4). The fit of the items of the
BTI scales was evaluated and poorly fitting items were removed to prepare the BTI for
computer adaptive application. Aspects investigated include item and person fit to the
Rasch model, as well as item bias through differential item functioning analysis for ethnic
and gender groups. The best fitting items were selected for use and comparison within a
computer adaptive testing framework.
The last objective was to simulate the BTI as a computer adaptive test and compare
the functioning of the computer adaptive version of the BTI with its non-computer
adaptive counterpart (refer to Chapter 5). Aspects investigated include how closely the
person parameter estimates of the computer adaptive BTI scales approximated those of
the non-adaptive full-form test, item efficiency, and the standard error of person
parameter estimation. These evaluations were conducted to determine the feasibility
of the computer adaptive version of the BTI scales.
Chapter 6 integrates the findings of Chapters 3, 4, and 5 and provides a more
conceptual discussion of the computer adaptive testing of personality. This chapter also
integrates limitations of the studies, implications for practice, and recommendations for
future research.
CHAPTER 3: THE DIMENSIONALITY OF THE BTI SCALES
“It is interesting to ask, to what degree do these domains emerge as multidimensional, and in
turn, does this multidimensionality interfere with our ability to fit an IRT model and scale
individuals on a common dimension?” – Steven P. Reise (2011, pp.83-84).
3.1. Introduction
A fundamental prescription of good psychological measurement is that a psychometric
scale should be unidimensional (i.e., the scale should measure only one attribute), because it
facilitates clear and unambiguous interpretation and allows for total score interpretation of
measurement scales. This is also a fundamental prescription for item response theory models
such as the Rasch rating scale model, where the unidimensionality of measurement scales
needs to be demonstrated or item and ability/trait parameterization may be biased (Yu,
Popp, DiGangi, & Jannasch-Pennell, 2007).
Since computer adaptive tests are heavily reliant on item and ability/trait parameters to
effectively select and administer items to test-takers; unreliable or biased parameters will
severely undermine the testing process (Thissen, Reeve, Bjorner, & Chang, 2007; Wise &
Kingsbury, 2000). As computer adaptive tests select items based on their relative
difficulty/endorsability and use the responses to these items to estimate test-takers’ standing
on a single latent construct (Eggen & Verschoor, 2006), the items need to measure such a
construct as exclusively as possible for accurate and precise ability/trait level estimation
(Weiss, 2004).
Although fit to an item response theory model is necessary to identify items that do not
meet the assumptions of the model, good fit to such a model is a necessary but not
sufficient condition for the evaluation of dimensionality (Wise & Kingsbury, 2000). Other
techniques based in classical test theory and common factor theory, such as exploratory and
confirmatory factor analysis, are often used in conjunction with item response theory models
to analyse and explore the underlying structure of inventories (ten Holt, van Duijn & Boomsma,
2010; Li, Jiao, & Lissitz, 2012; Reise, Widaman, & Pugh, 1993).
Although factor analytic techniques are useful for investigating the
dimensionality of personality scales, and complement item response theory, the
unidimensionality of personality scales is often disputed. The structure of hierarchical
personality scales is therefore discussed in the next section.
3.1.1. Testing the dimensionality of hierarchical personality scales
Although unidimensionality is an important and necessary requirement for good
measurement when using classical test theory and one-dimensional item response theory
models; personality scales are not wholly unidimensional in nature (Clifton, 2014). For
instance, personality psychologists routinely use omnibus personality inventories that contain
hierarchical scales, such as the Hogan Personality Inventory (HPI; Hogan & Hogan, 2007) and
the NEO Personality Inventory Revised (NEO-PI-R; McCrae & Costa, 2010). These
inventories measure a small number of broad personality traits (five for the NEO-PI-R, and
seven for the HPI) at the total score level (Costa & McCrae, 1995; Hogan & Hogan, 2007), and
a larger number of narrower traits at the subscale level. Operationally, a broad trait such as
Extraversion is represented by a total score that is the sum of the nested subscale scores,
each of which measures a narrow aspect of Extraversion. This structure holds the promise of a two-tiered
interpretation, where a broad and general description is obtained at the total score level, and
more specific descriptions are obtained at the subscale score level (cf. Paunonen, 1998).
Therefore, conventional factor analysis of the items of a well-constructed hierarchical
personality scale (e.g., the Extraversion scale of the NEO-PI-R) is likely to yield multiple
correlated factors that correspond with the subscales (i.e., a multidimensional solution), which
raises questions about the unidimensionality of personality scales. This has consequences for
the unidimensionality assumption of one-dimensional item response theory models, especially in
the wake of the development of multidimensional item response theory, which
circumvents the widely held unidimensionality assumption of such models (Segall, 1996;
Wang, Chen, & Cheng, 2004).
Multidimensional item response theory models assume that numerous constructs or
continua underlie a particular scale or item. Since these models assume multidimensionality of
scales or items, the model uses a vector across multiple dimensions to estimate the trait level
for persons on a particular item or set of items (Reckase, 2009). In this way total scores for a
particular set of constructs can be calculated across a number of multidimensional criteria and
subscale scores can also be estimated based on the linking and scaling of multiple dimensions
(Reckase, 2009). Since ability/trait level estimation is based on multiple dimensions these
estimates can be used in multidimensional computer adaptive testing frameworks (Haley et al.,
2006). In contrast, one-dimensional item response theory models provide a single trait estimate
for a single construct, which are used in standard computer adaptive testing frameworks (Li et
al., 2012).
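The contrast can be made concrete with a compensatory multidimensional model, one common formulation in which an item’s endorsement probability depends on a weighted combination of several latent traits rather than a single one. The following Python sketch uses invented discriminations and an invented intercept purely for illustration:

```python
import numpy as np

def mirt_prob(theta, a, d):
    """Compensatory multidimensional IRT: P = logistic(a . theta + d)."""
    return 1 / (1 + np.exp(-(np.dot(a, theta) + d)))

a = np.array([1.2, 0.8])              # discriminations on two dimensions
d = -0.5                              # item intercept

# Two hypothetical trait vectors with different profiles across dimensions
low_first = mirt_prob(np.array([-1.0, 1.5]), a, d)
high_first = mirt_prob(np.array([1.5, -1.0]), a, d)
print(round(low_first, 2), round(high_first, 2))
```

Because the model is compensatory, a high standing on one dimension can offset a low standing on another, which is exactly why trait estimation proceeds along a vector of dimensions rather than a single continuum.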
Therefore, multidimensional item response theory models may seem more applicable
to hierarchical personality inventories because these instruments have a possibly
multidimensional structure. Unfortunately, multidimensional item response theory models are
overly complex (violate parsimony); struggle to deal effectively with polytomous items, which
are predominantly used in personality inventories; have item parameters that are estimated with
varying degrees of stability (the stability of these models has not been demonstrated); have fewer
tests for model fit; and have complicated total score interpretations (Thissen et al., 2007).
It may therefore be simpler and more prudent to initially investigate personality scales
using a one-dimensional item response theory model in conjunction with factor analysis (Li et
al., 2012). Although such models limit the tests in the sense that each scale of the test must be
investigated separately and administered with its own parameter estimates and algorithms, the
parsimony established by one-dimensional item response theory models cannot be denied
(Thissen et al., 2007). Such models are also limited by their strict unidimensionality
assumptions (assumed in one-dimensional item response theory models and imposed by the
researcher in confirmatory factor analytic models).
As the unidimensional structure of hierarchical personality scales is a matter of
contention, the goal of researchers investigating such scales should not be to establish
unidimensionality in the purest sense, but rather to determine whether such scales are
‘unidimensional enough’ for unbiased parameter estimation and measurement within one-
dimensional item response theory frameworks (Reise et al., 2011; Reise, Moore, &
Maydeu-Olivares, 2011). If scales are shown to evidence unidimensionality to a dominant degree,
it may allow for the application of one-dimensional item response theory models required for
computer adaptive testing.
To determine whether scales are ‘unidimensional enough’ researchers have to
determine whether a strong and reliable general factor dominates responses to the items that
constitute a total score for a particular personality scale (McDonald, 1999; Reise, Bonifay &
Haviland, 2013; Zinbarg, Revelle, Yovel, & Li, 2005). In essence, the second order factor
structure (general factor), which is represented by a total score on a scale, needs to be compared
to the first order factor structure (group factors) represented by the subscales used. To evaluate
the first and second order structure of the scales of hierarchical personality inventories
psychologists can apply hierarchical factor analytic techniques such as a bifactor analysis
(Holzinger & Swineford, 1939). This technique, and its application, is described in the
following section.
3.1.2. Evaluation of the dimensionality of hierarchical personality scales
The bifactor model specifies that a set of manifest variables (e.g., the items of a
personality scale such as Extraversion) is influenced by (a) a general factor that influences each
of the manifest variables, and (b) two or more group factors that each influences only a subset
of the manifest variables (Reise et al., 2013). Typically, the factors of a bifactor model are
specified as orthogonal because the general factor absorbs all the variance that is common to
all the manifest variables (Reise et al., 2010; Reise, Morizot, & Hays, 2007). Operationally,
the general factor corresponds with the total score across all the items and the group factors
correspond with the subscale scores. The group factors can be thought of as residualised
factors. In this sense the group factors, which are represented by subscales, may indicate
whether the subscales reliably measure anything other than the general factor (Reise et al.,
2013).
Consider a hypothetical nine-item scale with a hierarchical structure, where summation
across the nine items yields a total score, and items 1 to 3 constitute Subscale A, items 4 to 6
constitute Subscale B, and items 7 to 9 constitute Subscale C. Three competing factor analytic
models may be specified for the scale, namely a one-factor model (Model 1), a correlated three-
factor model (Model 2), and a bifactor model with a general factor and three group factors
(Model 3). The one-factor model specifies that the nine items measure a single trait, which
implies that the total score should be interpreted (see Figure 1a). The three-factor model
specifies that the nine items measure three separate (but correlated) traits, which implies that
three separate scores should be interpreted. In both these models each item is influenced by
one factor only. The bifactor model, however, specifies that each item is influenced by two
factors, namely a general factor that is common to all the items, and a group factor that is
common only to a subset of the items. Because the factors of the bifactor model are specified
as uncorrelated, Model 3 allows for a comparison of the degree to which the variance of an
item is explained by the general factor as opposed to the group factors (Reise et al., 2013).
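Because the general and group factors in Model 3 are orthogonal, the share of an item’s variance attributable to each factor can be read directly from the squared standardised loadings. The following is a minimal sketch of this variance partitioning, using hypothetical loadings rather than BTI estimates:

```python
# Sketch of how an orthogonal bifactor model partitions the variance of a
# single item. The loadings are hypothetical illustrative values.

def partition_item_variance(general_loading, group_loading):
    """Return (general, group, residual) variance shares for one item.

    Because the general and group factors are specified as orthogonal,
    the explained variance is simply the sum of the squared loadings.
    """
    general = general_loading ** 2
    group = group_loading ** 2
    residual = 1.0 - general - group
    return general, group, residual

# An item loading .70 on the general factor and .30 on its group factor
# draws most of its explained variance from the general factor:
g, s, e = partition_item_variance(0.70, 0.30)
print(round(g, 2), round(s, 2), round(e, 2))  # 0.49 0.09 0.42
```

This decomposition is what makes the bifactor comparison possible: in the one-factor and correlated three-factor models each item has a single loading, so general and group variance cannot be separated.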
A bifactor analysis that yields a very strong general factor and trivial group factors
evidences unidimensionality at the scale level. In turn, a weak general factor and strong group
factors indicate multidimensionality at the scale level. Finally, a strong general factor and non-
trivial group factors indicate that interpretation at the scale level is warranted with the joint
possibility of interpretation at the subscale level, in the sense that the general factor captures
reliable trait variance that is common to all the items and the subscales capture reliable trait
variance that is not attributed to the general factor (Reise et al., 2013). Such a model would be
more realistic for hierarchical personality inventories and would be capable of fitting one-
dimensional item response theory models required for computer adaptive testing.
The choice between these three factor models is based on their relative fit with
empirical data (Bollen, 1989). In accord with general scientific principles of parsimonious
description, Model 1 is the most desirable, followed by Model 2, and then by Model 3. In
practice, however, it may turn out that the simpler models do not give a satisfactory account of
the data. Good fit for Model 1 will be achieved if the scale measures one factor exclusively.
Model 2 will fit better than Model 1 if the scale measures three separate factors. In turn, Model
3 will fit better than Model 2 if the scale measures a general factor and three separate group
factors simultaneously.
Since the objective of the study is to prepare the BTI, a hierarchical five-factor model
of personality, for computer adaptive testing, the application of a bifactor model to the BTI
scales is warranted. This is primarily because the dimensionality of scales must be confirmed
before application to a one-dimensional item response theory model can take place. An
understanding of the theoretical structure and psychometric properties of the BTI is therefore
necessary. Consequently, the following section will briefly describe the BTI regarding its
hierarchical structure and psychometric properties.
3.1.3. The Basic Traits Inventory (BTI)
The BTI is a hierarchical personality inventory that measures the Big Five personality
traits namely Extraversion, Neuroticism, Conscientiousness, Openness to Experience, and
Agreeableness (Taylor & de Bruin, 2006). Each scale consists of four or five subscales, which
in turn consist of between six and ten items.
The structure of the BTI scales and subscales is as follows: Extraversion (36 items) –
Ascendance (7 items), Liveliness (8 items), Positive Affect (6 items), Gregariousness (7 items),
Excitement Seeking (8 items); Neuroticism (34 items) – Affective Instability (8 items),
Depression (9 items), Self-consciousness (9 items), Anxiety (8 items); Conscientiousness (41
items) – Effort (8 items), Order (10 items), Duty (9 items), Prudence (6 items), Self-discipline
(8 items); Openness (32 items) – Aesthetic (7 items), Ideas (6 items), Actions (7 items), Values
(6 items), Imaginative (6 items); and Agreeableness (37 items) – Straightforwardness (7 items),
Compliance (8 items), Prosocial Tendencies (8 items), Modesty (7 items), Tendermindedness
(7 items) (Taylor & de Bruin, 2013). The subscales were selected on the basis of Big Five
theory and a comprehensive review of published empirical literature. In particular, previous
factor analytic studies were scrutinised to identify the subscales that best represent each of the
five broad traits. In addition, subscales that saliently loaded only their targeted factors and
demonstrated small loadings on non-targeted factors were selected (Taylor, 2004). The BTI
provides scores on the total score level (referred to as factors) and on the subscale score level
(referred to as facets), and therefore—like the NEO-PI-R and the HPI—it allows for a two-
tiered interpretation of scale scores (Taylor & de Bruin, 2013). To date, factor analyses of the
BTI subscales have yielded strong support for the hypothesised five-factor solution (de Bruin,
2014; Metzer, de Bruin, & Adams, 2014; Ramsay, Taylor, de Bruin, & Meiring, 2008; Taylor
& de Bruin, 2006; 2013). However, the hierarchical structure of the five scales of the BTI and
the relative strength of the general factor (corresponding with the total score) and the group
factors (corresponding with the subscale scores) have not been investigated.
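The scale and subscale composition listed above can be captured in a simple data structure, which also makes the reported item totals easy to verify. A sketch, with facet names and item counts taken from Taylor and de Bruin (2013):

```python
# The hierarchical structure of the BTI (Taylor & de Bruin, 2013): each
# factor maps its facets to their item counts.
BTI = {
    "Extraversion": {"Ascendance": 7, "Liveliness": 8, "Positive Affect": 6,
                     "Gregariousness": 7, "Excitement Seeking": 8},
    "Neuroticism": {"Affective Instability": 8, "Depression": 9,
                    "Self-consciousness": 9, "Anxiety": 8},
    "Conscientiousness": {"Effort": 8, "Order": 10, "Duty": 9,
                          "Prudence": 6, "Self-discipline": 8},
    "Openness": {"Aesthetic": 7, "Ideas": 6, "Actions": 7,
                 "Values": 6, "Imaginative": 6},
    "Agreeableness": {"Straightforwardness": 7, "Compliance": 8,
                      "Prosocial Tendencies": 8, "Modesty": 7,
                      "Tendermindedness": 7},
}

# The facet item counts sum to the scale totals reported above:
totals = {factor: sum(facets.values()) for factor, facets in BTI.items()}
print(totals["Extraversion"], totals["Neuroticism"])  # 36 34
```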
The aims of the present study are twofold. Firstly, the study examines the hierarchical
structure of each of the five BTI scales by fitting three factor analytic models, namely a one-
factor model (Model 1), a correlated multifactor model (Model 2), and an orthogonal bifactor
model (Model 3). Secondly, the study demonstrates how the bifactor analytic technique could
be employed for the evaluation of the dimensionality of a hierarchical personality instrument
especially with regard to the subfactors (represented by subscales) possibly introducing
multidimensionality at the scale level. It was hypothesised that the bifactor model would
provide the best fit due to the two-tiered structure of the BTI scales. More generally, the
methods used may guide the evaluation of other hierarchical personality
instruments that are interpreted at both a total score (scale) level and a subscale level. Most
importantly, this evaluation of the dimensionality of the BTI scales allows the suitability of the
scales for computer adaptive testing applications to be evaluated.
3.2. Method
3.2.1. Participants
Participants were 1,962 South African adults who completed the BTI for selection,
development, counselling and placement purposes. Participants represent working men (62%)
and women (38%) with a mean age of 33 years (SD = 9.08, Md = 33 years) from all provinces
in South Africa. The majority of the participants were Black (54%) and White (36%) with the
remainder being of Mixed Race (6%) and Asian (4%) ethnicities. While not a random sample
(convenience sampling from an existing database was used), the participants reflect the white-collar
working population demographic in South Africa (Statistics South Africa, 2012). All
participants completed the BTI in English.
3.2.2. Instrument
The BTI makes use of a five-point Likert-type scale with response options that
range from (1) ‘Strongly Disagree’ to (5) ‘Strongly Agree’ (Taylor & de Bruin, 2006,
2013). The BTI has satisfactory reliability on the scale level with each of the five scales
demonstrating Cronbach alpha coefficients of above .87 across different South African
ethnicities (Taylor & de Bruin, 2013). The Cronbach alpha coefficients of the subscales
range from .44 for the Values facet of Openness to .85 for Affective Instability, with most of the
subscales reflecting reliability coefficients of above .75 (Taylor & de Bruin, 2013). Factor
analyses of the subscales have yielded strong evidence in support of the construct validity
of the Big Five traits. Further factor analyses indicated good congruence between the
factor structures for Black and White groups in South Africa with Tucker’s phi
coefficients > .93 for the five scales (Taylor & de Bruin, 2013).
3.2.3. Data Analysis
Analyses were conducted using the lavaan package (Rosseel, 2012) in R (R Core
Team, 2013). The data were subjected to three confirmatory factor analytic (CFA)
models as discussed in the introduction. The three CFA models were the single-factor
(Model 1), multifactor (Model 2), and bifactor models (Model 3) respectively.
Each of the CFA models was separately fitted to the five BTI scales. The items
were treated as ordered categorical variables and all parameters were estimated with the
mean and variance adjusted weighted least squares (WLSMV) estimator (cf. Flora &
Curran, 2004).
Model fit was evaluated with reference to the WLSMV chi-square (WLSMVχ²),
comparative fit index (CFI; Bentler, 1990), Tucker-Lewis index (TLI; Tucker & Lewis,
1973), and the root mean square error of approximation (RMSEA; Steiger & Lind, 1980).
CFI and TLI values > .95 and RMSEA values < .08 are recommended as cutoffs for
acceptable fit (Browne & Cudeck, 1993; Hu & Bentler, 1999). However, given Kenny
and McCoach’s (2003) observation that the CFI performs poorly in models where there
are many variables per factor, the less stringent cutoff of CFI and TLI > .90 was adopted.
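The adopted cutoffs amount to a simple screening rule, sketched below. The function name is illustrative, and the values passed in are arbitrary examples rather than study results:

```python
# Sketch of the fit-evaluation rule adopted in this study: CFI and TLI
# above the relaxed .90 cutoff (cf. Kenny & McCoach, 2003) and RMSEA
# below .08. The function name is illustrative.

def acceptable_fit(cfi, tli, rmsea):
    return cfi > 0.90 and tli > 0.90 and rmsea < 0.08

print(acceptable_fit(cfi=0.93, tli=0.92, rmsea=0.06))  # True
print(acceptable_fit(cfi=0.86, tli=0.85, rmsea=0.07))  # False
```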
Because the one factor model is nested in the correlated multifactor model and, in
turn, the correlated multifactor model is nested in the bifactor model [for a more complete
discussion of nested models in a bifactor context see Chen, West and Sousa (2006) and
Reise et al. (2013)], adjusted chi-square difference tests were employed to test whether
the differences in fit between the models were statistically significant (Bollen, 1989).
McDonald’s (1999) coefficient omega (omegaT) is a reliability coefficient that
was used to determine, for each BTI scale, the proportion of observed variance jointly
explained by the general and group factors. Coefficient omega hierarchical (omegaH) was
used to determine the proportion of observed variance accounted for by the general factor
alone (Zinbarg et al., 2005). Finally, coefficient omega specific (omegaS) was used to
determine for each subscale the proportion of observed variance explained by the group
factors beyond the general factor (Reise et al., 2013). The proportion of the reliable
variance accounted for by the general factor (PRVG) and the proportion of the reliable
variance accounted for by the group factors (PRVS) was also calculated (cf. Reise et al.,
2013). Generally, PRVG and PRVS are expected to be higher than the corresponding
omegaH and omegaS values, because PRVG and PRVS use only the reliable variance of
a scale, whereas omegaH and omegaS use the total variance, which also includes error
variance.
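The relations among omegaT, omegaH, and PRVG can be sketched directly from standardised bifactor loadings. The nine-item loading matrix below is hypothetical and serves only to illustrate the arithmetic, not to reproduce any BTI estimate:

```python
# Sketch of omegaT, omegaH, and PRVG computed from standardised bifactor
# loadings (cf. McDonald, 1999; Reise et al., 2013). The loadings below
# are hypothetical: nine items, one general factor, three group factors.

general = [0.6, 0.6, 0.6, 0.5, 0.5, 0.5, 0.7, 0.7, 0.7]
groups = {
    "A": [0.4, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "B": [0.0, 0.0, 0.0, 0.3, 0.3, 0.3, 0.0, 0.0, 0.0],
    "C": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.2, 0.2],
}

def omega_coefficients(general, groups):
    """Return (omegaT, omegaH, PRVG) for the total score."""
    # residual variance of each item under the orthogonal bifactor model
    residuals = [
        1.0 - g ** 2 - sum(grp[i] ** 2 for grp in groups.values())
        for i, g in enumerate(general)
    ]
    general_var = sum(general) ** 2
    group_var = sum(sum(grp) ** 2 for grp in groups.values())
    total_var = general_var + group_var + sum(residuals)
    omega_t = (general_var + group_var) / total_var  # all common variance
    omega_h = general_var / total_var                # general factor only
    return omega_t, omega_h, omega_h / omega_t       # PRVG = omegaH/omegaT

omega_t, omega_h, prvg = omega_coefficients(general, groups)
print(round(omega_t, 2), round(omega_h, 2), round(prvg, 2))
```

A PRVG close to 1 indicates that nearly all of the reliable variance in the total score reflects the general factor, which is the condition of interest for ‘unidimensional enough’ measurement.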
3.2.4. Ethical Considerations
Ethical clearance for this study was obtained from the Faculty Ethics Committee
in the Faculty of Management at the University of Johannesburg. Permission to use the
data for research was granted from JvR Psychometrics, which owns the BTI database.
Only BTI data for which test-takers provided consent for research was used.
Confidentiality and anonymity of the data were maintained by not including any
identifying information in the BTI data set.
3.3. Results
Fit statistics of the three models across the five BTI scales are summarised in Table
3.1. The WLSMVχ² indicated that the hypothesis of perfect fit had to be rejected for each
of the five traits across all three models (p < .01). For each of the five scales, the CFI, TLI
and RMSEA indicated better fit for Model 2 than for Model 1, and in turn, better fit for
Model 3 than for Model 2. For each of the five scales, adjusted chi-square difference tests
(refer to Table 3.2) demonstrated that Model 3 fit statistically significantly better than the
competing models (p <.001). Hence, the hypothesis that the bifactor model would provide
the best fit was supported. The remainder of the results focuses on the interpretation of
Model 3.
Table 3.1
Three confirmatory factor models of the structure of the BTI scales
Scale
Fit Indices E(a) E(b) N C O A
Model 1 (One-factor model)
WLSMVχ² 19844.89* 8525.29* 9172.42* 11325.17* 9059.12* 10262.90*
df 594 359 527 779 464 629
CFI .465 .712 .833 .848 .777 .802
TLI .433 .689 .822 .841 .762 .790
RMSEA .132 .112 .093 .085 .099 .090
Model 2 (Correlated multi-factor model)
WLSMVχ² 7258.89* 4384.69* 5875.50* 5976.71* 4649.13* 7237.27*
df 584 344 521 769 454 619
CFI .815 .858 .897 .925 .891 .864
TLI .800 .843 .889 .920 .881 .853
RMSEA .079 .079 .074 .060 .070 .075
Model 3 (Bifactor model)
WLSMVχ² 5466.95* 3343.01* 4168.27* 4999.26* 4130.48* 5194.15*
df 558 322 493 738 432 592
CFI .864 .893 .929 .939 .904 .905
TLI .846 .875 .919 .932 .890 .893
RMSEA .069 .071 .063 .055 .067 .064
Note. E(a) = Extraversion A, E(b) = Extraversion B, N = Neuroticism, C = Conscientiousness, O =
Openness, A = Agreeableness.
* p < .01
Table 3.2
Chi-square difference test of the three factor models
Model Comparison df χ² χ²Δ
Extraversion A Model 1 594 25759.90
Model 2 584 8563.80 841.52*
Model 3 558 5973.40 368.33*
Extraversion B Model 1 350 8457.30
Model 2 344 4122.90 179.69*
Model 3 322 2941.90 209.68*
Neuroticism Model 1 527 7786.60
Model 2 521 4730.50 66.51*
Model 3 493 3130.70 193.18*
Conscientiousness Model 1 779 9467.50
Model 2 769 4515.50 115.91*
Model 3 738 3595.50 79.19*
Openness Model 1 464 8552.60
Model 2 454 4087.90 215.70*
Model 3 432 3521.50 70.94*
Agreeableness Model 1 629 9970.30
Model 2 619 6804.00 116.79*
Model 3 592 4640.10 249.30*
Note. * p < .001.
3.3.1. Fit indices for the bifactor model (Model 3)
Model 3 demonstrated acceptable fit for four of the five scales: Neuroticism (CFI
= .929, TLI = .919, RMSEA = .063); Conscientiousness (CFI = .939, TLI = .932, RMSEA
= .055); Openness (CFI = .904, TLI = .890, RMSEA = .067); and Agreeableness (CFI =
.905, TLI = .893, RMSEA = .064). However, comparatively less satisfactory fit was
obtained for Extraversion (CFI = .864, TLI = .846, and RMSEA = .069).
Closer inspection revealed that the Excitement Seeking subscale contributed
disproportionately to the weaker fit of the Extraversion bifactor model. Removal of the
Excitement Seeking subscale produced an improved CFI (.893) and TLI (.875), and a
somewhat weaker, but still acceptable, RMSEA (.071). Against this background, the
present results for two Extraversion bifactor models are presented in the remainder of the
chapter: Extraversion A contains all five subscales, whereas Extraversion B excludes the
potentially problematic Excitement Seeking facet.
3.3.2. Reliability of the BTI scales
McDonald’s omegaT for the BTI scales was as follows (Cronbach’s alpha is given
in parentheses): Extraversion A, .92 (.89); Extraversion B, .92 (.90); Neuroticism, .96
(.95); Conscientiousness, .97 (.96); Openness .93 (.91); and Agreeableness, .94 (.93). In
turn, omegaH, which reflects the proportion of total variance explained by the general
factor, was as follows: Extraversion A, .76; Extraversion B, .82; Neuroticism, .92;
Conscientiousness, .92; Openness, .85; and Agreeableness, .87.
The ratio of omegaH to omegaT indicated that the proportion of reliable variance
that consists of general factor variance (PRVG) was as follows: Extraversion A, .82;
Extraversion B, .89; Neuroticism, .96; Conscientiousness, .95; Openness .91; and
Agreeableness, .93. The PRVG and omegaH coefficients indicate that the bulk of the total
variance and the reliable variance of each scale were accounted for by a dominant general
factor. Next, the reliability of the subscales is presented.
3.3.3. Reliability of the BTI subscales
Results showed that most of the subscales captured some non-negligible common
variance above and beyond the general factor. With only one exception, namely
Excitement Seeking, omegaS values were smaller than .50, which indicates that the group
factors accounted for less than 50% of the observed variance of subscale scores (see Table
3.3). The PRVS values mirror these findings and indicate that the group factors account
for between 4% and 52% of the reliable variance with most subscales accounting for less
than 35% of the reliable variance (see Table 3.3). It is noticeable that the omegaS of the
Liveliness and Depression subscales were low (< .10), which suggests that these two
subscales measure mostly a general factor and very little beyond that.
Table 3.3
Proportion of specific and total variance explained by factors and facets of the BTI
Factors/Facets OmegaT OmegaH OmegaS PRVG PRVS
Extraversion A .92 .76 - .82 -
Ascendance .82 .40 .42 .49 .51
Liveliness .75 .72 .03 .96 .04
Positive Affect .78 .40 .39 .50 .50
Gregariousness .86 .51 .34 .60 .40
Excitement Seeking .86 .03 .83 .03 .97
Extraversion B .92 .82 - .89 -
Ascendance .82 .40 .43 .48 .52
Liveliness .75 .71 .03 .96 .04
Positive Affect .78 .41 .37 .53 .47
Gregariousness .86 .50 .36 .58 .42
Neuroticism .96 .92 - .96 -
Affective Instability .90 .64 .26 .71 .29
Depression .89 .83 .06 .93 .07
Self-consciousness .85 .63 .22 .74 .26
Anxiety .89 .70 .19 .79 .21
Conscientiousness .97 .92 - .95 -
Effort .88 .58 .30 .66 .34
Order .91 .67 .24 .74 .26
Duty .90 .71 .19 .79 .21
Prudence .86 .73 .13 .85 .15
Self-discipline .89 .76 .13 .85 .15
Openness .93 .85 - .91 -
Aesthetic .86 .41 .45 .47 .53
Ideas .78 .67 .11 .86 .14
Action .79 .63 .16 .80 .20
Values .62 .31 .31 .51 .49
Imaginative .87 .55 .32 .63 .37
Agreeableness .94 .87 - .93 -
Straightforward .80 .59 .21 .74 .26
Compliance .80 .62 .19 .77 .23
Prosocial .85 .53 .32 .62 .38
Modesty .71 .46 .26 .64 .36
Tendermindedness .85 .66 .19 .77 .23
Note. OmegaT = Omega total, OmegaH = Omega hierarchical, OmegaS = Omega specific, PRVG =
proportion of reliable variance of the general factor, PRVS = proportion of the reliable variance of the specific/group factors.
3.3.4. The bifactor pattern matrix
The bifactor pattern matrix of one scale, namely Neuroticism, is presented to
demonstrate how the general and group factors account for the variance of the items (see
Table 3.4). The pattern matrix reveals how some clusters of items tend to have strong
loadings on the general factor, but weak loadings on a group factor (see for instance items
N9 to N17 that constitute the Depression subscale) and where certain groups of items
tend to show relatively strong loadings on both the general and group factor (see for
instance items N1 to N8 that constitute the Affective Instability subscale). The PRVS
(.07) and omegaS (.06) values of the Depression subscale indicate that the group factor
contributes little information beyond the general factor (see Table 3.3). By contrast, the
PRVS (.29) and omegaS (.26) values of the Affective Instability subscale indicate that
the group factor explains a fair proportion of variance beyond the general factor.
Table 3.4
Standardised factor loadings for Neuroticism (Model 3)
Items General Factor Aff.Inst. Depression Self-consc. Anxiety
N1 .54 .61
N2 .54 .64
N3 .64 .33
N4 .61 .55
N5 .66 .37
N6 .67 .19
N7 .65 .18
N8 .58 .08
N9 .58 -.06
N10 .66 .03
N11 .76 .04
N12 .68 .18
N13 .81 .13
N14 .47 .06
N15 .66 .24
N16 .57 .24
N17 .62 .54
N18 .25 .28
N19 .58 .13
N20 .49 .36
N21 .67 .21
N22 .55 .53
N23 .25 .53
N24 .76 .26
N25 .59 .27
N26 .50 .10
N27 .53 .54
N28 .56 .09
N29 .65 .15
N30 .60 .22
N31 .65 .29
N32 .72 .31
N33 .61 .74
N34 .66 .14
Note. Aff.Inst. = Affective Instability, Self-consc. = Self-consciousness
3.4. Discussion
First, the model fit of the bifactor analyses of the BTI is discussed. Then the
dimensionality of the BTI scales is considered. Lastly, some implications of fitting
one-dimensional item response theory models in preparation for computer adaptive
testing are reviewed.
3.4.1. The fit of the bifactor model
This study set out to demonstrate how the hierarchical factor structure of a
personality inventory, namely the BTI, can be evaluated in order to justify whether fitting
unidimensional item response theory models is warranted and thus whether hierarchical
personality scales evidence sufficient unidimensionality for computer adaptive
testing. In summary, the results showed that for each of the five BTI scales a bifactor
model fit better than either a one-factor or correlated multiple factor model. The bifactor
results showed that each of the five BTI scales measured a strong general factor and four
or five discernible group factors that correspond with the BTI subscales.
Although the results indicate that each scale is multidimensional, in the sense that
multiple factors are measured (i.e., a general factor and four or five group factors), the
application of such scales to one-dimensional item response theory models appears
justified because each scale measures a dominant trait (as represented by the general
factor) as well. In this regard, results also indicate that each subscale measures a group
factor beyond the general factor. This is indicative of hierarchical personality
measurement where total scores are interpreted for broad traits and subscale scores are
interpreted for nested narrow traits.
3.4.2. The dimensionality of the BTI scales
It was demonstrated that the strength of the group factors varies from subscale to
subscale. In this regard, subscales that measure weak group factors (a tentatively
suggested value of PRVS < .20) evidence the strongest unidimensionality. On the other
hand, it is recommended that PRVS values > .60 (again a tentative suggestion by the
author) be closely scrutinised, as subscales with this proportion of specific reliable
variance may measure something beyond the general factor. For instance, the Depression
(Neuroticism) and Liveliness (Extraversion) subscales measured little beyond their
respective general factors, indicating strong evidence of unidimensional measurement
for these subscales. However, one subscale,
Excitement Seeking, indicated measurement of something beyond the general factor of
Extraversion (PRVS = .97). To determine whether Excitement Seeking should
be excluded before the commencement of computer adaptive testing it is recommended
that the fit of the items first be evaluated with Excitement Seeking included in the item
response theory parameterization process. Similarly, it should be determined whether the
computer adaptive version of the Extraversion scale, with Excitement Seeking included,
estimates test-takers’ standing on the latent trait equivalently to the non-adaptive full form
of the scale. Although the Extraversion B model (with Excitement Seeking removed)
demonstrated better fit to the bifactor model, with improved PRVG and PRVS values,
sufficient general factor dominance may remain once poorly fitting items are
removed in the item response theory parameterisation process. For this reason,
Excitement Seeking was retained as a subscale of the Extraversion scale in Chapters 4
and 5.
In general, the subscales indicated a maximal PRVS of .53 with an average PRVS
= .29 (when Excitement Seeking is removed). However, most of the PRVG values were
above .80 for the general factors of each scale indicating good general factor dominance.
As a whole these results indicate good evidence for general factor dominance at the scale
level.
3.4.3. Implications for fit to one-dimensional item response theory models
In accord with Reise et al. (2013) and Zinbarg et al. (2005) it is recommended that
personality psychologists evaluate hierarchical personality scales through the separation of the
general factor from the group factors using bifactor analysis. This highlights to what degree
the general factor and group factors account for the variance of the scores on a hierarchical
personality scale. Such an evaluation may give insight into dominance of the general factor for
each hierarchical scale and thus the applicability of such scales for fit to one-dimensional item
response theory models and preparation for computer adaptive testing.
Further, it is recommended that the assumption of unidimensionality be treated as
tenable for scales with a PRVG > .80 and subscale PRVS values < .60.
Only one subscale, Excitement Seeking on the Extraversion scale, evidenced possibly
interfering reliable specific variance; all other scales demonstrated
general factor dominance.
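These suggested cutoffs (PRVG > .80 for scale-level general factor dominance; PRVS > .60 flagging a subscale for scrutiny) can be expressed as a simple screening rule and checked against the values reported in Table 3.3. A sketch follows; the function names are illustrative and the thresholds are the author’s tentative suggestions:

```python
# Sketch of the tentative screening rule suggested in this chapter: treat
# the unidimensionality assumption as tenable when PRVG > .80, and flag
# subscales with PRVS > .60 for closer scrutiny. Function names are
# illustrative; thresholds are tentative suggestions, not fixed standards.

def scale_unidimensional_enough(prvg):
    return prvg > 0.80

def subscale_needs_scrutiny(prvs):
    return prvs > 0.60

# Scale-level PRVG values reported in Table 3.3:
prvg_values = {"Extraversion A": 0.82, "Extraversion B": 0.89,
               "Neuroticism": 0.96, "Conscientiousness": 0.95,
               "Openness": 0.91, "Agreeableness": 0.93}
print(all(scale_unidimensional_enough(v) for v in prvg_values.values()))  # True

# Excitement Seeking (PRVS = .97) is the only subscale flagged; the
# next-largest PRVS in Table 3.3 is .53 (Aesthetic):
print(subscale_needs_scrutiny(0.97), subscale_needs_scrutiny(0.53))  # True False
```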
In conclusion, psychologists must account for multidimensionality of scales in
hierarchical personality measurement. If scales are to be interpreted in a two-tiered
manner, then strictly speaking, unidimensionality is a fiction. This is because two-tiered
interpretation requires conceptual and psychometric differentiation of group factors from
the general factors at the subscale level. This differentiation is necessary for subscale
scores to be meaningful. Therefore, hierarchical personality inventories should be
investigated for their dimensionality by scrutinizing the general factor dominance of
scales while evaluating the specific reliable variance accounted for by subscales.
What has to be determined is whether the specific variance measured by the
subscales interferes with unidimensional measurement at the scale level. Therefore, the
question is not whether a scale is unidimensional; the question is whether the scale is
‘unidimensional enough’. This question can only be answered if personality
psychologists investigate how group factors (which are represented by subscales)
compete for variance with the general factors (represented by total scores) using a bifactor
approach.
3.5. Overview of Chapter 3 and a preview of Chapter 4
In this chapter it is demonstrated how the BTI, and possibly other hierarchical
personality inventories, can be investigated for dimensionality using a bifactor analysis.
More importantly, it is demonstrated how the unidimensionality or multidimensionality of
scales is a matter of degree. This chapter also evaluated the dimensionality of the BTI
because unidimensionality is required for application to one-dimensional item
response theory frameworks and, in turn, for computer adaptive testing.
This evaluation is further built upon in the next chapter (Chapter 4) where the
BTI scales are fit to the Rasch rating scale model, which is a one-dimensional item
response theory model. For computer adaptive testing to commence, unbiased item
parameters need to be generated. Therefore, the next chapter will evaluate the BTI within
a one-dimensional item response theory model. Possibly problematic items are identified
and the scales of the BTI are optimised for computer adaptive testing.
CHAPTER 4: FITTING THE BTI SCALES TO THE RASCH MODEL
“A measuring instrument must not be seriously affected in its measuring function by the object
of measurement. To the extent that its measurement function is so affected, the validity of the
instrument is impaired or limited. If a yardstick measured differently because of the fact that it
was a rug, a picture, or a piece of paper that was measured, then to that extent the
trustworthiness of that yardstick as a measuring device would be impaired.” – L.L. Thurstone
(1928, p.547).
4.1. Introduction
The aim of this study was to examine the psychometric properties of the Basic Traits
Inventory (BTI) using the Rasch rating scale model so that a core set of items could be
identified for use within a computer adaptive testing framework. The BTI is a five-factor
hierarchical personality inventory designed and constructed for the South African context (de
Bruin & Rudnick, 2007; Grobler, 2014; Morgan & de Bruin, 2010; Ramsay et al., 2010; Taylor,
2004, 2008; Taylor & de Bruin, 2006, 2013; Vogt & Laher, 2009). Although the psychometric
properties of the BTI have been investigated using the Rasch model, which is a one-dimensional
item response theory model (cf. Grobler, 2014; Taylor, 2008; Taylor & de Bruin, 2013), this
model has not been used to prepare the BTI for computer adaptive testing applications.
Item response theory models, and especially the one-dimensional Rasch model, are
pivotal for the development of computer adaptive tests (Ma, Chien, Wang, Li, & Yui, 2014).
This is because items that are used in computer adaptive testing must be ranked along a single
dimension or construct (Linacre, 2000); function independently from one another in an
invariant manner (Ma et al., 2014); and demonstrate a spread of difficulty/endorsability across
a single latent construct so that persons of varying abilities or trait levels can be measured
accurately and precisely (Thompson & Weiss, 2011). Another important consideration in the
development of a computer adaptive test is that test items must demonstrate invariance of
measurement across differing groups so that a test-taker’s standing on the latent trait can be
estimated in an unbiased manner (Zwick, 2009). The Rasch model is well suited to these
requirements because it fits data gathered on items to exactly these specifications (Linacre,
2000).
In the past the Rasch model has been applied to the BTI to investigate and evaluate
Rasch model fit, rating scale functioning, differential item functioning across demographic
groups (Taylor, 2006, 2008; Taylor & de Bruin, 2013), and differential item functioning across
language groups (Grobler, 2014). The BTI has also been investigated extensively using
classical test theory techniques such as factor analysis and scale reliability (Metzer et al., 2014;
Taylor, 2004, 2008).
This study aims to evaluate the psychometric properties of the BTI by implementing
the Rasch rating scale model so that core items can be identified for use in computer adaptive
applications. The process of applying the Rasch measurement model to personality inventories
is discussed in the forthcoming sections.
4.1.1. The use of the Rasch model for computer adaptive test development
Some of the information garnered by fitting personality scale data to the Rasch model
includes person ability estimates (Mellenbergh & Vijn, 1981), item difficulty or endorsability
estimates (Embretson & Reise, 2000), item and person spread or dispersion across the range of
the latent trait (Bond & Fox, 2007), person and item reliability (Boone, Staver & Yale, 2014),
and rating scale category functioning (Linacre, 2002a). These item response theory evaluations
are a necessary requirement when building and/or evaluating a measurement scale within an
item response theory model because they determine whether the data meet the assumptions of
the model (Bond & Fox, 2007). These assumptions are core requirements for the practical
implementation of a computer adaptive test and it is therefore important that the items of a
scale meet these assumptions for computer adaptive testing to be practically viable (Thompson
& Weiss, 2011; Weiss, 2013).
As computer adaptive tests are heavily reliant on the assumptions of item response
theory models (Weiss, 2013), measurement scales that are used in computer adaptive testing
need to meet two broad requirements, namely: (1) that the items of the scale measure a single
latent construct only; and (2) that at any level of the latent construct the probability of endorsing
an item within the measurement scale is unrelated to the probability of endorsing any other
item in that scale (Hambleton & Jones, 1993; Streiner, 2010). These requirements are referred
to as the unidimensionality and local independence assumptions (McDonald, 2009; Sick,
2011).
The unidimensionality assumption is important for computer adaptive tests as each item
is selected by a computer algorithm based on its positional standing or rank on a single latent
construct (Linacre, 2000). The positional standing of items on the latent construct is a core
feature for computer adaptive tests where each item is selected by means of its relative position
along the trait continuum in order to estimate a test-taker’s standing on the latent construct of
interest (Thompson & Weiss, 2011). Of course, the positioning of items on the latent construct
presupposes that each item of the scale must measure a single construct and that the items of a
scale should only differ from one another regarding the level, or standing, on the latent
construct measured (Hogan, 2013). Additionally, because each item is used independently to
estimate a test-taker’s standing on the latent construct, each item must demonstrate
independence from any other item at the scale level (Linacre, 2000; Thompson & Weiss, 2011).
The Rasch model is uniquely suited to this purpose as it assumes that each item of a
measurement scale measures a single construct and that each item’s relative location on the
latent construct is independent of any other item’s location on that construct (Marais &
Andrich, 2008). This means that if the data fit the Rasch model each item can be used
independently to estimate a test-taker’s standing on the latent construct in question without
depending on the administration of any other item (Baghaei, 2008; Pallant & Tennant, 2007).
Consequently, the Rasch model fits data to these requirements and evaluates the degree to which
the items meet them. In the next section, some of the fit statistics that are important for the
evaluation of a scale within the Rasch rating scale model are discussed.
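As an aside, the local independence assumption described above can be probed empirically by correlating item residuals after the model-expected responses are removed, in the spirit of Yen's Q3 statistic. The sketch below is purely illustrative and is not part of the Winsteps workflow used in this study:

```python
import numpy as np

def q3_residual_correlations(obs, expected):
    """Correlations among item residuals (persons x items arrays).

    After subtracting the model-expected response from each observed
    response, locally independent items should show residual
    correlations near zero; large positive values flag dependent pairs.
    """
    resid = np.asarray(obs, dtype=float) - np.asarray(expected, dtype=float)
    return np.corrcoef(resid, rowvar=False)
```

Inspecting the off-diagonal entries of the resulting matrix then gives a direct, if informal, view of whether any item pair shares variance beyond the single latent construct.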
4.1.2. The application of Rasch diagnostic criteria for psychometric evaluation
The most basic Rasch fit statistics are the mean square fit statistics that indicate how
well the items of a particular measurement scale measure a single construct (Linacre, 2002b).
These fit statistics are referred to as infit and outfit mean square fit statistics (Bond & Fox,
2007). Both the infit and outfit mean square statistics indicate whether items of a scale underfit
or overfit the unidimensional Rasch model (Linacre, 2002b). Underfit indicates that an
item measures something other than the construct of interest (i.e., multidimensionality), whereas overfit
indicates that an item measures the latent construct without adding any
unique information beyond the other items in the scale (Linacre, 2002b). Essentially, underfit
identifies items that do not measure the construct of interest. Overfit, on the other hand,
identifies items that are redundant when compared to other items in the measurement scale.
Fitting personality measurement scale data to the Rasch model also provides
information on the spread of person ‘traitedness’ and item endorsability on a single common
scale of measurement as well as item and person reliability indices that indicate the
reproducibility and consistency of item and person locations on the common latent construct
(Bond & Fox, 2007). Additionally, if a polytomous rating scale is used, as is the case with most
personality scales, Linacre (2002a) suggests that the performance of the rating scale be
investigated in addition to the basic item fit statistics. This is important because rating scale
categories must be capable of measuring different amounts of the latent construct and must do
so in their intended manner from a lower to higher level of the latent construct (Linacre, 2002a).
Another important issue that needs to be addressed when preparing a measurement
scale for computer adaptive testing is item bias (Weiss, 2013). It is important to realise that
optimal computer adaptive test functioning of a measurement scale is only possible if there is
very little bias when estimating test-takers’ standing on a latent construct (Eggen & Verschoor,
2006). Possible item bias can be reduced by selecting items that remain invariant across
different demographic groups of test-takers (Linacre, 2000; Zwick, 2009). As mentioned
earlier, invariant measurement is very important for computer adaptive tests because fewer
items are administered than in non-adaptive tests. This makes every response to a particular
item by a specific test-taker pivotal to accurately estimate a test-taker’s standing on the latent
construct (Zwick, 2009). Also, because items with a specific endorsability are selected based
on the responses to previous items, biased measurement by a few items in the measurement
scale can have major repercussions for latent construct estimation and consequently the items
which are selected by a computer adaptive algorithm for administration (Zwick, 2009).
For example, if a single item is biased or underfits the Rasch model, the interim trait
estimate for a particular test taker may not be accurate in computer adaptive testing (Walter et
al., 2007). This inaccuracy will result in the selection of items to administer within the
computer adaptive testing framework that are not necessarily relevant or appropriate to the test-
taker and which may result in the inaccurate and imprecise estimation of the test-taker’s
standing on the latent construct measured. Lai et al. (2003) emphasise the importance of
investigating items for differential item functioning (DIF) within the Rasch model before
computer adaptive test applications are implemented to limit any bias when estimating the
latent trait in a computer adaptive manner.
Against this background, this study evaluates the BTI for computer adaptive test
preparation by fitting the scales of the BTI to the Rasch rating scale model and investigating
whether each scale meets the assumptions of the Rasch model. This was done in order to select
a number of optimally functioning core items for use within a computer adaptive testing
framework.
4.2. Method
4.2.1. Participants
Participants were South African adults (n =1962) who were assessed on the BTI for
selection, development, counselling and placement purposes. A convenience sampling
procedure was employed as participants were drawn from an existing database. The mean age
of the participants was 33 years (SD = 9 years, Md = 33 years). There were 936 men (62%)
and 647 women (38%). Participants represented the following ethnic groups: Black (54%),
White (36%), Mixed Race (6%) and Asian (4%). Although men and Whites are
overrepresented with respect to the general South African population, the participants roughly
reflect the white-collar working population in South Africa (Statistics South Africa, 2011).
Only the Black and White (n = 1746) participants were used for evaluation of
differential item functioning by ethnicity. The mean age of the Black and White group was 34
years (SD = 9 years, Md = 33 years). This group was composed of 647 women (37%) and 1099
men (63%). All participants completed the BTI in English.
4.2.2. Instrument
The BTI measures the five factors of personality through the use of 193 items
which are in the form of behavioural statements such as “I like to meet people” or “I am
organised” (Taylor & de Bruin, 2012). The BTI is a hierarchical personality inventory
and has 24 subscales, called facets, each of which is integrated into one of the five factors of
personality (Taylor & de Bruin, 2006, 2013). The BTI items make use of a five-point
Likert-type response scale with response options that range from (1) ‘Strongly Disagree’
to (5) ‘Strongly Agree’ (Taylor & de Bruin, 2006). The BTI has demonstrated good
reliability on the scale level with Cronbach alpha coefficients above .87 across each of
the five scales (Taylor & de Bruin, 2013). Most of the subscales of the BTI reflect
Cronbach alpha coefficients above .75 (Taylor & de Bruin, 2013). Factor analysis of the
subscales has demonstrated strong support for the construct validity of the Big Five traits.
Good congruence between the factor structures for Black and White groups in South
Africa has been demonstrated, with Tucker’s phi coefficients > .93 for the five scales
(Metzer et al., 2014; Taylor & de Bruin, 2013).
4.2.3. Data Analysis
Each of the five scales of the BTI was fit to the Rasch rating scale model using
Winsteps 3.81 (Linacre, 2014). SPSS 22.0 (IBM Corp, 2013) was used for an analysis of
variance (ANOVA) of item residuals, which were obtained using Winsteps 3.81.
The Rasch rating scale model is a unidimensional measurement model which can
be used when items share the same rating scale (Andrich, 1978; Wright & Masters, 1982).
The Rasch rating scale model estimates item difficulty (or item endorsability) locations,
response category thresholds, and person locations (the person’s relative standing on the
trait) in a common logit metric (Andrich, 1978; Wright & Masters, 1982). The rating scale model for
constructing measures from observations is (Linacre, 2002a; Wright & Douglas, 1986):
log(P_nik / P_ni(k-1)) = B_n - D_i - F_k
where:
P_nik is the probability that person n responds in category k when encountering item i
P_ni(k-1) is the probability that the response of person n to item i is in category k - 1
B_n is the ability level, or amount of the latent trait, of person n
D_i is the difficulty or endorsability of item i
F_k is the impediment to being observed in rating scale category k relative to category k - 1
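The category probabilities implied by this model can be computed directly from the parameters above. The following Python sketch (an illustrative helper, not code used in the study) returns the probability of each rating scale category given a person measure B_n, an item endorsability D_i, and the Rasch-Andrich thresholds F_k:

```python
import numpy as np

def rsm_category_probs(B, D, F):
    """Category probabilities under the Rasch rating scale model.

    B : person location B_n (logits)
    D : item endorsability D_i (logits)
    F : sequence of Rasch-Andrich thresholds F_1..F_m (logits)

    Returns an array of probabilities for categories 0..m.
    """
    # P_k is proportional to exp(sum_{j<=k} (B - D - F_j)); the lowest
    # category has log-numerator 0 by convention.
    log_num = np.concatenate(([0.0], np.cumsum(B - D - np.asarray(F, dtype=float))))
    num = np.exp(log_num - log_num.max())  # shift by the max for numerical stability
    return num / num.sum()
```

For a person located at the item's endorsability (B = D) with symmetric thresholds, the resulting distribution is symmetric around the middle categories; raising B shifts probability mass toward the higher categories.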
The Rasch rating scale model was selected for the analysis of the BTI scales because
the instrument meets the assumptions of this model (i.e., all the items of the BTI use the
same five-point Likert type scale) and allows the operation of rating scales to be
investigated (Linacre, 2002a). The Rasch rating scale model is a special case of the
Masters partial credit model (Masters, 1982), the Andrich dispersion model (Andrich,
1978), and the Rasch polytomous model (Rasch, 1961), in which the rating scale thresholds are
constrained to be identical across items and the item discrimination parameters are held
constant (Koch & Dodd, 1995). The rating scale model was selected over the standard
partial credit model; generalised partial credit model and the graded response model
because the scales of the BTI share the same rating scale structure across all the items
(Ostini & Nering, 2006). It is thus assumed that the distances between the rating scale
thresholds remain constant across items, which is congruent with instruments where Likert-type
scales are used and where the items and their rating scale are assumed to share the same
meaning across the instrument or scale (Ostini & Nering, 2006). Also, unlike the graded
response model or the generalised partial credit model, the item discrimination parameters are
held constant in the rating scale model, which is more appropriate for a one-parameter item
response theory model like the Rasch model (Hart et al., 2006). The Rasch rating scale
model assumes that the only discrimination parameter between test-takers is the relative
endorsability, or difficulty, of the items administered (Andrich, 1978) whereas the partial
credit model places no restrictions on threshold step values (Masters, 1982) and the
graded response model (Samejima, 1969) includes item discrimination parameters (Koch
& Dodd, 1995). The constrained nature of the Rasch rating scale model thus allows for
invariant measurement, where only the probability of endorsing the items of a scale is
compared between groups and evaluated for equivalence (Ostini & Nering, 2006). As
previously mentioned, invariant measurement across test-takers of varying trait levels
and/or different demographic groups is a key requirement in computer adaptive testing
(Hart et al., 2006).
The following were inspected as per the recommendations of Apple and Neff (2012),
Lai et al. (2003) and Linacre (2000) for each scale of the BTI: summary item and person
fit statistics; individual item fit; person-item dispersion; person separation indices; person
reliability indices; and an evaluation of rating scale category performance. Finally, the
scales were also investigated for differential item functioning for gender and ethnicity.
4.2.4. Ethical Considerations
Ethical clearance for this study was obtained from the Faculty Ethics Committee
in the Faculty of Management at the University of Johannesburg. Permission to use the
data for research was granted from JvR Psychometrics, which owns the BTI database.
Only BTI data for which test-takers provided consent for research was used.
Confidentiality and anonymity of the data were maintained by not including any
identifying information in the BTI data-set.
4.2.4.1. Evaluating Rasch fit indices
Fit indices, which indicate how well items of a scale meet the requirements of the Rasch
model, were evaluated for the BTI. Following the recommendations of Bond and Fox
(2007) for the fit of personality inventories that employ polytomous items, infit and outfit
mean square statistics < 1.40 and > .60 were considered acceptable. The infit mean square
is emphasised in this study as it is sensitive to irregularities in the responses to items that
are closely targeted to a person’s ability/trait level (Linacre, 2002b). In computer adaptive
testing items are often selected that match as closely as possible the person’s relative trait
level (Kingsbury, 2009), which makes infit statistics more applicable than outfit for the
investigation of Rasch model fit. Infit is inlier-sensitive, whereas outfit is
sensitive to irregularities in responses to items that do not match a person’s ability/trait
level (Linacre, 2002b). Using the infit mean square therefore avoids, to some degree, the
spurious effects of outliers. The BTI was investigated for fit on the overall scale level as
well as on the item level.
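For illustration, the two mean square statistics can be sketched from their standard Rasch definitions (a simplified sketch, not the Winsteps implementation): outfit is the unweighted mean of squared standardized residuals, which is why outliers inflate it, while infit weights each squared residual by its model variance so that responses to off-target items carry less weight:

```python
import numpy as np

def fit_mean_squares(obs, expected, variance):
    """Infit and outfit mean squares for one item across n persons.

    obs      : observed responses
    expected : model-expected responses E_ni
    variance : model variance W_ni of each response

    Outfit is the plain mean of squared standardized residuals, so a
    few highly unexpected responses can inflate it; infit divides the
    total squared residual by the total model variance, weighting
    on-target (high-information) responses more heavily.
    """
    obs = np.asarray(obs, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)
    sq_resid = (obs - expected) ** 2
    outfit = (sq_resid / variance).mean()
    infit = sq_resid.sum() / variance.sum()
    return infit, outfit
```

Both statistics have an expected value of 1.00 when the data fit the model, which is the reference point used when interpreting Table 4.1.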
4.2.4.2. Person separation indices and reliability.
To evaluate the degree to which the scales separate persons with different trait
levels, person separation and person reliability coefficients were calculated. Following
the recommendations of Lai et al. (2003), Linacre (2014), and Wright and Stone (1999),
person reliabilities > .80 and person separation indices > 2.00 were deemed acceptable.
Person-item maps, which indicate item endorsability relative to the
sample, were also generated to determine whether the items of the BTI scales correspond
to person trait levels.
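The separation and reliability criteria above can be made concrete with a short sketch under the usual Rasch definitions (variable names are hypothetical): separation is the ratio of the error-corrected spread of person measures to the average standard error of measurement, and reliability follows as separation squared over one plus separation squared:

```python
import numpy as np

def person_separation(measures, errors):
    """Person separation and reliability (after Wright & Stone, 1999).

    measures : Rasch person measures (logits)
    errors   : standard errors of those measures

    The observed variance of person measures is inflated by measurement
    error, so the mean error variance is subtracted before forming the
    true-spread-to-error ratio.
    """
    measures = np.asarray(measures, dtype=float)
    errors = np.asarray(errors, dtype=float)
    error_var = np.mean(errors ** 2)                      # mean square error
    true_var = max(np.var(measures, ddof=1) - error_var, 0.0)
    separation = np.sqrt(true_var / error_var)
    reliability = separation ** 2 / (1 + separation ** 2)
    return separation, reliability
```

Under these definitions a separation index of 2.00 corresponds to a person reliability of .80, which is why the two cut-offs are used together.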
4.2.4.3. Rating scale performance.
Rating scale performance for each scale of the BTI was evaluated by investigating
whether rating scale categories were used sufficiently by the sample and whether response
categories were progressively and monotonically ordered (Apple & Neff, 2012; Linacre,
2002a). Apple and Neff (2012) and Linacre (2002a) suggest that category thresholds should
advance monotonically, from higher endorsability (lower threshold locations) for lower
rating scale categories to lower endorsability (higher threshold locations) for higher rating scale categories. Rating scales were also evaluated for fit
and endorsability so that the outfit mean square statistics were < 2.00 and that differences
in threshold endorsability were > .59 but < 5.00 logits (Apple & Neff, 2012; Bond & Fox,
2007).
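These category-threshold criteria amount to a simple check that can be sketched as follows (an illustrative helper, with the Apple and Neff, 2012, and Bond and Fox, 2007, cut-offs as defaults):

```python
def check_thresholds(thresholds, min_gap=0.59, max_gap=5.00):
    """Check Rasch-Andrich threshold ordering and spacing.

    thresholds : category thresholds in logits, ordered by category

    Returns (ordered, spacing_ok): whether the thresholds advance
    monotonically, and whether each adjacent gap lies strictly between
    min_gap and max_gap logits.
    """
    gaps = [b - a for a, b in zip(thresholds, thresholds[1:])]
    ordered = all(g > 0 for g in gaps)
    spacing_ok = all(min_gap < g < max_gap for g in gaps)
    return ordered, spacing_ok
```

A violation of the first condition (disordered thresholds) usually indicates that a response category is never the most probable choice at any trait level, which is the pattern the rating scale analysis below is designed to detect.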
4.2.4.4. Differential item functioning
Differential item functioning (DIF) was investigated by using an ANOVA of
standardized residuals, which is a latent variable parametric approach, proposed by
Hagquist and Andrich (2004) and Andrich and Hagquist (2014). Although numerous
parametric and non-parametric DIF detection techniques for dichotomous and
polytomous data are available (cf. Teresi, Ramirez, Lai, & Silver, 2008), the ANOVA of
residuals method has the advantage of detecting whether DIF is statistically significant
(for both uniform and non-uniform DIF) for polytomous data (Andrich & Hagquist,
2014). In general, this technique involves applying a two-way ANOVA to the residuals
of item responses. Using the Rasch framework, standardized residuals for every person
for every item on a scale were constructed. Each person was then placed in one of a number of
class intervals (CI) according to either their total score on the scale or their person ability/trait
estimates obtained from the Rasch model. A two-way ANOVA was then conducted
on the residuals, with the CI, the grouping variable, and an interaction term between the CI
and group variables as factors, so that either variable could be controlled for when diagnosing DIF.
In this study, DIF contrasts between the groups are used as indicators of effect size.
A Bonferroni correction for the statistical significance tests, as Andrich and Hagquist (2014)
suggest (cf. Bland & Altman, 1995; Tennant & Pallant, 2007), is also applied. The
expected versus empirical item characteristic curves and non-uniform DIF item
characteristic curves for items indicating misfit and possible DIF are additionally included
to establish the severity of such DIF and to present DIF visually.
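The ANOVA-of-residuals procedure described above can be sketched as follows (a simplified, balanced-design illustration in Python; Andrich and Hagquist's procedure, as implemented in dedicated Rasch software, handles unbalanced cells and the residual standardization more carefully). A significant main effect of the group factor indicates uniform DIF, and a significant class-interval by group interaction indicates non-uniform DIF:

```python
import numpy as np
from scipy import stats

def dif_anova(resid, ci, group):
    """Two-way ANOVA of standardized item residuals for DIF detection.

    resid : standardized residuals for one item, one value per person
    ci    : class-interval label per person (trait-level strata)
    group : demographic group label per person
    """
    resid, ci, group = np.asarray(resid, dtype=float), np.asarray(ci), np.asarray(group)
    grand = resid.mean()
    ci_levels, gr_levels = np.unique(ci), np.unique(group)

    def ss_main(labels, levels):
        # between-levels sum of squares for one factor
        return sum((labels == lv).sum() * (resid[labels == lv].mean() - grand) ** 2
                   for lv in levels)

    ss_ci, ss_gr = ss_main(ci, ci_levels), ss_main(group, gr_levels)

    # cell-level (CI x group) variation and within-cell (error) variation
    ss_cells, ss_within = 0.0, 0.0
    for a in ci_levels:
        for b in gr_levels:
            cell = resid[(ci == a) & (group == b)]
            if cell.size:
                ss_cells += cell.size * (cell.mean() - grand) ** 2
                ss_within += ((cell - cell.mean()) ** 2).sum()
    ss_int = max(ss_cells - ss_ci - ss_gr, 0.0)

    df_gr = len(gr_levels) - 1
    df_int = (len(ci_levels) - 1) * df_gr
    df_w = resid.size - len(ci_levels) * len(gr_levels)
    ms_w = ss_within / df_w

    def f_and_p(ss, df):
        f = (ss / df) / ms_w
        return f, float(stats.f.sf(f, df, df_w))

    # 'group' main effect -> uniform DIF; interaction -> non-uniform DIF
    return {"group": f_and_p(ss_gr, df_gr),
            "interaction": f_and_p(ss_int, df_int)}
```

In practice the residuals would come from the fitted Rasch model for each item in turn, with the test repeated per item and the significance level Bonferroni-adjusted as described above.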
Not enough data were sourced for the Coloured and Indian demographic groups
and thus, only Black and White ethnic groups were investigated for differential item
functioning by ethnicity (refer to section 4.2.1 for a breakdown of the participants) and
the full sample was used for identification of DIF by gender. DIF was only considered
problematic if the DIF contrasts were statistically significant and were ≥ |.40| logits. This
was done because significant DIF contrasts do not always result in practically significant
DIF. Although Tennant and Pallant (2007) recommend DIF contrasts of ≥ |.50| logits as
practically significant, a decision was made to eliminate items with a DIF contrast ≥ |.40|
to optimise the core item bank. Any DIF contrast ≥ |.50| after item removal was flagged;
however, items showing DIF < |.50| after the initial item removal were retained for the core
item bank to avoid eliminating too many items from each scale or overfitting the
data to the model (Wright & Linacre, 1994).
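The retention rule above (a statistically significant contrast after Bonferroni correction and a contrast of at least |.40| logits) can be expressed compactly. In this sketch the item names and values are illustrative, and `contrasts` and `p_values` are assumed to be dictionaries keyed by item name:

```python
def flag_dif_items(contrasts, p_values, cutoff=0.40, alpha=0.05):
    """Flag items whose DIF contrast is at least `cutoff` logits in
    absolute size and statistically significant after a Bonferroni
    correction (alpha divided by the number of items tested)."""
    bonferroni = alpha / len(contrasts)
    return sorted(item for item, c in contrasts.items()
                  if abs(c) >= cutoff and p_values[item] < bonferroni)
```

Items returned by such a rule would be candidates for removal before assembling the core item bank.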
4.3. Results
4.3.1. BTI scale infit and outfit statistics
The mean summary infit and outfit mean square statistics can be viewed in Table
4.1. The items of the BTI scales evidenced satisfactory infit mean square values which
ranged between .99 (Agreeableness) and 1.04 (Neuroticism). These values are close to
the expected value of 1.00. By contrast, the mean outfit demonstrated greater deviation
from 1.00 with values ranging from 1.02 (Extraversion) to 1.07 (Conscientiousness). This
shows the presence of some unexpected responses to items that do not closely match
person trait levels.
Table 4.1
Item and person mean summary infit and outfit statistics
Fit Statistics E N C O A
Person Infit 1.02 1.09 1.17 1.10 1.10
SD .51 .65 .73 .70 .72
Person Outfit 1.02 1.05 1.07 1.06 1.04
SD .56 .63 .61 .64 .65
Item Infit 1.01 1.04 1.00 1.01 .99
SD .19 .21 .31 .29 .26
Item Outfit 1.02 1.05 1.07 1.06 1.04
SD .20 .24 .43 .32 .29
Note. E = Extraversion; N = Neuroticism; C = Conscientiousness; O = Openness; A =
Agreeableness
Inspection of individual item-fit statistics (see Table 4.2 and Table 4.3) indicated
some misfitting items according to their infit and outfit mean square statistics, respectively. These
items were flagged for removal. However, the majority of the items per scale fit the Rasch
rating scale model satisfactorily. Before the flagged items were removed on the basis of their infit and
outfit mean square statistics, the scales were investigated for person separation, reliability
and rating scale functioning.
Table 4.2
Item infit mean squares for the BTI scales
Extraversion Neuroticism Conscientiousness Openness Agreeableness
Item Infit Item Infit Item Infit Item Infit Item Infit
bti37† 1.41 bti59† 1.68 bti107† 1.89 bti144† 1.99 bti186† 1.65
bti12 1.38 bti64† 1.60 bti77† 1.87 bti143† 1.56 bti168† 1.46
bti34 1.37 bti56 1.39 bti78† 1.69 bti147 1.39 bti187† 1.46
bti33 1.30 bti57 1.35 bti86† 1.51 bti145 1.37 bti177† 1.46
bti32 1.28 bti47 1.27 bti114† 1.49 bti121 1.31 bti182† 1.45
bti3 1.28 bti45 1.21 bti89† 1.43 bti135 1.26 bti169 1.33
bti36 1.26 bti67 1.12 bti90 1.14 bti122 1.25 bti161 1.27
bti15 1.17 bti62 1.12 bti88 1.13 bti146 1.21 bti184 1.18
bti31 1.14 bti54 1.10 bti98 1.12 bti130 1.15 bti175 1.13
bti22 1.13 bti61 1.10 bti101 1.11 bti139 1.14 bti171 1.13
bti16 1.11 bti55 1.08 bti81 1.11 bti138 1.05 bti156 1.08
bti38 1.06 bti46 1.06 bti79 1.11 bti137 1.01 bti183 1.06
bti17 1.04 bti63 1.05 bti91 1.04 bti136 1.00 bti173 1.05
bti4 1.01 bti43 1.05 bti120 1.01 bti127 1.00 bti160 1.04
bti2 1.01 bti71 1.01 bti100 1.01 bti152 .98 bti162 .97
bti14 .97 bti70 1.01 bti87 1.00 bti123 .96 bti157 .97
bti10 .96 bti50 .99 bti97 .98 bti148 .94 bti192 .95
bti13 .95 bti69 .98 bti115 .96 bti124 .92 bti172 .94
bti35 .94 bti60 .97 bti84 .95 bti153 .83 bti170 .93
bti6 .94 bti66 .97 bti82 .95 bti149 .83 bti185 .91
bti5 .94 bti51 .97 bti103 .90 bti151 .82 bti159 .90
bti9 .93 bti68 .96 bti92 .89 bti133 .81 bti158 .89
bti28 .93 bti52 .96 bti113 .89 bti129 .81 bti193 .89
bti25 .92 bti49 .95 bti111 .88 bti132 .80 bti166 .87
bti20 .90 bti40 .95 bti80 .87 bti126 .78 bti176 .85
bti21 .89 bti41 .89 bti110 .87 bti140 .77 bti181 .84
bti30 .88 bti44 .89 bti116 .84 bti134 .76 bti188 .82
bti1 .85 bti53 .86 bti117 .82 bti150 .75 bti167 .81
bti24 .82 bti42 .83 bti105 .81 bti141 .73 bti178 .81
bti18 .81 bti75 .83 bti83 .81 bti125 .73 bti164 .80
bti7 .80 bti74 .81 bti104 .80 bti131 .68 bti165 .79
bti26 .80 bti72 .77 bti106 .78 bti154 .68 bti163 .76
bti11 .78 bti73 .77 bti94 .77 bti174 .72
bti27 .77 bti65 .74 bti109 .77 bti189 .71
bti29 .77 bti108 .76 bti180 .69
bti19 .76 bti118 .73 bti191 .63
bti93 .73 bti190 .60
bti95 .71
bti99 .70
bti119 .68
bti102 .64
Note. E = Extraversion; N = Neuroticism; C = Conscientiousness; O = Openness; A =
Agreeableness. Items are presented in descending order according to their infit mean squares. † =
items with infit ≥ 1.40.
Table 4.3
Item outfit mean squares for the BTI scales
Extraversion Neuroticism Conscientiousness Openness Agreeableness
Item Outfit Item Outfit Item Outfit Item Outfit Item Outfit
bti37† 1.44 bti59† 1.86 bti107† 2.39 bti144† 2.10 bti186† 1.79
bti33 1.40 bti64† 1.72 bti77† 2.35 bti143† 1.77 bti182† 1.64
bti12 1.39 bti56 1.38 bti78† 1.96 bti147† 1.50 bti168† 1.58
bti34 1.39 bti47 1.35 bti114† 1.80 bti121† 1.46 bti187† 1.56
bti32 1.32 bti57 1.16 bti86† 1.73 bti145 1.23 bti177† 1.47
bti3 1.30 bti45 1.10 bti89† 1.69 bti135 1.35 bti169 1.40
bti36 1.29 bti54 1.20 bti98 1.26 bti146 1.34 bti161 1.36
bti15 1.17 bti67 1.19 bti88 1.25 bti130 1.30 bti184 1.23
bti22 1.17 bti61 1.15 bti101 1.20 bti122 1.29 bti175 1.20
bti16 1.14 bti62 1.11 bti120 1.16 bti139 1.19 bti171 1.16
bti31 1.14 bti63 1.11 bti90 1.13 bti138 1.13 bti183 1.15
bti38 1.08 bti55 .99 bti81 1.08 bti136 1.09 bti160 1.14
bti17 1.07 bti46 1.00 bti79 1.02 bti152 1.08 bti156 1.13
bti4 1.01 bti71 1.06 bti97 1.08 bti137 1.03 bti173 1.08
bti2 .97 bti70 1.05 bti103 1.07 bti123 1.01 bti192 1.01
bti14 .96 bti43 1.01 bti115 1.06 bti127 1.00 bti162 .99
bti35 .96 bti60 1.03 bti91 .94 bti148 .99 bti172 .98
bti10 .96 bti49 1.02 bti87 1.03 bti124 .89 bti185 .98
bti13 .92 bti66 1.00 bti84 1.02 bti153 .88 bti157 .98
bti6 .94 bti69 1.00 bti100 .88 bti133 .85 bti170 .96
bti5 .94 bti40 1.00 bti82 1.01 bti149 .84 bti158 .95
bti9 .94 bti50 .97 bti104 .97 bti134 .84 bti166 .95
bti28 .91 bti51 .81 bti92 .94 bti132 .83 bti193 .94
bti25 .93 bti41 .97 bti116 .91 bti151 .81 bti159 .92
bti21 .91 bti68 .96 bti111 .89 bti129 .80 bti165 .89
bti20 .91 bti52 .94 bti113 .80 bti140 .80 bti176 .88
bti30 .88 bti42 .90 bti80 .77 bti126 .76 bti181 .88
bti1 .86 bti75 .89 bti110 .83 bti141 .76 bti167 .87
bti24 .86 bti44 .86 bti105 .86 bti125 .75 bti188 .84
bti18 .82 bti53 .76 bti118 .82 bti150 .74 bti178 .81
bti7 .81 bti74 .79 bti117 .80 bti131 .72 bti164 .80
bti26 .79 bti72 .80 bti83 .80 bti154 .71 bti163 .77
bti11 .79 bti73 .72 bti94 .78 bti174 .68
bti27 .78 bti65 .72 bti106 .72 bti189 .70
bti29 .77 bti108 .77 bti180 .67
bti19 .77 bti109 .75 bti191 .63
bti93 .74 bti190 .62
bti95 .70
bti99 .63
bti119 .69
bti102 .63
Note. E = Extraversion; N = Neuroticism; C = Conscientiousness; O = Openness; A = Agreeableness.
Items are presented in descending order according to their outfit mean squares.
† = items with outfit ≥ 1.40
4.3.2. Person separation and reliability indices
Person separation and reliability indices for each of the BTI scales, before the
removal of any items, are presented in Table 4.4. Person separation indices ranged between
2.36 and 3.03 and person reliability ranged between .85 and .90, indicating that the scales
of the BTI were able to distinguish between individuals with either a high or low level of
the measured personality traits. Cronbach alpha reliability coefficients ranged between
.87 and .94 for the BTI scales, indicating acceptable consistency of measurement. These
results were encouraging, especially since poor-fitting items were included in the analysis.
Table 4.4
Person separation and reliability indices
Construct Person Separation Person Reliability α
Extraversion 2.36 .85 .87
Neuroticism 3.01 .90 .93
Conscientiousness 3.03 .90 .94
Openness 2.39 .85 .88
Agreeableness 2.59 .87 .89
Note. α = Cronbach’s alpha internal consistency reliability
4.3.3. Rating scale performance
Rating scale performance indices for each of the BTI scales can be viewed in Table
4.5. The BTI items employ a five-point rating scale format (1 = Strongly Disagree; 2 =
Disagree; 3 = Neither Agree nor Disagree; 4 = Agree; 5 = Strongly Agree). For each of
the five scales the rating scale categories demonstrated a sufficient frequency of usage by
the participants. Outfit mean square statistics for each category of the rating scales
indicated good fit (i.e., outfit mean square < 1.40), with Conscientiousness, Openness and
Agreeableness demonstrating poorer fit for the lowest rating scale category,
although the infit mean square for Openness was marginally acceptable. All other
categories evidenced acceptable infit and outfit mean squares. Differences in threshold
endorsability remained within the parameter range for each of the BTI scales indicating
good general threshold separation with no disordering of the thresholds detected.
Table 4.5.
Rating scale performance indices
Construct Category Measure Rasch-Andrich Thresholds Percentage of Responses Outfit Infit
Extraversion 1 -2.40 NONE 5.38% 1.25 1.14
2 -1.06 -.92 10.64% .93 .92
3 -.11 -.68 24.20% .90 .92
4 .99 .17 35.84% .91 .92
5 2.71 1.44 23.95% 1.04 1.06
Neuroticism 1 -2.59 NONE 36.58% .98 .95
2 -1.00 -1.27 31.81% .76 .95
3 .02 -.24 17.94% .89 .91
4 1.01 .31 9.87% 1.16 1.06
5 2.54 1.20 3.80% 1.79 1.44
Conscientiousness 1 -2.49 NONE 1.55% 3.01 1.81
2 -1.14 -1.01 4.04% 1.29 1.12
3 -.17 -.76 14.20% .94 .93
4 1.05 .05 38.93% .71 .90
5 2.95 1.72 41.28% .93 .91
Openness 1 -2.35 NONE 3.62% 1.79 1.39
2 -1.03 -.86 8.18% 1.05 1.01
3 -.12 -.64 20.88% .86 .88
4 .96 .09 37.67% .85 .95
5 2.68 1.41 29.66% .95 .92
Agreeableness 1 -2.51 NONE 4.45% 1.90 1.49
2 -1.11 -1.07 9.23% .96 .95
3 -.13 -.64 20.78% .83 .87
4 1.04 .11 37.94% .86 .97
5 2.84 1.60 27.60% .95 .94
The person/item dispersion maps, which display the location of items in relation
to the location of persons for each scale, can be viewed in Figures 4.3.3a (Extraversion),
4.3.3b (Neuroticism), 4.3.3c (Conscientiousness), 4.3.3d (Openness), and 4.3.3e
(Agreeableness). The person/item dispersion maps indicate the endorsability of each
item’s Rasch-Andrich threshold (thresholds 1, 2, 3 and 4) for the five-point Likert-type
scale. In general, the person/item dispersion maps indicate good targeting of items and
persons with item thresholds spread across the trait levels of persons in the sample.
However, the Conscientiousness scale evidenced fewer items targeted at a high level of
the trait than the other scales. Neuroticism evidenced the opposite pattern with fewer
items targeted at lower levels of the trait.
Figure 4.3.3a. Person/item distribution for the Extraversion scale
Figure 4.3.3b. Person/item distribution for the Neuroticism scale
Figure 4.3.3c. Person/item distribution for the Conscientiousness scale
Figure 4.3.3d. Person/item distribution for the Openness scale
Figure 4.3.3e. Person/item distribution for the Agreeableness scale
4.3.4. Differential item functioning
Differential item functioning was investigated for ethnicity (Black and White) and
gender. The DIF contrasts (i.e., the difference in item locations) by ethnic group and by
gender can be viewed in Table 4.6 and Table 4.7, respectively. A DIF contrast with a
negative sign indicates that Black individuals and women scored lower than White
individuals and men, respectively. Conversely, a positive sign indicates that Black
individuals and women scored higher than their respective referent groups. An ANOVA
of residuals was used to determine whether DIF was uniform or non-uniform in nature.
Table 4.6
Practically significant DIF by ethnicity
Extraversion Neuroticism Conscientiousness Openness Agreeableness
Item Contrast Item Contrast Item Contrast Item Contrast Item Contrast
bti1 -.18 bti40 .35 bti77† 1.02 bti121†† .59 bti156 -.36
bti2† -.51 bti41† .39 bti78† .87 bti122† .84 bti157† -.40
bti3† -.51 bti42 .12 bti79† -.49 bti123† -.57 bti158 .16
bti4† -.55 bti43† .40 bti80† -.41 bti124 -.12 bti159 -.24
bti5 -.21 bti44 .37 bti81 .00 bti125 .15 bti160† -.47
bti6 -.34 bti45† .56 bti82 .00 bti126 -.24 bti161 .20
bti7 -.25 bti46 .22 bti83† -.46 bti127 .30 bti162 -.37
bti9 -.14 bti47 -.30 bti84 .28 bti129 .13 bti163 -.33
bti10 .03 bti49 .00 bti86 .12 bti130 -.11 bti164 .25
bti11 .00 bti50 .19 bti87 -.31 bti131 .11 bti165 .00
bti12 .00 bti51 .36 bti88† .49 bti132 .24 bti166 .00
bti13† -.53 bti52 -.32 bti89† .44 bti133 -.25 bti167 -.17
bti14 -.08 bti53 .06 bti90† -.76 bti134 .27 bti168† .49
bti15† .43 bti54 -.29 bti91† -.98 bti135 .26 bti169 .37
bti16† .47 bti55 -.36 bti92† -.83 bti136† -.59 bti170 .00
bti17 -.03 bti56 .22 bti93 -.11 bti137 .14 bti171 .00
bti18 -.26 bti57†† -.41 bti94 -.19 bti138 -.19 bti172 .10
bti19 -.15 bti59† -.84 bti95 -.29 bti139 .12 bti173 .16
bti20 .00 bti60 -.35 bti97† .46 bti140 -.28 bti174 -.37
bti21 .13 bti61 .06 bti98† .92 bti141 -.32 bti175 -.16
bti22 .29 bti62† .44 bti99 .21 bti143 -.15 bti176 -.24
bti24 .17 bti63 .24 bti100 .02 bti144 .21 bti177 -.30
bti25 .18 bti64 .10 bti101 .17 bti145 -.31 bti178 .00
bti26 -.36 bti65 .00 bti102 .00 bti146 .14 bti180 -.22
bti27 -.35 bti66 .22 bti103 .24 bti147 .15 bti181 -.20
bti28 -.11 bti67 -.21 bti104 -.26 bti148 -.18 bti182† .51
bti29 -.03 bti68 .00 bti105 .14 bti149 .00 bti183 .19
bti30 .07 bti69 -.08 bti106 -.31 bti150† -.48 bti184 .15
bti31† .50 bti70 -.34 bti107 .00 bti151 -.34 bti185 -.15
bti32† .72 bti71† -.61 bti108 -.39 bti152 .11 bti186 .22
bti33 .39 bti72 -.13 bti109 -.36 bti153 -.15 bti187 -.02
bti34† -.50 bti73† .55 bti110† -.47 bti154 -.23 bti188 .12
95
bti35 .25 bti74 .06 bti111† -.58 bti189 -.11
bti36† .66 bti75 .09 bti113† -.62 bti190 .00
bti37 .03 bti114†† .54 bti191 -.09
bti38 .32 bti115 -.08 bti192 .29
bti116 -.07 bti193 .20
bti117 -.20
bti118 -.02
bti119 -.14
bti120 .03
Note. DIF contrasts ≥ |.40| are printed in boldface.
† = practically significant uniform DIF only
†† = practically significant uniform and non-uniform DIF (ANOVA of residuals)
Numerous items evidenced practically significant DIF (i.e. DIF contrasts ≥ |.40|) by ethnicity. The largest DIF contrasts by ethnicity were observed for the Conscientiousness scale [i.e., bti77 (DIF contrast = 1.02), bti91 (DIF contrast = -.98), bti78 (DIF contrast = .87), and bti92 (DIF contrast = -.83)]. Most of these items demonstrated uniform DIF by ethnicity, although bti121 (Openness), bti57 (Neuroticism), and bti114 (Conscientiousness) also demonstrated practically significant non-uniform DIF by ethnicity.
Fewer items evidenced practically significant DIF by gender. These items demonstrated uniform DIF by gender, with no items demonstrating practically significant non-uniform DIF.
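The flagging rule applied throughout this chapter can be sketched in a few lines. This is an illustrative reconstruction, not the analysis code used in the study; the item labels and contrast values below are copied from the ethnicity DIF table above.

```python
# Sketch of the DIF flagging rule: an item's DIF contrast is the difference
# in its Rasch difficulty between two groups (here Black vs. White), and
# contrasts of at least |.40| logits are flagged as practically significant.

def flag_dif(contrasts, cutoff=0.40):
    """Return the items whose absolute DIF contrast meets the cutoff."""
    return {item: c for item, c in contrasts.items() if abs(c) >= cutoff}

dif_by_ethnicity = {
    "bti91": -0.98,   # Conscientiousness: large uniform DIF
    "bti98": 0.92,
    "bti20": 0.00,    # no DIF
    "bti33": 0.39,    # just below the cutoff, so not flagged
}

flagged = flag_dif(dif_by_ethnicity)
```

Under this rule bti91 and bti98 are flagged, while bti33 narrowly escapes the cutoff, which mirrors the pattern reported in the tables.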
The Conscientiousness scale was the only scale that evidenced no practically significant DIF
by gender at all.
Table 4.7
Practically significant DIF by gender
Extraversion Neuroticism Conscientiousness Openness Agreeableness
Item Contrast Item Contrast Item Contrast Item Contrast Item Contrast
bti1 .17 bti40 -.07 bti77 -.26 bti121† -.52 bti156 .34
bti2 .32 bti41 .02 bti78 -.13 bti122 -.39 bti157 .16
bti3 .38 bti42 -.07 bti79 .10 bti123 -.17 bti158 .14
bti4 .30 bti43 .07 bti80 -.03 bti124 -.02 bti159 .18
bti5 .22 bti44 -.02 bti81 .00 bti125 -.19 bti160 .28
bti6† .41 bti45 -.28 bti82 .00 bti126 .00 bti161 .20
bti7 .12 bti46 .30 bti83 .05 bti127 .17 bti162 .17
bti9 -.14 bti47 .33 bti84 -.08 bti129 .00 bti163 .00
bti10 -.32 bti49 .00 bti86 -.12 bti130 .19 bti164 -.02
bti11 -.08 bti50 .00 bti87 .21 bti131 .14 bti165 .00
bti12 -.12 bti51 -.28 bti88 -.26 bti132 .00 bti166 .03
bti13 .26 bti52 .16 bti89 -.16 bti133 .19 bti167 .19
bti14 .08 bti53 .00 bti90 .16 bti134 .00 bti168 -.12
bti15 -.09 bti54 .38 bti91 .23 bti135 -.03 bti169 .11
bti16 -.12 bti55 .34 bti92 .21 bti136 -.10 bti170 .00
bti17 -.33 bti56 .27 bti93 .15 bti137 -.06 bti171 -.38
bti18 -.14 bti57† .40 bti94 .29 bti138 .16 bti172 -.16
bti19 -.16 bti59 .36 bti95 .09 bti139 -.31 bti173 -.17
bti20 -.22 bti60 .34 bti97 -.11 bti140 .29 bti174 -.23
bti21 -.17 bti61 -.31 bti98 -.35 bti141 .16 bti175 .00
bti22 -.26 bti62 -.23 bti99 -.11 bti143 .26 bti176 .00
bti24 -.19 bti63 .05 bti100 -.07 bti144 -.27 bti177 .21
bti25 -.12 bti64 -.19 bti101 -.16 bti145 -.34 bti178 -.32
bti26 -.07 bti65 -.03 bti102 -.13 bti146 .00 bti180 .02
bti27 -.17 bti66 -.19 bti103 .24 bti147 -.11 bti181 -.10
bti28 -.10 bti67 .18 bti104 .14 bti148 -.07 bti182 -.07
bti29 .00 bti68† -.49 bti105 -.08 bti149 -.10 bti183 .10
bti30 -.06 bti69 .10 bti106 .08 bti150 .36 bti184 .16
bti31 -.18 bti70 .10 bti107 -.24 bti151† .52 bti185 .00
bti32 -.14 bti71 -.12 bti108 .28 bti152 .10 bti186 .12
bti33 -.13 bti72 -.11 bti109 .24 bti153 .00 bti187 -.29
bti34† .61 bti73† -.42 bti110 .19 bti154 .26 bti188 -.19
bti35 .00 bti74† -.49 bti111 .06 bti189 -.18
bti36 .26 bti75 .02 bti113 .17 bti190 -.14
bti37 -.08 bti114 -.08 bti191 -.35
bti38 .13 bti115 .08 bti192† -.51
bti116 -.11 bti193 -.07
bti117 -.12
bti118 .00
bti119 .09
bti120 .00
Note.
† = practically significant uniform DIF only
†† = practically significant uniform and non-uniform DIF (ANOVA of residuals)
4.3.5. Criteria for item exclusion from the core item bank
Items that were flagged for removal fell into five categories, namely: (a) items that fit the Rasch rating scale model poorly and evidenced DIF by gender and/or ethnicity; (b) items that fit the Rasch rating scale model poorly and evidenced no DIF by gender or ethnicity; (c) items that demonstrated DIF by gender and ethnicity jointly; (d) items that demonstrated DIF by ethnicity alone; and (e) items that demonstrated DIF by gender alone.
Although some items indicated only marginal misfit and no DIF by ethnicity or gender, a conservative approach was taken and these items were also flagged for removal. To generate the best functioning core item bank for the BTI computer adaptive test, the decision was taken to exclude all poorly functioning items from computer adaptive testing. All flagged items were therefore removed, all analyses were re-run, and an ANOVA of residuals was generated for the revised item set.
In total, 13 items were flagged for removal from the Extraversion scale; 11 items from the Neuroticism scale; 18 items from the Conscientiousness scale; 9 items from the Openness scale; and 8 items from the Agreeableness scale. After the flagged items were removed, the Extraversion, Neuroticism, Conscientiousness and Openness scales were left with 23 items each, and the Agreeableness scale was left with 29 items.
4.3.6. Functioning of the ‘core’ BTI scales
The functioning of the BTI items for each scale was re-investigated after the items identified in section 4.3.3 were removed. All analyses, including the infit and outfit mean squares, person and item separation and reliability, rating scale performance, and DIF analyses, were re-run. This was done to determine how item removal affected fit to the Rasch rating scale model, person separation and reliability, and rating scale performance, and whether DIF was still present for men and women, and for the Black and White ethnic groups respectively.
4.3.6.1. BTI scale infit and outfit statistics after item removal
The mean summary infit and outfit mean square statistics can be viewed in Table 4.8. The items of the BTI scales evidenced improved infit and slightly poorer outfit mean square values, ranging between 1.00 and 1.08 for the outfit mean squares and between .99 and 1.02 for the infit mean squares. The greatest improvement in fit was evidenced in the infit and outfit mean square standard deviations. Although average outfit deteriorated slightly for some scales, the average infit improved, which may indicate idiosyncratic responding by some test-takers to items that do not match their trait levels rather than poor fit per se to the Rasch rating scale model.
Table 4.8
Item and person mean summary infit and outfit statistics after item removal
Fit Statistics E N C O A
Person Infit 1.02 1.07 1.10 1.09 1.11
SD .59 .67 .77 .74 .74
Person Outfit 1.01 1.04 1.08 1.05 1.04
SD .59 .65 .75 .70 .68
Item Infit .99 1.02 1.01 1.01 1.00
SD .18 .12 .17 .22 .23
Item Outfit 1.00 1.04 1.08 1.05 1.04
SD .19 .16 .23 .25 .26
Note. E = Extraversion; N = Neuroticism; C = Conscientiousness; O = Openness; A =
Agreeableness
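The infit and outfit statistics reported throughout are the standard Rasch mean squares. The sketch below shows their usual computation from residuals; it is a generic illustration, not the software used in the study, and the numbers are invented rather than taken from the BTI data.

```python
def fit_mean_squares(observed, expected, variances):
    """Outfit is the unweighted mean of squared standardized residuals;
    infit weights each squared residual by its model variance, making it
    less sensitive to unexpected responses on off-target items."""
    z2 = [(o - e) ** 2 / v for o, e, v in zip(observed, expected, variances)]
    outfit = sum(z2) / len(z2)
    infit = sum((o - e) ** 2 for o, e in zip(observed, expected)) / sum(variances)
    return infit, outfit

# Illustrative values only: three observed ratings, their model
# expectations, and the model variances of those expectations.
infit, outfit = fit_mean_squares([4, 3, 5], [3.5, 3.2, 4.1], [0.8, 0.9, 0.7])
```

Because infit down-weights residuals from responses far off target, it is the statistic emphasised later in the chapter for computer adaptive testing, where item difficulty is matched to the person's trait level.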
The infit and outfit mean square statistics for each item of each BTI scale can be viewed in Table 4.9 and Table 4.10 respectively. Overall, infit and outfit are much improved with the flagged items removed. Although some items still demonstrate poor outfit, very few items demonstrate poor infit. Only bti12 (Extraversion), bti146 and bti135 (Openness), and bti169 (Agreeableness) evidenced both poor infit and poor outfit mean square statistics. Only bti169 indicated a moderately substantial deviation from acceptable infit and outfit values, whereas bti146 and bti135 deviated only slightly from the acceptable fit cut-off values.
Table 4.9
Item infit statistics for the scales of the BTI after flagged items were removed
Extraversion Neuroticism Conscientiousness Openness Agreeableness
Item Infit Item Infit Item Infit Item Infit Item Infit
bti12† 1.50 bti47 1.31 bti120 1.23 bti146† 1.41 bti169† 1.59
bti38 1.34 bti54 1.16 bti81 1.34 bti135† 1.43 bti161 1.40
bti35 1.20 bti61 1.20 bti87 1.27 bti130 1.28 bti184 1.40
bti22 1.21 bti67 1.16 bti84 1.19 bti145 1.42 bti175 1.29
bti17 1.08 bti63 1.14 bti103 1.09 bti139 1.26 bti183 1.24
bti14 1.10 bti70 1.09 bti101 1.26 bti138 1.13 bti171 1.19
bti5 1.09 bti55 1.12 bti82 1.15 bti152 1.08 bti156 1.16
bti21 .97 bti60 1.02 bti115 1.14 bti148 1.02 bti173 1.15
bti25 1.00 bti46 1.09 bti100 1.14 bti137 1.10 bti193 1.03
bti9 1.00 bti69 1.04 bti116 .99 bti127 1.10 bti158 1.02
bti10 1.00 bti66 1.04 bti104 .95 bti124 1.02 bti172 .99
bti28 .95 bti49 .99 bti118 .86 bti153 .93 bti162 1.05
bti20 .94 bti50 1.04 bti105 .91 bti133 .88 bti170 1.02
bti1 .94 bti40 .98 bti108 .93 bti149 .91 bti166 .96
bti30 .92 bti51 1.01 bti94 .91 bti134 .83 bti185 .99
bti7 .90 bti41 .93 bti117 .94 bti126 .88 bti167 .91
bti24 .86 bti52 .99 bti93 .89 bti132 .84 bti165 .88
bti11 .87 bti75 .88 bti106 .90 bti129 .85 bti159 .95
bti18 .82 bti44 .94 bti109 .89 bti140 .82 bti176 .92
bti29 .80 bti42 .86 bti95 .86 bti125 .82 bti181 .88
bti26 .79 bti72 .84 bti119 .78 bti131 .75 bti188 .90
bti19 .79 bti53 .89 bti99 .79 bti141 .77 bti164 .85
bti27 .77 bti65 .78 bti102 .71 bti154 .74 bti178 .85
bti163 .78
bti189 .74
bti174 .73
bti180 .71
bti191 .68
bti190 .64
Note. E = Extraversion; N = Neuroticism; C = Conscientiousness; O = Openness; A =
Agreeableness. Items are presented in descending order according to their infit mean squares. †
= infit ≥ 1.40.
Table 4.10
Item outfit statistics for the scales of the BTI after flagged items were removed
Extraversion Neuroticism Conscientiousness Openness Agreeableness
Item Outfit Item Outfit Item Outfit Item Outfit Item Outfit
bti12† 1.53 bti47 1.38 bti120† 1.48 bti146† 1.62 bti169† 1.69
bti38 1.40 bti54 1.27 bti81† 1.42 bti135† 1.58 bti161† 1.53
bti35 1.26 bti61 1.26 bti87 1.38 bti130† 1.45 bti184† 1.51
bti22 1.25 bti67 1.25 bti84 1.38 bti145 1.24 bti175 1.39
bti17 1.10 bti63 1.22 bti103 1.36 bti139 1.31 bti183 1.37
bti14 1.09 bti70 1.12 bti101 1.35 bti138 1.22 bti171 1.23
bti5 1.10 bti55 1.03 bti82 1.26 bti152 1.18 bti156 1.22
bti21 1.01 bti60 1.09 bti115 1.26 bti148 1.15 bti173 1.19
bti25 1.01 bti46 .98 bti100 1.02 bti137 1.13 bti193 1.15
bti9 .99 bti69 1.08 bti116 1.12 bti127 1.11 bti158 1.11
bti10 1.00 bti66 1.07 bti104 1.09 bti124 .98 bti172 1.07
bti28 .91 bti49 1.06 bti118 .98 bti153 .99 bti162 1.06
bti20 .94 bti50 1.01 bti105 .97 bti133 .92 bti170 1.06
bti1 .94 bti40 1.03 bti108 .97 bti149 .91 bti166 1.04
bti30 .92 bti51 .83 bti94 .95 bti134 .90 bti185 1.04
bti7 .92 bti41 .99 bti117 .92 bti126 .87 bti167 .98
bti24 .90 bti52 .97 bti93 .93 bti132 .86 bti165 .97
bti11 .87 bti75 .95 bti106 .85 bti129 .83 bti159 .96
bti18 .83 bti44 .90 bti109 .88 bti140 .85 bti176 .96
bti29 .81 bti42 .92 bti95 .87 bti125 .84 bti181 .94
bti26 .77 bti72 .89 bti119 .81 bti131 .80 bti188 .92
bti19 .78 bti53 .78 bti99 .79 bti141 .77 bti164 .85
bti27 .76 bti65 .76 bti102 .69 bti154 .77 bti178 .84
bti163 .78
bti189 .74
bti174 .68
bti180 .70
bti191 .69
bti190 .65
Note. E = Extraversion; N = Neuroticism; C = Conscientiousness; O = Openness; A =
Agreeableness. Items are presented in descending order according to their outfit mean squares. †
= outfit ≥ 1.40.
4.3.6.2. Person separation and reliability indices after item removal
Person separation and reliability indices are presented in Table 4.11. After flagged item removal, person separation indices ranged between 2.12 and 2.65 and person reliability ranged between .82 and .89. This indicates that almost no deterioration of internal consistency reliability or person separation occurred with the flagged items removed. The BTI scales therefore maintain their ability to distinguish between test-takers with high and low levels of the measured personality traits, and do so in a stable and consistent manner.
Table 4.11
Person separation and reliability indices after item removal
Construct Number of items examined Person Separation Person Reliability α
Extraversion 23 2.12 .82 .84
Neuroticism 23 2.65 .88 .91
Conscient. 23 2.54 .89 .93
Openness 23 2.25 .84 .86
Agreeableness 29 2.59 .87 .90
Note. α = Cronbach’s alpha
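Separation and reliability in Table 4.11 are linked by a simple closed-form identity in the Rasch framework. The sketch below assumes that identity; it is not the software used in the study, and small discrepancies against the reported values are expected because the reported indices derive from model standard errors.

```python
import math

def separation_from_reliability(r):
    """Person separation implied by a person reliability of r."""
    return math.sqrt(r / (1.0 - r))

def reliability_from_separation(sep):
    """Inverse relation: reliability implied by a separation index."""
    return sep * sep / (1.0 + sep * sep)
```

For example, `separation_from_reliability(0.82)` gives roughly 2.13, close to the 2.12 reported for Extraversion in Table 4.11.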
4.3.6.3. Rating scale performance indices after item removal
Rating scale performance indices can be viewed in Table 4.12. Each rating scale category demonstrated a sufficient frequency of usage by the participants after the flagged items were removed. The outfit mean square statistics for each rating scale category indicated good fit, with only Conscientiousness demonstrating an aberrant outfit value for the first rating scale category. Differences in threshold endorsability remained within the specified parameters, indicating good general threshold separation.
Table 4.12
Rating scale performance indices after item removal
Construct Category Measure Threshold No. of Responses Percentage of Responses Outfit Infit
Extraversion 1 -2.46 NONE 3010 6.68% 1.20 1.13
2 -1.09 -.99 5384 11.96% .90 .91
3 -.12 -.71 11376 25.26% .89 .92
4 1.02 .20 15651 34.76% .93 .93
5 2.76 1.49 9609 21.34% 1.05 1.07
Neuroticism 1 -2.72 NONE 15482 34.36% .98 .96
2 -1.07 -1.43 14765 32.77% .77 .93
3 .02 -.29 8786 19.50% .92 .92
4 1.08 .37 4539 10.07% 1.14 1.05
5 2.67 1.35 1490 3.31% 1.71 1.37
Conscientious. 1 -2.79 NONE 355 .79% 3.04 1.67
2 -1.36 -1.34 1281 2.84% 1.39 1.12
3 -.27 -.96 5778 12.82% 1.09 .98
4 1.26 .03 18739 41.57% .79 .88
5 3.44 2.27 18929 41.99% .93 .91
Openness 1 -2.56 NONE 2269 3.62% 1.96 1.44
2 -1.14 -1.12 5128 8.18% 1.08 1.03
3 -.14 -.70 13092 20.88% .91 .91
4 1.08 .14 23618 37.67% .84 .93
5 2.92 1.68 18596 29.66% .94 .92
Agreeableness 1 -2.68 NONE 1821 3.20% 2.16 1.62
2 -1.23 -1.27 4342 7.64% 1.04 1.00
3 -.15 -.77 11706 20.60% .88 .89
4 1.16 .19 22447 39.50% .81 .91
5 3.07 1.84 16513 29.06% .94 .93
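The threshold separation noted above can be checked mechanically: Andrich thresholds should advance monotonically, and disordered thresholds signal a category that may need collapsing. A small sketch, using the Extraversion and Conscientiousness thresholds from Table 4.12 (the check itself is a generic diagnostic, not the study's own code):

```python
def thresholds_ordered(thresholds):
    """True when the Andrich thresholds advance strictly; disordered
    thresholds point to a rating scale category that is never modal and
    may need to be collapsed with a neighbour."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))

extraversion_thresholds = [-0.99, -0.71, 0.20, 1.49]        # Table 4.12
conscientiousness_thresholds = [-1.34, -0.96, 0.03, 2.27]   # Table 4.12
```

Both scales pass the check, consistent with the good general threshold separation reported above.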
The person/item distribution maps, which indicate the location of items in relation
to persons for each scale, can be viewed in Figure 4.3.5.3a (Extraversion), 4.3.5.3b
(Neuroticism), 4.3.5.3c (Conscientiousness), 4.3.5.3d (Openness), and 4.3.5.3e
(Agreeableness). Although item-to-person targeting remained good for Extraversion, Neuroticism, Openness, and Agreeableness, Conscientiousness demonstrated poorer item/person targeting at low levels of the trait.
Figure 4.3.5.3a. Person/item distribution for the core Extraversion scale
Figure 4.3.5.3b. Person/item distribution for the core Neuroticism scale
Figure 4.3.5.3c. Person/item distribution for the core Conscientiousness scale
Figure 4.3.5.3d. Person/item distribution for the core Openness scale
Figure 4.3.5.3e. Person/item distribution for the core Agreeableness scale
4.3.6.4. Differential item functioning after item removal
With the flagged items removed, only a few items evidenced DIF by ethnicity, namely bti41 (Neuroticism), bti84 (Conscientiousness), and bti162 and bti169 (Agreeableness) (refer to Table 4.13). Although these items demonstrate practically significant uniform DIF contrasts for the Black and White groups, the contrasts are only marginally above |.40|. It is also evident that the number of items indicating DIF was reduced substantially for each scale of the BTI. Since no items demonstrated DIF > |.50|, the DIF for the items of the BTI was considered acceptable.
Table 4.13
Practically significant DIF by ethnicity
Extraversion Neuroticism Conscientiousness Openness Agreeableness
Item Contrast Item Contrast Item Contrast Item Contrast Item Contrast
bti1 -.18 bti40 .38 bti81 .03 bti124 -.13 bti156 -.39
bti5 -.21 bti41† .41 bti82 .11 bti125 .17 bti158 .20
bti7 -.26 bti42 .13 bti84† .43 bti126 -.26 bti159 -.24
bti9 -.14 bti44† .40 bti87 -.29 bti127 .33 bti161 .24
bti10 .07 bti46 .24 bti93 -.05 bti129 .14 bti162† -.40
bti11 .00 bti47 -.32 bti94 -.15 bti130 -.12 bti163 -.34
bti12 .00 bti49 .00 bti95 -.27 bti131 .12 bti164 .29
bti14 -.07 bti50 .21 bti99 .33 bti132 .28 bti165 .00
bti17 .00 bti51 .38 bti100 .12 bti133 -.28 bti166 .00
bti18 -.28† bti52 -.34 bti101 .30 bti134 .31 bti167 -.18
bti19 -.15 bti53 .06 bti102 .07 bti135 .30 bti169† .44
bti20 .00 bti54 -.32 bti103 .39 bti137 .16 bti170 .03
bti21 .17 bti55 -.38 bti104 -.23 bti138 -.21 bti171 .00
bti22 .35 bti60 -.38 bti105 .26 bti139 .14 bti172 .12
bti24 .21 bti61 .06 bti106 -.30 bti140 -.31 bti173 .20
bti25 .23 bti63 .25 bti108 -.39 bti141 -.35 bti174 -.38
bti26 -.38 bti65 .00 bti109 -.36 bti145 -.33 bti175 -.16
bti27 -.37 bti66 .24 bti115 .00 bti146 .16 bti176† -.26
bti28 -.10 bti67 -.23 bti116 .00 bti148 -.19 bti178 .00
bti29 -.02 bti69 -.10 bti117 -.16 bti149 .00 bti180 -.22
bti30 .10 bti70 -.36 bti118 .02 bti152 .13 bti181 -.20
bti35 .30 bti72 -.15 bti119 -.09 bti153 -.16 bti183 .23
bti38 .39 bti75 .10 bti120 .15 bti154 -.26 bti184 .19
bti185 -.14
bti188 .15
bti189 -.10
bti190 .00
bti191 -.08
bti193 .24
Note.
† = practically significant uniform DIF only;
†† = practically significant uniform and non-uniform DIF
Only bti171 demonstrated practically significant DIF by gender, on the Agreeableness scale (refer to Table 4.14). Similarly to the DIF contrasts obtained for the ethnic groups, the DIF contrast of bti171 is marginally above the |.40| cut-off and reflects uniform DIF. Because this DIF was below the recommended |.50| cut-off, the item was retained.
Table 4.14
Practically significant DIF by gender
Extraversion Neuroticism Conscientiousness Openness Agreeableness
Item Contrast Item Contrast Item Contrast Item Contrast Item Contrast
bti1 .27 bti40 -.13 bti81 -.03 bti124 -.08 bti156 .38
bti5 .33 bti41 .00 bti82 .00 bti125 -.25 bti158 .16
bti7 .21 bti42 -.12 bti84 -.16 bti126 -.06 bti159 .20
bti9 -.07 bti44 -.09 bti87 .20 bti127 .15 bti161 .23
bti10 -.28 bti46 .27 bti93 .12 bti129 -.02 bti162 .19
bti11 .00 bti47 .30 bti94 .30 bti130 .17 bti163 .00
bti12 -.03 bti49 -.09 bti95 .05 bti131 .12 bti164 -.02
bti14 .17 bti50 -.06 bti99 -.20 bti132 .00 bti165 .02
bti17 -.29 bti51 -.35 bti100 -.15 bti133 .18 bti166 .05
bti18 -.07 bti52 .12 bti101 -.26 bti134 -.02 bti167 .22
bti19 -.10 bti53 -.07 bti102 -.23 bti135 -.09 bti169 .13
bti20 -.16 bti54 .36 bti103 .23 bti137 -.11 bti170 .02
bti21 -.11 bti55 .31 bti104 .11 bti138 .14 bti171† -.41
bti22 -.21 bti60 .31 bti105 -.17 bti139 -.38 bti172 -.17
bti24 -.13 bti61 -.39 bti106 .00 bti140 .29 bti173 -.19
bti25 -.03 bti63 .00 bti108 .28 bti141 .14 bti174 -.25
bti26 .00 bti65 -.10 bti109 .23 bti145 -.39 bti175 .00
bti27 -.11 bti66 -.26 bti115 .00 bti146 -.07 bti176 .00
bti28 .00 bti67 .14 bti116 -.21 bti148 -.12 bti178 -.35
bti29 .10 bti69 .06 bti117 -.21 bti149 -.15 bti180 .05
bti30 .00 bti70 .05 bti118 -.02 bti152 .07 bti181 -.10
bti35 .11 bti72 -.17 bti119 .05 bti153 .00 bti183 .12
bti38 .23 bti75 .00 bti120 -.07 bti154 .25 bti184 .18
bti185 .02
bti188 -.20
bti189 -.20
bti190 -.15
bti191 -.38
bti193 -.07
Note.
† = practically significant uniform DIF only;
†† = practically significant uniform and non-uniform DIF
4.3.7. Cross-plotting person parameters of the full-test and the reduced test scales
An important consideration after flagged items are removed is whether the reduced test measures the same trait as the full-length test (i.e. whether persons' standings on the latent trait are equivalent across the full-length and reduced tests) (Linacre, 2010). Following the recommendations of Linacre (2010), cross-plots of person locations for the full and reduced scales of the BTI were constructed. The standard errors of the person location parameters (also referred to as theta and represented by θ) were used to construct 95% confidence intervals around the cross-plots. The cross-plots can be viewed below and include Figure 4.3.6a (Extraversion), Figure 4.3.6b (Neuroticism), Figure 4.3.6c (Conscientiousness), Figure 4.3.6d (Openness), and Figure 4.3.6e (Agreeableness).
Figure 4.3.6a. Cross Plot of Person Measures for the Full and Core Extraversion Scales
[Scatter plot: Person Measure (Extraversion Core Scale) on the x-axis against Person Measure (Extraversion Full Scale) on the y-axis]
Figure 4.3.6b. Cross Plot of Person Measures for the Full and Core Neuroticism Scales
[Scatter plot: Person Measure (Neuroticism Core Scale) on the x-axis against Person Measure (Neuroticism Full Scale) on the y-axis]
Figure 4.3.6c. Cross Plot of Person Measures for the Full and Core Conscientiousness Scales
[Scatter plot: Person Measure (Conscientiousness Core Scale) on the x-axis against Person Measure (Conscientiousness Full Scale) on the y-axis]
Figure 4.3.6d. Cross Plot of Person Measures for the Full and Core Openness Scales
[Scatter plot: Person Measure (Openness Core Scale) on the x-axis against Person Measure (Openness Full Scale) on the y-axis]
Figure 4.3.6e. Cross Plot of Person Measures for the Full and Core Agreeableness Scales
[Scatter plot: Person Measure (Agreeableness Core Scale) on the x-axis against Person Measure (Agreeableness Full Scale) on the y-axis]
It is evident from the cross-plots that person measures for the core scales are
generally equivalent to person measures for the full scales (i.e. very few cross-plotted
points fall outside the 95% confidence intervals).
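One common way to operationalise "falls outside the 95% confidence interval" in such cross-plots is to compare the difference between the two estimates against their combined standard error. The sketch below assumes that form, which may differ in detail from the exact construction used here (Linacre, 2010); the numbers passed in are purely illustrative.

```python
import math

def outside_joint_ci(theta_full, se_full, theta_core, se_core, z=1.96):
    """Flag a person whose full- and core-scale trait estimates differ
    by more than z times the combined standard error of the estimates."""
    combined_se = math.hypot(se_full, se_core)
    return abs(theta_full - theta_core) > z * combined_se

# A person well inside the band vs. one far outside it (illustrative):
inside = outside_joint_ci(1.2, 0.3, 1.0, 0.35)    # small difference
outside = outside_joint_ci(2.5, 0.2, 1.0, 0.2)    # large difference
```

Counting the proportion of persons flagged this way gives a simple numerical summary of the visual equivalence judgement made from the cross-plots.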
For Extraversion most persons with trait estimates between -1.00 and 2.00 logits
demonstrated approximately equivalent trait estimates across the full and core scales
within an acceptable standard error. However, some marginal separation of person
measures appears to occur beyond 2.50 logits partly because there are fewer persons to
stably estimate trait levels at this end of the trait continuum.
Neuroticism also evidenced approximately equivalent trait estimates across the full and reduced scales between -2.50 and 1.00 logits. The most marked deviation occurs below -3.00 logits, because almost no persons with trait levels in this range are available to accurately estimate person measures.
Conscientiousness also demonstrated approximately equivalent trait estimates
across the full and core scales between -1.00 and 3.50 logits with a marked deviation of
person measures for the full and core scales above 3.50 logits. Although a number of
persons are found to have trait estimates above 3.50 logits for Conscientiousness, fewer
items are available at this level of the trait to consistently and accurately estimate person
parameters which may be responsible for the deviation.
Both Openness and Agreeableness demonstrated good equivalence of person measures for the full and core scales between -1.00 and 2.50 logits. The deviation of person parameters above 2.50 logits for both scales may be due to the lack of persons available at these trait levels to accurately estimate person parameters.
In general, the cross-plots of person measures for the full and core scales indicate
good equivalence. Very few individuals fall outside the 95% confidence intervals and a
strong linear relationship is demonstrated between person measures for each scale.
4.4. Discussion
The primary objective of this study was to evaluate and select the best functioning items from each scale of the BTI for use within a computer adaptive framework. To do so, the scales of the BTI were fit to the Rasch rating scale model. Whereas previous research has used the Rasch model to evaluate test scales for classical test theory applications, this study evaluated the BTI scales with an item response theory model in preparation for a computer adaptive testing framework. Consequently, several Rasch diagnostic evaluations were conducted: fit statistics, to determine whether the items of the BTI scales fit the Rasch rating scale model; rating scale performance, to determine whether the rating scale structure of the BTI performed adequately so that the same structure can be used in computer adaptive applications; item separation and reliability, to determine whether each scale sufficiently separates test-takers across the trait continuum and does so in a consistent manner; and finally a DIF analysis by gender and ethnicity for each scale, to determine whether items perform invariantly across groups. The results of these analyses are discussed in the following sections, along with possible future improvements to the BTI for computer adaptive testing and the limitations of this study.
4.4.1. Rasch rating scale model fit
In general, the fit of the items for each scale of the BTI, without any flagged items removed, was satisfactory, although slight underfit was detected. On inspection of the item fit statistics for each scale, a number of items demonstrated aberrant outfit values. The focus was placed on infit, because sampling error may inflate outfit statistics and because in computer adaptive testing item endorsability/difficulty is generally matched closely to a person's trait or ability level, which makes infit more applicable. Numerous items that demonstrated poor infit were identified and flagged for removal. Upon removal of these items, the mean fit summary statistics for each scale of the BTI improved slightly. The greatest improvement was observed in the infit mean squares and the standard deviations of the mean square values, indicating that removal of the flagged items improved the precision of measurement for each scale of the BTI.
Inspection of the item fit statistics for each scale after the flagged items were removed indicated only a small number of possibly poor fitting items, and the infit and outfit mean squares for these remaining items were only marginally poor. These items were not removed, in order to conserve the number of items in the core item bank. Moreover, over-removal of items may result in the data overfitting the model, which may negatively affect the generalisability of the BTI scales and negatively impact item spread and reliability.
4.4.2. Item spread and reliability
After the flagged items were removed, the person and item reliability indices showed no marked deterioration. Although the initial reliabilities for the scales of the BTI were satisfactory, item removal may cause item spread and internal consistency reliability to deteriorate; with a number of poor fitting items removed, the reliabilities nevertheless remained robust.
Person and item spread demonstrated no marked deterioration for the core scales. There were, however, proportions of the sample that were not targeted effectively by certain scale items. For example, Openness and Neuroticism could benefit from more items at higher levels of the latent trait, whereas Conscientiousness could benefit from more items targeted at lower levels of the trait. In general, however, the items of the core scales demonstrated good spread, with most scales measuring between -3.00 and 3.00 logits.
4.4.3. Rating scale performance
Rating scale performance was satisfactory for most of the BTI scales, although some rating scale categories demonstrated underfit. The most salient was the Conscientiousness scale, where rating scale category one indicated a poor outfit mean square. Although it is recommended that poorly fitting rating scale categories be collapsed for better fit to the Rasch rating scale model, the number of persons who indicated a low level of the trait should also be considered when investigating rating scale fit indices. In the case of Conscientiousness, very few persons in the sample had a low level of the trait, which may have inflated the outfit statistics for the lower rating scale category. With the flagged items removed, no substantial deterioration of the rating scale fit indices was observed relative to the full test. Person and item separation indices also remained within the required ranges.
4.4.4. DIF by ethnicity and gender
Numerous items were identified that demonstrated uniform and non-uniform DIF by gender and ethnicity. These items were flagged for removal, as DIF is a serious limitation for computer adaptive testing. It is important to note that DIF is often considered practically significant only if it is statistically significant and ≥ |.50|. A stricter decision was made here to remove any items demonstrating DIF ≥ |.40|, which by most standards would be considered acceptable for inclusion in an item bank. Had only items with DIF ≥ |.50| been considered, only a few items from each scale would have been removed after the initial inspection. Only four items demonstrated practically significant uniform DIF by ethnicity for the core scales, and only one item demonstrated practically significant uniform DIF by gender. The DIF demonstrated by these items was below the recommended |.50| cut-off for statistically significant DIF contrasts. It was therefore decided to retain these items in order to preserve as many items as possible for the core item bank.
4.4.5. Conclusion
Overall, the scales of the BTI tend to fit the Rasch rating scale model sufficiently well. However, as computer adaptive tests administer fewer items in varying orders for each test-taker, and are thus much more reliant on the psychometric properties of individual items, the scales of the BTI were further refined for use as a computer adaptive test. Poor fitting items, and items that demonstrated DIF by either gender or ethnicity, were removed from each scale of the BTI so that a core set of items could be identified. This process improved the fit of the BTI to the Rasch rating scale model and ensured that most of the items of the BTI scales remain invariant across groups. The fit to the Rasch rating scale model was, however, not perfect, with a small number of items showing marginal underfit for the core scales and some items of the core scales indicating marginal DIF by ethnicity and gender.
Although most of the deviations from the Rasch rating scale model can be explained by possible sampling effects, the true performance of the retained items can only be evaluated through computer adaptive test simulation, where person estimates are compared for equivalence with those of the full non-adaptive test. Nevertheless, the encompassing analysis of the fit of the BTI scales to the Rasch rating scale model has allowed each scale to be improved for computer adaptive application. In this regard, the items retained after evaluation with the Rasch rating scale model improve the chances that the computer adaptive version of the test will function as optimally as possible. The evaluation also provides a benchmark for the psychometric properties of the BTI against which future analyses for computer adaptive test improvement can be compared.
There are, however, some limitations to this study that need to be addressed. Firstly, only the Black and White ethnic groups were compared for DIF by ethnicity. Unfortunately, the sample used for the study did not contain sufficient persons for a comprehensive DIF analysis of the Coloured and Indian ethnic groups. DIF analyses for these groups need to form part of a future study on the BTI item bank so that items indicating DIF for these groups can be identified and removed. Secondly, item spread and item/person targeting were sufficient but not ideal for computer adaptive item banking. Numerous items in the Openness, Neuroticism and Conscientiousness scales will need to be written to target persons at certain levels of the trait more effectively. Finally, item banks for computer adaptive testing are usually quite large, with sufficient items generated for parallel forms to be used in computer adaptive applications. Such parallel item banks are necessary to manage item exposure, both for data collection purposes and so that items are not overused within a computer adaptive framework. The number of items for each scale of the BTI may need to be increased at some stage so that alternate-form computer adaptive testing becomes possible. It is therefore strongly recommended that more items be written for each scale of the BTI to ensure proper item exposure for each core item bank when the test is used in practice.
In summary, the BTI core scales appear to have satisfactory Rasch rating scale model fit for use within a computer adaptive testing framework. It is, however, recommended that fit to the Rasch rating scale model be replicated with other samples so that consistently poor functioning items can be removed.
4.5. Overview of the current chapter and preview of the forthcoming chapter
This chapter investigated the fit of each of the BTI scales to the Rasch rating scale
model so that only the best performing items could be selected for use within a computer
adaptive testing framework. Numerous items (59 items) were flagged as possibly
problematic and were consequently removed from the scales of the BTI. In this way only
the best functioning items were retained for use within a computer adaptive testing
framework for each scale of the BTI. The retained items are referred to as the core item
bank. The core item bank will be used in a computer adaptive testing framework to
estimate person parameters in a simulated manner for each scale of the BTI. These person
parameters will then be compared to those of the non-adaptive version of the core item
bank in order to determine whether the adaptive version of the test is equivalent to its
non-adaptive counterpart. The adaptive core item bank will then be compared to both the
adaptive and non-adaptive versions of the full test in order to determine whether the
efficiency gained by the core item bank is accompanied by equivalence to the original
non-adaptive full-form test for each BTI scale.
CHAPTER 5: AN EVALUATION OF THE COMPUTER ADAPTIVE BTI
“In principle, tests have always been constructed to meet the requirements of test givers and
the expected performance-levels of the test candidates…giving a test that is much too easy for
candidates is likely to be a waste of time…on the other hand, questions that are much too
hard, also produce generally uninformative test results…” Michael Linacre (2000, p. 4).
5.1. Introduction
The purpose of this study is to simulate the Basic Traits Inventory as a computer
adaptive test for psychometric comparison with its non-computer adaptive counterpart.
The objectives of this study are thus two-fold: 1) to simulate the scales of the BTI as
computer adaptive tests, and 2) to compare the performance and functioning of the
computer adaptive BTI scales to their non-adaptive paper and pencil versions. The
primary purpose of this comparison is to determine whether the computer adaptive test
versions of the BTI scales are metrically equivalent to the non-adaptive paper and pencil
versions of the BTI scales. Metric equivalence between the computer adaptive test scales
and their non-adaptive paper and pencil counterparts is an important step in the
development of a computer adaptive test as it evaluates whether the computer adaptive
test estimates person parameters – the individual test-takers’ standing on the latent trait –
in a similar manner to the non-adaptive test. As the BTI non-adaptive test has been widely
evaluated for validity, reliability and test-fairness in the literature (Grobler, 2014; Metzer
et al., 2014; Taylor & de Bruin, 2006, 2012, 2013; Taylor, 2008) it can be used as a ‘gold
standard’ for the evaluation of the BTI computer adaptive test-scales.
Additionally, computer adaptive test simulation can also be used to evaluate the
performance of the computer adaptive BTI test-scales (i.e., item usage, efficiency, and
measurement precision) for direct comparison with the non-adaptive BTI scales.
This chapter begins with an overview of computer adaptive test simulation and
of the computer adaptive algorithms implemented in the computer adaptive testing
process. The BTI computer adaptive test-scales are then compared to their non-adaptive
counterparts through computer adaptive test simulation. The chapter concludes by
evaluating the feasibility of implementing a computer adaptive version of the BTI and
discussing the implications for adaptive testing in industry.
5.1.1. Computer adaptive test simulation
Computer adaptive test simulation is a process by which items from a non-adaptive
full-form test – the original paper and pencil version of the test – with associated
respondent data, are scored in a simulated adaptive manner using a computer adaptive
testing framework (Smits et al., 2011). The computer adaptive framework has pre-specified
algorithms that select and administer each item, based on its pre-calibrated item location
estimate on the latent construct of interest, for simulated administration to the sample of
test-takers (Choi, Reise, Pilkonis, Hays, & Cella, 2010b).
Items are in fact not administered in ‘real time’ in the computer adaptive version
of the BTI scales, but are evaluated as if they were adaptive (Choi et al., 2010b). Since all
the responses to each item have already been made in the non-adaptive version of the
test, these responses are used to artificially select items for adaptive administration in the
simulated computer adaptive versions of the BTI scales (Lai et al., 2003).
Therefore, the simulated computer adaptive versions of the BTI scales select items
based on the prior responses of test-takers on the non-adaptive test and also estimate test-
takers’ standing on the latent trait independently from the non-adaptive versions of the
test scales (Dodd, de Ayala, & Koch, 1995). The computer adaptive test scales also
estimate person location parameters with a pre-specified error of estimation, which
gives an indication of the precision of measurement (Weiss, 2004). This adaptive testing
process additionally allows real-world simulation of a scale as a computer adaptive test,
which in turn allows for direct comparisons between the adaptive and non-adaptive BTI
scales (Reise et al., 2010).
Computer adaptive simulation also establishes construct validity of the computer
adaptive test and allows numerous additional feasibility evaluations to take place (Dodd
et al., 1995; Walter et al., 2007). For example, the number of items used to obtain person
location estimates can be contrasted with the number used by the non-computer adaptive
test-scale versions (Wang & Shin, 2010; Ware et al., 2000). The
equivalence between the person parameter estimates estimated by the computer adaptive
test and the non-computer adaptive test can also be established (Haley et al., 2006).
Finally, the item exposure rates and the item information functions of the computer
adaptive test-scales can be evaluated in relation to their accuracy – the correlation of
person estimates between the computer adaptive and non-computer adaptive versions of
the test – and their precision as measured through the standard error of the person
parameter estimates (Haley et al., 2006).
These evaluations are pivotal for the development of a computer adaptive test as
they demonstrate whether the results obtained using an item-bank in a computer adaptive
manner can be as accurate, precise and efficient as the results obtained from non-
computer adaptive test scales (Wang & Shin, 2010).
Before comparisons are made between the computer adaptive BTI scales and their
non-adaptive counterparts, a brief overview of computer adaptive testing and item banks
is given in the next section, together with further elaboration on the computer adaptive
test simulation procedures and some criteria for the evaluation of a computer adaptive
test.
5.1.2. Item banks used in computer adaptive testing
All computer adaptive tests make use of an item bank from which the computer
algorithms select items for administration (Bjorner, Kosinski, & Ware, 2005; Eggen,
2012). All the items in an item bank have difficulty/endorsability estimates – also referred
to as item location parameters/estimates – which have been psychometrically calibrated
through the application of an item response theory model to the data (Linacre, 2000;
Weiss, 2004). Although an item bank is a conglomeration of items, each cluster of items
has a well-defined construct to which the items are attached (Gu & Reckase, 2007).
Therefore, an item bank can be defined as a set of psychometrically calibrated items,
each measuring a single latent construct, such that the items together measure across the
continuum of a single common dimension and each item can be administered
independently based on the responses of test-takers (Gu & Reckase, 2007; Lai et al.,
2003; Weiss, 2004).
Therefore, each item in an item bank needs to measure a unidimensional construct
across the construct continuum – from lower ability/trait levels to higher ability/trait
levels – and be capable of doing so in an invariant manner (Haley et al., 2006). Most
importantly, each item must have an associated item location estimate that has been
established with a calibration sample, usually through an item response theory model
(Thompson & Weiss, 2011). These item location parameters give an indication of each
item's standing on the latent construct under investigation based on a particular
calibration sample (Gershon, 2004). However, developing item parameters
is not limited to item response theory models and the reader is referred to the use of
categorical confirmatory factor analytic procedures that can be used to construct item
parameters (cf. Maydeu-Olivares, 2001; Muthén, 1984).
The generation of item parameters presupposes that each item bank must go
through an item-calibration stage where the item-data are fit to an item response theory
model, or categorical confirmatory factor analytic model. During the item calibration
stage items must be shown to (1) measure a single latent construct only; (2) have a good
spread of measurement across the latent construct continuum and (3) have item
parameters which remain invariant across different groups of test-takers (Gu & Reckase,
2007). Any item that does not meet these requirements must be eliminated from the item
bank before computer adaptive testing can commence. This is because items that do not
meet these requirements will estimate person location parameters in a biased manner
without the necessary precision or accuracy required for stable latent construct estimation
(Gu & Reckase, 2007).
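For illustration only, a calibrated item bank of the kind described above can be represented as a collection of items, each attached to a single construct and carrying its own location estimate from a prior calibration stage. The item identifiers and location values below are hypothetical and are not actual BTI parameters:

```python
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str     # hypothetical identifier, not an actual BTI item
    location: float  # calibrated item location estimate, in logits
    scale: str       # the single latent construct the item measures

# A toy item bank for one unidimensional construct; the locations span
# the construct continuum from lower to higher trait levels.
bank = [
    Item("N01", -1.20, "Neuroticism"),
    Item("N02", -0.35, "Neuroticism"),
    Item("N03",  0.40, "Neuroticism"),
    Item("N04",  1.10, "Neuroticism"),
]

# Because every item carries its own location estimate, items can be
# administered independently, e.g. retrieving those whose locations lie
# near a provisional person estimate of 0.5 logits:
near = [it.item_id for it in bank if abs(it.location - 0.5) < 0.5]
print(near)  # ['N03']
```

This representation makes the three calibration requirements concrete: each item names one construct, carries one invariant location, and is selectable on its own.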
The primary reasons why item banks must meet these criteria relates to the
processes followed in computer adaptive testing. These processes are discussed in the
next section.
5.1.3. Computer adaptive testing
Computer adaptive testing is a process by which items are selected for
administration from an item bank through the use of a computer algorithm so that the
information garnered from test-takers' responses is optimised in order to precisely and
accurately estimate the latent trait – the person location parameters – of test-takers (Lai
et al., 2003). In other words, each test-taker is administered a pre-specified
item, or items, to which the test-takers respond (Gershon, 2004). Each response allows an
interim person location estimate – known as the person parameter estimate – to be
generated of the test-taker’s relative standing (location) on the latent construct within
approximate standard error (Veldkamp, 2003). The items of an adaptive scale are
administered so that the person location parameters are estimated with the lowest
possible standard error and the highest possible item information function (Choi
et al., 2010a).
Thus, each subsequent item is selected to minimise the standard error of the person
parameter estimate and to maximise the information regarding the test-taker's standing on
the latent construct through the use of various item selection criteria and person parameter
estimation methods (Eggen, 2012; Hol et al., 2005; van der Linden & Pashley, 2000). The
number of items selected for administration in the adaptive scale will therefore be
dependent on how quickly a person’s location on the latent construct can be attained
within an acceptable standard error (Choi et al., 2010a).
In most cases the test-taker’s estimated standing on the latent trait will become
increasingly precise with each successive item administered, which is due to a decreased
standard error of the person parameter estimates and thus improvement in the precision
of person location estimation (van der Linden & Pashley, 2000). Usually, computer
adaptive tests will stop administering items when the standard error of the person location
parameter estimates reaches a certain minimum or acceptable level and the test
information function is maximised (Choi et al., 2010a).
The number of items administered, as well as the precision of measurement, is
greatly dependent on the person and item parameter locations relative to one another (Choi,
2009). The better the match between these two elements (which is continually updated
within the adaptive process), the more quickly and precisely the person parameters can be
estimated. To maximise the precision with which person parameters are estimated, and to
reduce the number of items required to estimate these parameters with a specified
accuracy, optimal item selection rules must be applied. The next section discusses some
of the item selection rules that can be applied to maximise the precision of person
parameter estimation while also maximising the efficiency of the items administered.
5.1.3.1. Item selection rules in computer adaptive testing
It is important to realise that administration of particular items is variable and
based on the prior responses of each test-taker to each administered item. Hol et al. (2005)
explain that the process of item selection usually happens using the variable step size
method where each response to an item provides an associated interim person parameter
estimate, which informs item selection for the next administered item. This method selects
an item whose location estimate is approximately equal to the interim person location
parameter estimated from the administration of the previous item (Choi
et al., 2010a; Hobson, 2015). In this way no two administrations of a computer adaptive
test are exactly the same for all test-takers in that each item selected for administration
from the item bank is based wholly on the person responses to each prior item
administered (Weiss, 2004). The core aim of a computer adaptive test is therefore to
match person location estimates with item location estimates in order to obtain the most
precise final person location estimate (Hobson, 2015). This process is relatively universal
for most computer adaptive tests although the algorithms employed do differ regarding
the estimators and item selection criteria used to estimate person location parameters.
Additionally, items of each scale can be administered interchangeably as long as each
interim person parameter is linked to a specified item bank, or scale.
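The variable step size rule described above can be sketched as follows. This is a minimal illustration rather than the algorithm of any particular testing package: after each response, the next item administered is the unadministered item whose calibrated location lies closest to the interim person estimate.

```python
def select_next_item(interim_theta, item_locations, administered):
    """Variable step size rule: choose the unadministered item whose
    calibrated location is closest to the interim person estimate."""
    candidates = [i for i in range(len(item_locations))
                  if i not in administered]
    return min(candidates,
               key=lambda i: abs(item_locations[i] - interim_theta))

# Five hypothetical item locations on the latent continuum (in logits):
locations = [-2.0, -1.0, 0.0, 1.0, 2.0]

# With an interim estimate of 0.3 and the middle item already used, the
# item at 1.0 logits (index 3) is the closest remaining match:
print(select_next_item(0.3, locations, administered={2}))  # 3
```

Because the administered set differs across test-takers, the same rule produces a different item sequence for each response pattern, which is why no two administrations need be identical.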
There are numerous algorithms available for computer adaptive tests which govern
item selection rules and person parameter estimation methods (cf. Choi, 2009; Gu &
Reckase, 2007; van der Linden & Pashley, 2000). However, the most popular criteria for
item selection in computer adaptive testing are the maximum information criterion
developed by Brown and Weiss (1977) and the maximum posterior precision criterion
developed by Owen (1975) using a maximum likelihood estimator (Gu & Reckase, 2007).
In general, the maximum information criterion – also referred to as the maximum Fisher
information criterion – selects the item that maximises the information function at the
current person location estimate for a test-taker (Gu & Reckase,
2007). The maximum information criterion thus allows for ever-increasing precision of
person parameter estimation with the successive administration of each item (Gu &
Reckase, 2007).
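Under the Rasch rating scale model used in this study, an item's Fisher information at a given trait level equals the conditional variance of the item score. The sketch below illustrates the maximum information criterion on that basis; the item locations and threshold values are hypothetical, not calibrated BTI parameters.

```python
import math

def rsm_probs(theta, delta, taus):
    """Category probabilities under the Rasch rating scale model, with
    item location delta and shared category thresholds tau_1..tau_m."""
    cum, numerators = 0.0, [1.0]          # category 0 contributes exp(0)
    for tau in taus:
        cum += theta - delta - tau
        numerators.append(math.exp(cum))
    total = sum(numerators)
    return [n / total for n in numerators]

def item_information(theta, delta, taus):
    """Fisher information for a Rasch-family polytomous item:
    Var(X | theta) = E[X^2] - (E[X])^2 over the category scores."""
    probs = rsm_probs(theta, delta, taus)
    ex = sum(k * p for k, p in enumerate(probs))
    ex2 = sum(k * k * p for k, p in enumerate(probs))
    return ex2 - ex * ex

def max_info_item(theta, deltas, taus, administered):
    """Maximum information criterion: administer the unadministered item
    that is most informative at the current interim person estimate."""
    candidates = [i for i in range(len(deltas)) if i not in administered]
    return max(candidates,
               key=lambda i: item_information(theta, deltas[i], taus))

# With symmetric thresholds, the most informative item is the one whose
# location lies nearest the interim estimate of 0.0:
print(max_info_item(0.0, [-1.5, 0.1, 2.0], [-1.0, 1.0], set()))  # 1
```

With symmetric thresholds the information function peaks where the item location matches the person estimate, so this criterion and the variable step size rule often select the same item.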
The maximum posterior precision criterion is similar to the maximum information
criterion except that the item selection process focuses on minimising the posterior
variance of the person location estimator (van der Linden, 1998). There are other item
selection and estimation criteria such as the maximum global-information criterion; the
likelihood-weighted information criterion; the maximum expected posterior weighted
information criterion; the maximum interval information criterion; the posterior expected
Kullback-Leibler information criterion; and the Bayesian collateral information criterion
to name a few (Choi & Swartz, 2009; Gu & Reckase, 2007; van der Linden, 1998; van
der Linden & Pashley, 2000; Veldkamp, 2003). The purpose of this study is not to give
an overview of the criteria or estimators used for item selection and person location
estimation. For a more in-depth discussion of the criteria for item selection and the
estimators used in computer adaptive tests refer to Gu and Reckase (2007); van der
Linden (1998); and van der Linden and Pashley (2000).
It is important to note, however, that Choi and Swartz (2010) found no substantial
difference in the person estimates or in the efficiency of item selection when more
advanced and complex item selection and person parameter estimation techniques were
used with polytomous items. Similarly, van der Linden and Pashley (2000) also demonstrated that
when more than ten items are administered in computer adaptive testing with
dichotomous items, the person parameter estimates and item efficiency are similar for
most Bayesian item selection criteria. These criteria include the maximum expected
posterior weighted information; the maximum expected information; and the minimum
expected posterior variance.
However, Penfield (2006) demonstrated that the simpler maximum posterior
weighted information criterion outperformed the popular maximum Fisher information
criterion and performed similarly to more advanced Bayesian methods such as the
maximum expected information criterion. Choi and Swartz (2009) also demonstrated that
the maximum posterior weighted information is an accurate and precise item selection
criterion for polytomous items when compared to the more advanced and complex
Bayesian criteria. The maximum posterior weighted information criterion is also utilised
extensively with computer adaptive tests that use polytomous item responses and in which
there may be uncertainty about the initial precision of person parameter estimates (Choi
et al., 2010b). This criterion is often used with two-stage branching, where an initial
item (or set of items) in the item bank is administered to all test-takers, after which the
remainder of the items are administered adaptively (Choi et al., 2010b).
Such a two-stage branching process allows a single item – or a number of items –
with the highest information function to be used for more accurate initial person
location estimation, which is advantageous when the most precise and accurate person
location estimates are desired (Choi et al., 2010b).
For these reasons the maximum posterior weighted information criterion may be
the criterion best suited to polytomous personality or attitudinal inventory items. The
posterior weighted information criterion was therefore used in this study, as the BTI
items have these characteristics and the method performs well for polytomous items.
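The posterior weighted information idea can be sketched on a grid as follows. For brevity this illustration uses the dichotomous Rasch information function p(1 − p); the polytomous analogue substitutes the category-score variance, and all item locations and grid values here are hypothetical. The key contrast with the maximum Fisher information criterion is that each candidate item's information function is averaged over the current posterior for theta, rather than being evaluated at a single point estimate.

```python
import math

def rasch_information(theta, delta):
    """Dichotomous Rasch item information: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(theta - delta)))
    return p * (1.0 - p)

def posterior_weighted_information(delta, posterior):
    """Item information averaged over a (theta, weight) posterior grid,
    normalised so the weights sum to one."""
    total = sum(w for _, w in posterior)
    return sum(w * rasch_information(t, delta)
               for t, w in posterior) / total

def select_by_pwi(deltas, administered, posterior):
    """Pick the unadministered item with maximum posterior weighted
    information rather than maximum point-estimate information."""
    candidates = [i for i in range(len(deltas)) if i not in administered]
    return max(candidates,
               key=lambda i: posterior_weighted_information(deltas[i],
                                                            posterior))

# With a flat posterior over theta in [-2, 2], the mid-range item wins
# because its information mass lies inside the plausible theta region:
grid = [(t / 4.0, 1.0) for t in range(-8, 9)]
print(select_by_pwi([-2.5, 0.0, 2.5], set(), grid))  # 1
```

Weighting by the posterior makes early item selection robust to an imprecise initial person estimate, which is the situation two-stage branching is designed to handle.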
In summary therefore, computer adaptive tests select items from an item bank for
administration based on the preliminary person parameter estimation of individual test-
takers through the use of specified item-selection criteria. Where computer adaptive
tests do differ substantially from one another, however, is in their respective stopping
rules (Babcock & Weiss, 2009).
5.1.3.2. Stopping rules in computer adaptive testing
A stopping rule is an algorithm that instructs the computer adaptive test to stop
administering additional items to a test-taker once a certain precision of measurement –
or some other in-test criteria – has been attained (Babcock & Weiss, 2009). Stopping rules
are also referred to as termination criteria (Babcock & Weiss, 2009; Choi et al., 2010a).
Usually, stopping rules in variable-length computer adaptive tests are triggered when the
standard error of the person parameter estimate reaches an acceptable level, or when the
remaining unadministered items in the item bank add very little information beyond the
items already administered (Choi et al., 2010a). Alternatively, all the items in an item
bank can be used in an adaptive manner which is usually the case in fixed-length computer
adaptive tests (Babcock & Weiss, 2009).
The most commonly used stopping rule for computer adaptive tests is the standard
error criterion (Choi et al., 2010a; Hogan, 2014). This method instructs the test to stop
administering items when the person parameter estimate of a particular test-taker has been
estimated within a minimally acceptable standard error (Zhou, 2012), or when
successively administered items add little cumulative information – or yield minimal
improvement in the standard error of the parameter estimates – beyond the previously
administered items (Babcock & Weiss, 2009). The major advantage of the standard error
stopping rule is that each person's parameter must be estimated with a certain precision
before the adaptive test-scale stops administering items. The standard error stopping rule
thus tries to ensure that an acceptable precision of measurement is achieved for each
test-taker.
Although the standard error criterion is popular, it may undermine the efficiency
of computer adaptive tests because not all person parameter estimates may meet the pre-
specified standard error no matter how many items are administered (Choi et al., 2010b).
This is most prevalent for persons whose standing on the latent trait is not well
targeted by the item location parameters of the items in the item bank (Segall, 2005; Segall
& Moreno, 1999). In such cases, all the items in the item bank are often used to estimate
person parameters while these parameters are often not estimated with an acceptable level
of precision. A way to remedy this lack of precision is to write more items that improve
the spread of the item parameters across the construct continuum (Ortner, 2008).
Therefore, the effectiveness of the standard error technique is bolstered or hampered by
the quality and spread of the items in the item bank and the degree to which such items
are targeted, or matched, to person trait levels (Ortner, 2008).
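The standard error stopping rule, together with a length cap that guards against poorly targeted test-takers for whom the target may never be reached, can be sketched as follows. The 0.35-logit target and the 20-item cap are illustrative values only, not the criteria used in this study.

```python
import math

def standard_error(item_informations):
    """SE of the interim person estimate: the reciprocal square root of
    the test information summed over the items administered so far."""
    return 1.0 / math.sqrt(sum(item_informations))

def should_stop(item_informations, se_target=0.35, max_items=20):
    """Stop once the target precision is reached, or once the length cap
    is hit (guarding against test-takers whose trait levels are not well
    targeted by the item bank, for whom the SE target may be
    unreachable)."""
    if len(item_informations) >= max_items:
        return True
    return standard_error(item_informations) <= se_target

# Five items each contributing ~0.4 units of information give
# SE = 1 / sqrt(2.0), roughly 0.71 logits, so testing continues:
print(should_stop([0.4] * 5))   # False

# At the 20-item cap the rule terminates regardless of precision:
print(should_stop([0.4] * 20))  # True
```

The sketch makes the limitation discussed above visible: if the administered items each carry little information at the person's trait level, the summed information grows slowly and the length cap, not the precision target, ends the test.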
5.1.3.3. Evaluation of computer adaptive test performance
It is evident from the characteristics and processes employed in computer adaptive
testing that the technique diverges greatly from the standard fixed-length non-adaptive
testing process (Thompson & Way, 2007). Because of this, item banks need to adhere to the
requirements set out in section 5.1.2 so that person parameter estimation is as error free
and unbiased as possible. It is also important that computer adaptive tests are compared
to their non-computer adaptive counterparts to determine whether such tests are able to
estimate the standing of persons on the latent trait in an equivalent manner to the non-
computer adaptive tests (Hol et al., 2008). Since the non-adaptive versions of computer
adaptive tests are often well researched with good psychometric properties, comparing
the person parameter estimates of the computer adaptive test-scales to the estimates
procured from the non-adaptive test helps to establish the validity of the computer
adaptive test-scales (Hol et al., 2008). Therefore, the person location estimates of the non-
adaptive test scales provide a benchmark, or ‘gold standard’, for determining whether the
computer adaptive test can attain the same, or similar, person parameter estimates with
greater item efficiency (Vispoel, Rocklin, & Wang, 1994). This is especially important
when applying preselected item selection criteria and stopping rules so that the
effectiveness of these algorithms can be determined.
What is also important is the comparison of a computer adaptive optimised item
bank (a computer adaptive test with fewer, optimised, items than the full-form test) to the
computer adaptive full un-optimised item bank. The optimised item bank refers to items
that have been evaluated for dimensionality, and fit to an item response theory model
(refer to Chapter 3 and Chapter 4). More importantly, an optimised item bank is one where
the criterion of invariant measurement is met so that any items that demonstrate
differential item functioning across trait levels for certain groups of people (i.e., groups
of different ethnicity, gender, first language etc.) are eliminated from the test (refer to
Chapter 4). It is important to compare the computer adaptive functioning of the optimised
item bank to both the non-computer adaptive full form test and the un-optimised full form
computer adaptive item bank so that its performance and efficiency can be established
within a computer adaptive framework (Choi et al., 2010b). The reason why the optimised
and un-optimised adaptive and non-adaptive test scales are compared to one another is to
determine whether the removal of poorly fitting items – items that demonstrate differential
item functioning or poor fit to the Rasch rating scale model – improves or degrades the
precision and accuracy of person location estimation. This precision and accuracy is
usually inferred from the degree to which the person parameter estimates of the adaptive
tests deviate from the person parameters obtained in the non-adaptive full-form test
which, as mentioned earlier, acts as a benchmark
(Thompson & Weiss, 2011). Computer adaptive tests that make use of an un-optimised
item bank often estimate person location parameters with less precision and accuracy than
those with an optimised item-bank (Veldkamp & van der Linden, 2000). This is because
computer adaptive tests are notoriously sensitive to poor functioning items (i.e.,
multidimensionality and differential item functioning) as person location estimation is
based on interim trait location estimates which in turn inform subsequent item selection
in the computer adaptive process (Weiss, 2011; Wise & Kingsbury, 2000). Consequently,
a single poor item can result in imprecise and inaccurate interim person parameter
estimates, which may in turn lead to the selection and administration of items that do not
closely match the test-taker's true trait level, thus increasing the standard error of the
interim and final person trait estimates (Wise & Kingsbury, 2000). Therefore, testing the
dimensionality and calibrating items using an item response theory model are important
steps that need to be taken before an item bank can be used in a computer adaptive manner
(Linacre, 2000).
Using the most popular criteria, or algorithms, mentioned earlier, a computer
adaptive test also has to be evaluated in terms of its precision, accuracy and efficiency
when compared to its non-computer adaptive counterpart – this also includes the
evaluation and comparison of the adaptive un-optimised item bank. A way to evaluate
how a computer adaptive test with an optimised item bank compares to its non-computer
adaptive and un-optimised computer adaptive counterpart is to simulate the process with
real respondent data and compare the results of such computer adaptive test simulation
across the different test forms (Choi et al., 2010a). This process is discussed in the next
section.
5.1.3.4. Simulating a computer adaptive test
Referring to this process as the ‘simulation’ of a computer adaptive test with real
respondent data is somewhat of a misnomer, because the process is almost identical to
the real-world administration of a computer adaptive test. The only difference between
computer adaptive test simulation and real-world computer adaptive testing is that
responses in a simulated computer adaptive test are taken from the full-form non-adaptive
test data (after all the responses have been garnered for all the items on the full fixed-form
test) and run within the computer adaptive framework, whereas in real-time computer
adaptive testing responses to the items happen in real time as the test-takers are tested
(Hart et al., 2006;
Smits et al., 2011). In other words, a real-world computer adaptive test will administer
items to a respondent and select the next item based on the respondents’ responses in real-
time (Fliege et al., 2005). On the other hand, in computer adaptive test simulation test-
takers complete the non-computer adaptive test and their responses are used post hoc to
simulate the computer adaptive test as if its items were being used adaptively in real-time
(Hogan, 2014). There are other simulation techniques such as generating a simulated
number of respondents with varying trait levels and then using these simulated person
parameter estimates in a computer adaptive testing framework (Fliege et al., 2005; Walter
et al., 2007). However, simulating test-taker responses, and their standing on the latent
trait, is a technique further removed from real-time testing; simulating computer adaptive
tests with real respondent data is therefore considered a closer approximation to real-time
computer adaptive testing, which is why the latter technique was used in this study
(Wang, Bo-Pan, & Harris, 1999).
The two processes, computer adaptive test simulation with real respondent data
and real-world computer adaptive testing, differ only in respect of the test-takers'
probable psychological reaction when given items that are targeted to their pre-estimated
trait level in a real-world setting (Ortner & Caspers, 2011). This is because the fixed form
non-adaptive test gives each participant the same items in a sequential order whereas the
computer adaptive test selects the most applicable items for the test-taker from prior
responses to the items administered (i.e., items are selected based on the interim person
parameter estimates estimated with the administration of previous items). Thus, computer
adaptive test simulation uses the responses to items which have been garnered from a
non-computer adaptive fixed form test which is then applied to a computer adaptive
testing framework with its associated algorithms (Hart et al., 2006).
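Putting these pieces together, a post hoc simulation for one test-taker can be sketched as below. This is an illustrative reconstruction, not the actual simulation software used in this study: the item locations, thresholds and recorded responses are hypothetical, items are 'administered' by looking up the responses the test-taker already gave on the fixed-form test, and the person estimate is a grid-based expected a posteriori (EAP) value under a standard normal prior.

```python
import math

def rsm_probs(theta, delta, taus):
    """Category probabilities under the Rasch rating scale model."""
    cum, numerators = 0.0, [1.0]
    for tau in taus:
        cum += theta - delta - tau
        numerators.append(math.exp(cum))
    total = sum(numerators)
    return [n / total for n in numerators]

def eap_estimate(responses, deltas, taus):
    """Grid-based EAP person estimate with a N(0, 1) prior; returns the
    posterior mean and its standard deviation (the SE of the estimate)."""
    grid = [g / 10.0 for g in range(-40, 41)]      # -4.0 .. 4.0 logits
    posterior = []
    for theta in grid:
        like = math.exp(-0.5 * theta * theta)      # prior density kernel
        for i, x in responses:
            like *= rsm_probs(theta, deltas[i], taus)[x]
        posterior.append(like)
    total = sum(posterior)
    mean = sum(t * p for t, p in zip(grid, posterior)) / total
    var = sum((t - mean) ** 2 * p
              for t, p in zip(grid, posterior)) / total
    return mean, math.sqrt(var)

def simulate_cat(recorded, deltas, taus, se_target=0.4):
    """Post hoc CAT for one test-taker: select the closest-located item,
    look up the recorded fixed-form response, re-estimate theta, and stop
    once the SE target is met or the item bank is exhausted."""
    administered, theta, se = [], 0.0, float("inf")
    while len(administered) < len(deltas) and se > se_target:
        used = {i for i, _ in administered}
        nxt = min((i for i in range(len(deltas)) if i not in used),
                  key=lambda i: abs(deltas[i] - theta))
        administered.append((nxt, recorded[nxt]))  # recorded response
        theta, se = eap_estimate(administered, deltas, taus)
    return theta, se, [i for i, _ in administered]

deltas = [-1.5, -0.5, 0.0, 0.5, 1.5]   # hypothetical item locations
taus = [-1.0, 1.0]                     # three response categories: 0, 1, 2
recorded = [2, 2, 1, 1, 0]             # one respondent's fixed-form answers
theta, se, order = simulate_cat(recorded, deltas, taus)
print(order[0])  # the item at 0.0 logits is closest to the start value: 2
```

Because every fixed-form response is already on record, the same respondent can be re-scored under different selection or stopping rules without any retesting, which is precisely what makes post hoc simulation useful for the comparisons that follow.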
Consequently, the researcher has no information on how the test-takers may have
reacted psychologically to the mode of testing and whether this has an impact on the trait
estimates attained through computer adaptive testing. This is because the test-takers do
not actually complete the test in an adaptive manner, only their responses are used from
the fixed form non-adaptive version of the test.
The literature does indicate that test-mode differences can affect results. For
example, Ortner and Caspers (2011) found significant differences between the ability
scores of test-takers with higher anxiety levels and those with lower anxiety levels when
administered a computer adaptive test of ability. Hart et al. (2006) also admit
that the mode of testing may play a role in the responses test-takers give to items
administered in a real-world computer adaptive test. Therefore, the only limitation of
using a simulated computer adaptive test with real post hoc respondent data to evaluate a
computer adaptive test is that the psychological impact of the mode of testing is not
evaluated.
However, computer adaptive test simulation is still very useful for the comparison
of computer adaptive test-scales with optimised item banks to their non-computer
adaptive test-scale counterparts. Additionally, the computer adaptive test with an un-
optimised item-bank can also be compared to its optimised counterpart to determine how
the item optimisation process has altered measurement.
Another mitigating factor regarding psychological test-mode differences is that
these differences may be less pronounced with attitudinal measures, such as personality
tests, where the test-taker is responding to questions about their own character and
behaviour. This contrasts with computer adaptive ability testing, where test anxiety is
far more pronounced and where the effects of the mode of testing may be amplified
(Ortner & Caspers, 2011).
Notwithstanding the effect of the mode of testing, computer adaptive test
simulation remains a very useful technique for the evaluation and comparison – to the
fixed form test – of the psychometric properties of a computer adaptive test. Some of the
comparisons that can be made using such a technique include: (a) to what degree the
computer adaptive test scales with optimised item banks recover the trait estimates of the
non-computer adaptive test-scales; (b) how efficient, in terms of item usage, the computer
adaptive test with an optimised item bank is when compared to its non-computer adaptive
counterpart; and (c) whether the administration of items in a computer adaptive manner
estimates respondents’ standings on the latent trait equivalently to the fixed form test and
whether it does so with rigorous precision.
Against this background, the current study aims to evaluate the computer adaptive
version of each scale of the BTI with their non-computer adaptive counterparts as well as
the full-form un-optimised computer adaptive versions. Therefore, this study compared
the psychometric properties and person parameter estimates of four different test-forms
of the scales of the BTI, namely: (1) the non-adaptive optimised item banks calibrated in
Chapter 4 (referred to as the non-adaptive core test-scale versions); (2) the computer
adaptive optimised item banks (referred to as the adaptive core test-scale versions); (3)
the non-computer adaptive full form item banks (referred to as the non-adaptive full test-
scale versions); and (4) the computer adaptive un-optimised full form item banks (referred
to as the adaptive full test-scale versions). These four test versions of the BTI were
investigated for each scale of the inventory, namely Extraversion, Neuroticism,
Conscientiousness, Openness, and Agreeableness.
The primary objectives and stages of investigation of this study were therefore to
(1) compare the non-adaptive core test-scales of the BTI with the non-adaptive full test-
scales; (2) compare the adaptive core test-scales with the non-adaptive core test-scales;
(3) compare the adaptive full test-scales with the non-adaptive full test-scales; (4)
compare the adaptive full test-scales with the non-adaptive core test-scales; (5) compare
the adaptive core test-scales with the non-adaptive full test-scales; and (6) compare the
adaptive full test-scales with the adaptive core test-scales. These steps will be discussed
in more depth in the method section of this chapter.
5.2. Method
5.2.1. Participants
Participants were selected from an existing database of working adults tested for
development and selection purposes. Consequently, the sample was drawn using the
convenience sampling method. The sample was composed of 1,962 South African adults who
completed the non-adaptive full version of the BTI. The sample included men (62%) and
women (38%) with a mean age of 33 years (SD = 9.08, Md = 33 years) from various provinces
in South Africa. Ethnically, the sample comprised Black (54%), White (36%), Mixed
Race (6%), and Asian (4%) participants. All participants completed the BTI in English.
5.2.2. Instrument
The BTI items make use of a five-point Likert-type polytomous response scale
with response options that range from (1) ‘Strongly Disagree’ to (5) ‘Strongly Agree’
(Taylor & de Bruin, 2006, 2013). The BTI demonstrated satisfactory reliability on the
scale level with each of the five scales yielding Cronbach alpha coefficients above .87
across different South African ethnic groups (Taylor & de Bruin, 2013). Further factor
analyses indicated good congruence between the factor structures for Black and White
groups in South Africa with Tucker’s phi coefficients > .93 for the five scales (Taylor &
de Bruin, 2013).
Each scale of the BTI was further shortened and optimised by fitting the scales to
the Rasch rating scale model (refer to Chapter 4). A shortened optimised version of each
test-scale, which is composed of the core item banks to be used for computer adaptive
testing, was thus developed. Four of the five core scales (Extraversion, Neuroticism,
Conscientiousness, and Openness) were composed of 23 items each, with Agreeableness
optimised to 29 items. The original scales were composed of 36 items
for Extraversion, 34 items for Neuroticism, 41 items for Conscientiousness, 32 items for
Openness, and 37 items for Agreeableness. The core test-scale versions were thus
considerably shorter than the original fixed form test scales. The items for the core test-
scales were selected based on their fit to the Rasch rating scale model and a lack of any
practically and statistically significant differential item functioning. Item and person
parameters for the non-adaptive full test-scales and the non-adaptive core test-scales were
generated so that these parameters could be compared with the person parameters
estimated in the non-computer adaptive fixed form test versions.
5.2.3. Data Analysis
Computer adaptive test simulations were conducted using the Firestar computer
adaptive testing simulation software program version 1.2.2 developed by Choi (2009).
The Firestar computer adaptive simulation software uses the R framework (R Core Team,
2013) for statistical computing. This framework was used to generate adaptive core and
adaptive full person parameter estimates for comparison. Item parameters, person
responses, and person parameters were generated for the non-adaptive full version and
the non-adaptive core item bank for each scale of the BTI using the Rasch rating scale
model in Winsteps 3.81 (Linacre, 2014). Item parameters, person responses and person
parameters were then input into the Firestar framework for computer adaptive test
simulation using pre-selected item selection, person parameter estimation and stopping
rules. These rules are briefly discussed in the following sections.
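The simulation cycle into which these rules plug can be sketched as follows. This is an illustrative Python sketch, not the Firestar implementation; the estimator, item selector, and standard-error routines shown are toy stand-ins for the procedures described in the following sections.

```python
def simulate_cat(responses, estimate_theta, select_item, se_of, max_se=0.30):
    # Schematic adaptive administration loop: select an item, look up the
    # respondent's recorded (post hoc) response, re-estimate theta, and stop
    # once the standard error reaches the target or the bank is exhausted.
    administered, answered = [], {}
    theta = 0.0
    while len(administered) < len(responses):
        item = select_item(theta, administered)
        administered.append(item)
        answered[item] = responses[item]
        theta = estimate_theta(answered)
        if se_of(answered) <= max_se:
            break
    return theta, administered

# Toy stand-ins: a fixed response record, a running-mean "estimator",
# first-unused item selection, and an SE shrinking as 1/sqrt(items answered).
responses = [3] * 23
estimate = lambda ans: sum(ans.values()) / len(ans) - 3.0
select = lambda theta, used: min(i for i in range(23) if i not in used)
se = lambda ans: 1.0 / len(ans) ** 0.5

theta, used = simulate_cat(responses, estimate, select, se)
print(len(used))  # stops once 1/sqrt(n) <= .30, i.e. after 12 items
```

The loop structure mirrors the simulation logic: real responses are replayed post hoc, so no new data collection occurs, which is exactly why the psychological impact of the testing mode cannot be evaluated in this design.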
5.2.3.1. Item selection rules
For computer adaptive test simulation, the maximum posterior weighted
information criterion was used (van der Linden, 1998) for item selection with an expected
a priori estimator (Bock & Mislevy, 1982) as recommended by Choi et al. (2010) for use
with polytomous items (refer to section 5.1.3.1). The two-stage branching technique was
used, where a single fixed item with the highest information function based on the
posterior distribution of items was administered to all respondents before the remaining
items were administered adaptively. The two-stage branching technique allows interim
person parameter estimates to be more precisely estimated and is recommended for use
with items that use polytomous response scales, or with test-scales where interim person
parameter estimation may be unreliable (Choi et al., 2010a).
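A simplified version of this selection step can be sketched as follows. The sketch uses plain maximum information at the interim trait estimate; the maximum posterior weighted information criterion used in the study additionally weights each item's information by the interim posterior distribution of the trait. All item parameters below are hypothetical.

```python
import math

def rsm_probs(theta, delta, taus):
    # Category probabilities for a Rasch rating scale model item
    # (discrimination fixed at 1, thresholds shared across items).
    psi, total = [0.0], 0.0
    for tau in taus:
        total += theta - delta - tau
        psi.append(total)
    denom = sum(math.exp(p) for p in psi)
    return [math.exp(p) / denom for p in psi]

def item_information(theta, delta, taus):
    # For a Rasch-family polytomous item, Fisher information equals the
    # variance of the scored response: E[X^2] - (E[X])^2.
    probs = rsm_probs(theta, delta, taus)
    ex = sum(k * p for k, p in enumerate(probs))
    return sum(k * k * p for k, p in enumerate(probs)) - ex * ex

def select_next_item(theta_hat, deltas, taus, administered):
    # Return the unadministered item with maximum information at theta_hat.
    candidates = [i for i in range(len(deltas)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, deltas[i], taus))

deltas = [-1.0, -0.2, 0.4, 1.1]   # hypothetical item locations
taus = [-1.5, -0.5, 0.5, 1.5]     # hypothetical common thresholds
print(select_next_item(0.3, deltas, taus, administered={2}))
```

Because information peaks near the item location under the Rasch model, the item whose location lies closest to the interim estimate is selected, which is what makes adaptive administration efficient.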
5.2.3.2. Selection of item response theory model
All computer adaptive test simulations were conducted using item and person
parameter estimates obtained with the Rasch rating scale model (Wright & Douglas,
1986), which is a special case of the partial credit model where item discrimination
parameters are held constant and rating thresholds are constrained to be equal across
all the items of a scale (Andrich, 1978). This was done in the Firestar computer adaptive
testing framework by setting the item parameter discrimination index to one for all items
and using the Generalized Partial Credit Model option (GPCM, Choi, 2009).
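This constraint can be illustrated with a short sketch: a generalized partial credit model probability function in which fixing the discrimination a at 1, with one set of thresholds shared across items, yields Rasch rating scale model probabilities. The function name and parameter values are illustrative, not Firestar's.

```python
import math

def gpcm_probs(theta, a, delta, taus):
    # Generalized partial credit model category probabilities; the Rasch
    # rating scale model is the special case with discrimination a = 1 and
    # one common set of thresholds (taus) shared by all items of a scale.
    psi, total = [0.0], 0.0
    for tau in taus:
        total += a * (theta - delta - tau)
        psi.append(total)
    denom = sum(math.exp(p) for p in psi)
    return [math.exp(p) / denom for p in psi]

# Hypothetical five-category Likert item: four thresholds, location 0.2
taus = [-1.8, -0.6, 0.6, 1.8]
rasch = gpcm_probs(theta=1.0, a=1.0, delta=0.2, taus=taus)
print([round(p, 3) for p in rasch])  # probabilities over the 5 categories
```

Setting the discrimination index to one for every item in the GPCM option, as described above, is therefore equivalent to simulating under the Rasch rating scale model.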
5.2.3.3. Selection of stopping rules
Stopping rules were based on the standard error of person parameter estimates
where the computer adaptive test would stop administering items once an acceptable level
of measurement error for the person trait estimates was attained (Choi et al., 2009). The
standard error of person parameter estimation criterion used in item response theory is
related to the reliability of a test-scale in classical test theory where a measurement scale
with an internal consistency reliability of .90 corresponds to a standard error of about .33
(Rudner, 2014). Therefore, the standard error stopping rule was set to .30 in order to
estimate person parameter values as precisely as possible; this is a stringent criterion.
Wang and Shin (2010) recommend that, for acceptable precision, most test-takers'
estimated person parameters should fall within the 95% confidence intervals implied by
a specified standard error. The standard error of the
parameter estimates stopping rule was used instead of the maximum information
termination criterion because this technique allows for more precise person-parameter
estimation (Wang & Shin, 2010).
5.2.3.4. Evaluation criteria for the computer adaptive test versions
The Pearson product moment correlation between person parameter estimates for
the various test forms (i.e., adaptive core, non-adaptive core, non-adaptive full and
adaptive full) was used to establish how well the adaptive core test-scales functioned.
According to Hol et al. (2008) and Wang and Shin (2010), a correlation ≥ .90 between
person parameter estimates generated for the adaptive and non-adaptive test-scale versions
reflects a high level of person parameter equivalence between these test forms.
The shared variance between person parameter estimates (R²) was also calculated to
compare the person parameter estimates of the various test forms.
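As a minimal sketch of this evaluation criterion, the Pearson correlation and shared variance between two vectors of person parameter estimates can be computed as follows (the estimates shown are hypothetical, not BTI results):

```python
import math

def pearson_r(x, y):
    # Pearson product-moment correlation between two vectors of estimates.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical theta estimates for five respondents under two test forms
adaptive_core = [-1.2, -0.4, 0.1, 0.8, 1.5]
non_adaptive_full = [-1.0, -0.5, 0.2, 0.9, 1.4]

r = pearson_r(adaptive_core, non_adaptive_full)
print(round(r, 3), round(r ** 2, 3))  # correlation and shared variance R^2
print(r >= 0.90)                      # equivalence criterion (>= .90)
```

Squaring the correlation gives the shared-variance figures reported in the results (e.g. r = .95 implies R² ≈ 90.25%).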
5.2.3.4.1. Comparison 1: Comparing the non-adaptive core to the non-
adaptive full test-scales
The first step was to evaluate the correlations between the person parameter
estimates of the non-adaptive core test-scales with the non-adaptive full test-scales of the
BTI because this allows us to establish a baseline on the equivalence of the two
non-computer adaptive forms of the BTI scales. Equivalence between these two forms
indicates that the two test-scale forms measure the same construct despite the fact that
fewer items are available in the core versions. In essence, this first comparison helps to
eliminate the possibility that the shortened and optimised nature of the core test-scales
changed the estimation of person parameters substantially and thus allowed us to control
for this when investigating other properties such as the effect of adaptive administration.
5.2.3.4.2. Comparison 2: Comparing the adaptive and non-adaptive versions of
the test
Secondly, the correlation of the person parameter estimates of the adaptive core
version of the test-scales were evaluated against the corresponding non-adaptive core test-
scale counterparts. The same procedure was used with the adaptive full test-scales and
their non-adaptive full test-scale counterparts. This helped to determine to what degree
the adaptive nature of the test-scales possibly altered the psychometric properties of these
scales. Therefore, if equivalence between the person parameter estimates of the adaptive
and non-adaptive versions of each test-scale was reached, then it could be assumed that
the adaptive nature of the test-scales did not significantly affect the person parameter
estimates of the original non-computer adaptive test-scales in any substantial way. On the
other hand, if equivalence between the person parameter estimates of these test forms was
not found it may indicate that the adaptive nature of the test-scales changed the way
person parameters were estimated for the different test forms. Of course, this is dependent
on whether equivalence between the person parameter estimates of the non-adaptive core
and non-adaptive full test-scales was reached in the first comparison. This is because poor
equivalence at this stage may indicate fundamental differences related to the items used
in the test-scales and not the adaptive nature of these scales per se.
5.2.3.4.3. Comparison 3: Comparing the adaptive core test with the non-adaptive
full versions of the test
The third step was to evaluate the correlations between the person parameter
estimates of the adaptive core version of the test-scales with the person parameter
estimates of the non-adaptive full version of the test-scales and vice versa. This
comparison is the focus of most computer adaptive test evaluations (cf. Choi et al., 2010b;
Hart et al., 2006; Hogan, 2014; Smits et al., 2011; Walter et al., 2007). In this comparison
two sets of correlations between person parameter estimates were evaluated.
Firstly, the person parameter estimates of the adaptive core versions of the test-
scales were compared to the person parameter estimates of the non-adaptive full form
versions of the test-scales. This was done to establish whether the adaptive core versions
of the test-scales were able to recover the person parameter estimates of the non-adaptive
full versions of the test-scales. If equivalence between person parameter estimates of these
two forms were found it would mean that the adaptive nature of the core version of the
test-scales did not interfere with person parameter estimation in any substantial manner.
Cross-plots of person parameter estimates were generated using the adaptive core and
non-adaptive full test-scale forms because the adaptive core versions of the test-scales
would be used in the practical computer adaptive testing process. These cross-plots were
presented with the person parameter estimates’ 95% standard error confidence interval
bands to see whether the person parameter estimates of the full-form non-adaptive test-
scales differed substantially from the person parameter estimates estimated using the
adaptive core versions of the test-scales. If the majority of parameter estimates fell outside
the 95% confidence intervals then the test-scale forms could be considered
non-equivalent; the opposite is true if most of the cross-plotted person parameter estimates fall
within the 95% confidence interval bands.
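The band-coverage check described above can be sketched as follows; the estimates and standard errors are hypothetical and serve only to illustrate the decision rule:

```python
def within_ci_band(theta_core, theta_full, se_full, z=1.96):
    # Flag whether each adaptive-core estimate falls inside the 95%
    # standard-error confidence band around the non-adaptive full estimate.
    return [abs(a - b) <= z * s for a, b, s in zip(theta_core, theta_full, se_full)]

# Hypothetical estimates and standard errors for five respondents
core = [-1.1, -0.3, 0.2, 0.9, 2.6]
full = [-1.0, -0.5, 0.2, 0.8, 1.4]
se = [0.30, 0.30, 0.28, 0.29, 0.31]

flags = within_ci_band(core, full, se)
coverage = sum(flags) / len(flags)
print(flags, coverage)  # high coverage suggests equivalent test forms
```

In this toy example four of the five cross-plotted estimates fall within the band, so the forms would be judged approximately equivalent for those respondents.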
If there was little equivalence between the person parameter estimates of these two
test-scale forms, it may indicate that the adaptive core versions were not measuring the
same construct(s) in the same manner as the non-adaptive full version of the test-scales.
The opposite was also investigated where the person parameter estimates of the
adaptive full version of the test-scales were compared to the person parameter estimates
of the non-adaptive core test-scale versions. This informs us as to whether the adaptive
nature of the full version of the test-scales resulted in significant changes to the way
person parameters were estimated for test-takers. Of course this step is dependent on the
equivalence between the person parameter estimates of the test forms established in the
first and second BTI scale comparisons.
5.2.3.4.4. Comparison 4: Comparing the adaptive core test with the adaptive full
versions of the test
The final step in the process of investigating the correlations of person parameter
estimates of the computer adaptive version of the BTI was to compare the person
parameter estimates of the adaptive core version of the test-scales with the person
parameter estimates of the adaptive full version of the test-scales. This is an important
step, as it interrogates the value of the optimisation process applied to the core version of the test-scales.
As demonstrated in Chapter 3 and 4, item banks need to meet numerous criteria before
they are considered ready for practical computer adaptive testing. If the full form of the
test-scale, when simulated in a computer adaptive manner, generates person parameter
estimates that are equivalent to the core version of the test-scales it indicates that the
extensive preparation that the item bank has undergone may have been fruitless.
However, if there is poor equivalence between the person parameter estimates of the two
test forms, then this may also indicate that fundamentally the two forms are not measuring
the same construct. Therefore, this stage should indicate a moderate degree of person
parameter equivalence between the two versions, confirming that the same construct is
measured, while the adaptive core version should demonstrate 'purer' measurement after
extensive preparation for computer adaptive testing (refer to Chapter 3 and 4). Again, this
comparison is only useful if it has been established from Comparison 1 that the two forms
of the test-scales with different numbers of items were equivalent; and from Comparison
2 that the adaptive nature of the test did not interfere with measurement of the construct
of interest; and finally from Comparison 3 that the adaptive core test was able to recover
the person parameter estimates of the non-adaptive full test. Once these aspects were
investigated the comparison between adaptive versions of the core and full test-scales
could be evaluated to determine whether extensive preparation of an item bank for
computer adaptive testing resulted in optimising measurement.
5.2.3.5. Other performance criteria for computer adaptive tests
Although comparison of the correlations between the person parameter estimates
of test-forms is important to determine whether the adaptive core versions can recover the
original person parameter estimates of the non-adaptive full test-scales, other
performance indices must also be evaluated. More specifically, the performance of the
adaptive core test-scale versions must be compared to the adaptive full test-scale versions
in order to determine whether the adaptive core test-scales perform more effectively than
the adaptive full version of the test-scales. Choi et al. (2010), Thompson and Weiss (2011)
and Hogan (2014) recommend that comparisons be made between the adaptive full form
and adaptive core test-scale versions in order to determine whether the adaptive core test-
scales outperform the longer adaptive test-scales.
These psychometric comparisons include item usage statistics such as the number of
items administered over the trait continuum and the mean number of items used; the mean
item information functions of the items administered including the maximum attainable
information within a specific trait level; and the mean standard error of person parameter
estimation (Wang & Shin, 2010). In general, all these comparisons are relative, in that
they are compared across test forms. However, Wang and Shin (2010) suggest that these
criteria must favour the computer adaptive test version of the optimised item bank to
justify the use of this form of testing.
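A minimal sketch of how such item-usage summaries might be tabulated from simulation records is given below; the records are hypothetical:

```python
from statistics import mean

# Hypothetical per-respondent simulation records: number of items
# administered and final standard error of the person parameter estimate.
records = [
    {"items_used": 9, "se": 0.29},
    {"items_used": 12, "se": 0.30},
    {"items_used": 8, "se": 0.27},
    {"items_used": 23, "se": 0.33},  # bank exhausted before reaching SE <= .30
]

mean_items = mean(r["items_used"] for r in records)
mean_se = mean(r["se"] for r in records)
prop_reached_se = sum(r["se"] <= 0.30 for r in records) / len(records)
print(mean_items, round(mean_se, 3), prop_reached_se)
```

Comparing such summaries across the adaptive core and adaptive full forms shows whether the optimised item bank achieves the target precision with fewer items.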
5.2.4. Ethical Considerations
Ethical clearance for this study was obtained from the Faculty Ethics Committee
in the Faculty of Management at the University of Johannesburg. Permission to use the
data for research was granted from JvR Psychometrics, which owns the BTI database.
Only BTI data for which test-takers provided consent for research was used.
Confidentiality and anonymity of the data were maintained by not including any
identifying information in the BTI data-set.
5.3. Results
Firstly, the correlations between person parameter estimates for the adaptive core
and adaptive full test-scale versions were presented using the expected a priori (EAP)
estimator. The results of the correlations between the person parameter estimates of the
different versions of the BTI scales, namely the adaptive core, non-adaptive core, non-
adaptive full and adaptive full test-scale versions were then presented. A summary of the
item usage statistics, mean standard error of person parameter estimation and mean person
parameter estimates were also presented to evaluate the efficiency of the adaptive full and
adaptive core test-scales. Finally, the efficiency of each test-form was compared with
reference to the item usage statistics.
5.3.1. Comparing person parameter estimates of the different BTI scales
5.3.1.1. Extraversion
It can be noted in Table 5.1 that the person parameter estimates of the non-adaptive
core and non-adaptive full test-scales demonstrated a strong correlation of .95. This
means that the non-adaptive core and non-adaptive full test-scale versions shared
approximately 90% of the variance for their respective person parameter estimates for the
Extraversion scale.
Table 5.1
Correlations between test-form person parameter estimates for the BTI Extraversion scale

Measure             Adaptive Core   Non-Adaptive Core   Non-Adaptive Full
Adaptive Core       ___
Non-Adaptive Core   .93***          ___
Non-Adaptive Full   .89***          .95***              ___
Adaptive Full       .82***          .74***              .81***

Note. *** = p < .001
Furthermore, this indicated that person parameter estimates of the shortened core
version of the Extraversion test-scale were approximately equivalent to the full non-
adaptive version of the Extraversion test-scale and that the shortened and optimised nature
of the non-adaptive core test version did not affect the person parameter values estimated
in any substantial manner.
The correlations between the person parameter estimates of the adaptive core and
non-adaptive core version of the Extraversion scale also indicated a strong correlation of
.93. This means that the adaptive core version of the test-scale shared approximately
86.49% of the variance with the non-adaptive core version of the test-scale and thus the
adaptive nature of the core version did not drastically alter the values of the person
parameter estimates. However, the correlation between the person parameter estimates of
the adaptive full and non-adaptive full test-scale versions of the Extraversion scale
evidenced a slightly lower correlation of .81 when compared to the correlation between
adaptive core and non-adaptive core person parameter estimates. The person parameter
estimates of the adaptive full and non-adaptive full test-scale versions shared
approximately 65.61% of the variance which was substantially lower than the shared
variance between the person parameter estimates of the adaptive core and non-adaptive
core versions of the test-scales. This indicated that the person parameter estimates of the
full test-scale were more substantially affected by adaptive administration than those of the core test-scale.
When the person parameter estimates of the adaptive core test-scale for the
Extraversion scale was compared to the non-adaptive full version of the test-scale a strong
correlation was found (.89). This indicated that the adaptive core version shared
approximately 79.21% of the variance with the non-adaptive full test-scale. Therefore,
the adaptive core version of the Extraversion scale recovered the person parameter
estimates of the full non-adaptive version of the test-scale very well. As the recovery of
the person parameters are especially critical for the validation of the computer adaptive
test version of the BTI scales a cross-plot between the person parameter estimates of the
adaptive core and non-adaptive full test-scales is presented in figure 5.3.1.1.
It can be noted from the cross-plot that the person parameter estimates for both the
adaptive core and non-adaptive full test-scale versions were approximately equivalent
within the 95% standard error confidence interval bands. This was evident because most
of the cross-plotted person parameter estimates fell within the 95% standard error
confidence interval bands with only a few falling outside these bounds.
Figure 5.3.1.1 Cross Plot of Person Measures for the Adaptive Core and Non-Adaptive Full
Scales of Extraversion
Conversely, the adaptive full test-scale was unable to recover the non-adaptive core
person parameter estimates as effectively – with a correlation of .74. Thus, the adaptive
full test-scale shared approximately 54.76% of the variance with the non-adaptive core
test-scale person parameter estimates. Also, the adaptive core version of the test-scale
was better able to recover the original non-adaptive full form person parameter estimates
(r = .89) than the adaptive full test-scale, which evidenced a slightly lower correlation of
.81.
The correlation between the person parameter estimates of the adaptive full and
adaptive core test-scales was .82. This indicated that the adaptive core and adaptive full
test-scale versions of the BTI shared approximately 67.24% of the variance across their
respective person parameter estimates. This means that, for the Extraversion scale,
agreement between the adaptive core and non-adaptive core test-scales was substantially
stronger than agreement between the adaptive full and non-adaptive full test-scales.
When the recovery of the person parameter estimates of the adaptive core test-scale and
the non-adaptive full test-scale was considered, it appeared that the adaptive core
test-scale recovered the person parameter estimates of the non-adaptive full test-scale
and the non-adaptive core test-scale more effectively than the adaptive full test.
5.3.1.2. Neuroticism
Table 5.2 displays the correlation coefficients between the different person
parameter estimates of the different test-scale versions for the Neuroticism scale. It can
be noted that the correlation between the person parameter estimates for the non-adaptive
core and non-adaptive full test-scales was strong at .97.
Table 5.2
Correlations between test-form person parameter estimates for the BTI Neuroticism scale

Measure             Adaptive Core   Non-Adaptive Core   Non-Adaptive Full
Adaptive Core       ___
Non-Adaptive Core   .96***          ___
Non-Adaptive Full   .94***          .97***              ___
Adaptive Full       .94***          .92***              .93***

Note. *** = p < .001
Consequently, the variance shared between the person parameter estimates of the
non-adaptive core and non-adaptive full test-scales of the Neuroticism scale was
approximately 94.09%. This indicated that the optimised and shortened nature of the core
test scale did not alter the person parameter estimates substantially when compared to the
non-adaptive full test-scale.
When the person parameter estimates for the adaptive versions of the test-scales
(core and full scales) were compared to the person parameter estimates of their non-
adaptive counterparts, the adaptive core test-scale demonstrated a higher correlation (.96)
than the adaptive full test-scale (.93). When the shared variance between person
parameter estimates was compared, the adaptive core test-scale shared 92.16% of the
variance with its non-adaptive core test-scale counterpart whereas the adaptive full test-
scale shared 86.49% of the variance with its non-adaptive full test-scale counterpart. This
indicated that the adaptive core version of the test-scale was better able to recover its own
non-adaptive person parameter estimates when compared to the adaptive full version of
the test-scale.
When the person parameter values of the adaptive core test-scale were compared
to the person parameter values of the non-adaptive full test-scale a correlation of .94 was
found. As the adaptive core test-scale is the scale that will be used in practical adaptive
testing a cross-plot of person parameters between the adaptive core and non-adaptive full
test-scales was generated (refer to figure 5.3.1.2 below).
Figure 5.3.1.2 Cross Plot of Person Measures for the Adaptive Core and Non-Adaptive Full
Scales of Neuroticism
It can be noted in this figure that the person parameter estimates estimated by the
adaptive core test-scale version were approximately equivalent to the person parameters
estimated using the non-adaptive full test-scale version within the 95% standard error
confidence interval bands. This indicates that the adaptive core version of the test-scale
is capable of recovering the person parameters of the original non-adaptive full form test-
scale quite well.
However, when the person parameter values of the adaptive full test-scale were
compared to the non-adaptive core test-scale, the correlation was slightly lower at .92.
By contrast, the person parameter values estimated by the adaptive core test-scale shared
approximately 88.36% of the variance with the person parameter values estimated by the
original non-adaptive full version of the test-scale. In comparison, the person parameter values estimated by the
adaptive full test-scale shared 84.64% of the variance with the person parameter estimates
of the non-adaptive core test-scale. This indicated some deviation of the person parameter
estimates of the adaptive full test-scale when compared to the adaptive core test-scale.
Additionally, the adaptive core test-scale was able to recover the person parameter
estimates of the original non-adaptive full form test-scale slightly better with a correlation
of .94 when compared to the adaptive full test-scale which had a correlation of .93 with
the original non-adaptive full form test-scale. The correlation of person parameter
estimates between the adaptive core test-scale and the adaptive full test-scale indicated
that approximately 88.36% of the variance between person parameter estimates was
shared. This demonstrated that the adaptive full and adaptive core test-scales of the
Neuroticism scale were almost equivalent. However, slightly improved correlations of
the adaptive core test-scale person parameter estimates, when compared to the adaptive
full test-scale, indicated that the adaptive core version was slightly better at
estimating person parameters equivalently to the original full form test-scale of the
Neuroticism scale.
5.3.1.3. Conscientiousness
Table 5.3 summarises the correlations between the person parameter estimates of
the different versions of the Conscientiousness scale. The correlation between the person
parameter estimates for the non-adaptive core and non-adaptive full versions of the test-
scale of the Conscientiousness scale was .94.
Table 5.3
Correlations between test-form person parameter estimates for the BTI Conscientiousness scale

Measure             Adaptive Core   Non-Adaptive Core   Non-Adaptive Full
Adaptive Core       ___
Non-Adaptive Core   .95***          ___
Non-Adaptive Full   .93***          .94***              ___
Adaptive Full       .85***          .86***              .93***

Note. *** = p < .001
The person parameter estimates of the non-adaptive full test scale and the non-
adaptive core test-scale shared approximately 88.36% of the variance. This indicated that
the difference in items administered between the optimised shortened core version of the
test-scale and the full version of the test-scale had little impact on the person parameter
estimates. When the person parameter estimates of the adaptive core and the adaptive
full test scales were compared to their non-adaptive counterparts the correlation between
these person parameter estimates for the adaptive core and non-adaptive core (.95) test-
scales was slightly higher than for the adaptive full and non-adaptive full test-scales (.93).
The person parameter estimates of the adaptive core thus shared 90.25% of the variance
with its non-adaptive core counterpart whereas the person parameter estimates of the
adaptive full version of the test shared 86.49% of the variance with its non-adaptive full
test-scale counterpart. This indicated that the adaptive core test-scale was better able at
recovering its own non-adaptive person parameter estimates than the adaptive full test-
scale.
In contrast to the Extraversion and Neuroticism scales, the adaptive full version of
the Conscientiousness scale recovered the original full form non-adaptive person
parameter estimates just as well as the adaptive core test-scale, both evidencing a
correlation of .93. The adaptive core and the adaptive full test-scale of the
Conscientiousness scale thus shared the same variance across person parameter estimates
(86.49%). However, the adaptive full test-scale was not able to recover the person
parameter estimates of the non-adaptive core test-scale as well (r = .86) as the adaptive
core version of the test could recover the person parameter estimates of the non-adaptive
full test-scale (r = .93). If this result is taken into account when the correlation between
the person parameter estimates of the two adaptive versions of the test-scales is
compared (.85), it becomes evident that the adaptive core test-scale estimated person
parameters somewhat differently from the adaptive full test-scale. The person parameter
estimates of the adaptive core version of the test shared approximately 72.25% of the
variance with the adaptive full version of the test and the person parameter estimates of
the adaptive full test-scale shared approximately 73.96% of the variance with the non-
adaptive core test-scale. This indicates that there was a slight difference in person
parameter estimation between the adaptive core and adaptive full test-scales for the
Conscientiousness scale. However, as the adaptive core test-scale was able to recover its
own non-adaptive person parameter estimates better than the adaptive full version of the
test-scale it may indicate better performance, albeit slightly, for the adaptive core test-
scale.
The cross-plot of the person parameters of the adaptive core and
non-adaptive full test-scale versions (refer to Figure 5.3.1.3) shows that both sets of
parameter estimates fell within the 95% standard error
confidence interval bands. This outcome indicates good equivalence of person parameter
estimation between the adaptive core and the non-adaptive full form test-scale.
Figure 5.3.1.3 Cross Plot of Person Measures for the Adaptive Core and Non-Adaptive Full
Scales of Conscientiousness
5.3.1.4. Openness
The correlation between the person parameter estimates of the non-adaptive core
and non-adaptive full test-scales of the Openness scale was strong at .96 (refer to table
5.4). This indicated that the person parameter estimates of the non-adaptive full test-scale
shared approximately 92.16% of the variance with the person parameter estimates of the
non-adaptive core test-scale.
Table 5.4
Correlations between test-form person parameter estimates for the BTI Openness
scale
Measure Adaptive Core Non-Adaptive Core Non-Adaptive Full
Adaptive Core ___
Non-Adaptive Core .96*** ___
Non-Adaptive Full .92*** .96*** ___
Adaptive Full .89*** .86*** .91***
Note. ***p < .001
Consequently, the optimised and shortened nature of the core test-scale appeared
to make little difference to person parameter estimates. When the person parameter
estimates of the adaptive core and adaptive full test-scales were compared to their non-
adaptive counterparts, the adaptive core test-scale evidenced a stronger correlation (.96)
than the adaptive full test-scale (.91) for the estimated person parameters. The
adaptive core person parameter estimates shared approximately 92.16% of the variance
with the non-adaptive core test-scale, whereas the person parameter estimates of the
adaptive full version of the test shared 82.81% of the variance with the person parameter
estimates of its non-adaptive counterpart. Consequently, this indicates that the
adaptive core test-scale was better able to recover the person parameter estimates of its
non-adaptive counterpart than the adaptive full test-scale. Also, when the correlations
between person parameter estimates for the adaptive core and the full form non-adaptive
test-scales (.92) and the adaptive full and non-adaptive full test-scales (.91) were
compared, the adaptive core test-scale was slightly better able to recover the person
parameter estimates of the non-adaptive full test-scale. The shared variance between the
person parameter estimates of the adaptive core and non-adaptive full test-scales
amounted to 84.64% whereas the shared variance between the person parameter estimates
of the adaptive full and non-adaptive full test-scales amounted to 82.81%. These results
indicated that the adaptive core test-scale was slightly better at recovering person
parameter values estimated by the full non-adaptive version of the test.
When the person parameter estimates of the adaptive full version of the test were
correlated with the non-adaptive core version of the test (.86), a slightly greater deviation
between person parameter estimates was found. This is especially evident when the
correlation between the person parameter estimates of the adaptive core test-scale and the
non-adaptive full test-scale (.92) was considered. This means that the person parameter
estimates of the adaptive core test-scale shared approximately 84.64% of the variance
with the person parameter estimates of the non-adaptive full test-scale.
Again, as this is an important comparison (adaptive core with non-adaptive full),
the person parameter estimates of both versions were cross-plotted within their respective
95% standard error confidence interval bands (refer to Figure 5.3.1.4 below).
Figure 5.3.1.4 Cross Plot of Person Measures for the Adaptive Core and Non-Adaptive Full
Scales of Openness
It can be noted from the cross-plot that the person parameters
are approximately equivalent between the adaptive core and non-adaptive full test-scale
versions within the 95% standard error confidence interval bands. This indicated that the
shortened and adaptive core scale was able to recover the person parameter
estimates produced by the non-adaptive full form test-scale.
The adaptive full test-scale, on the other hand, shared only about 73.96% of the
variance with the person parameter estimates of the non-adaptive core test-scale. If this
result is considered alongside the correlation between person parameter
estimates for the adaptive core and the adaptive full test-scales (.89), with
a shared variance of 79.21%, then it becomes evident that the person parameter estimates
of the adaptive core test-scale differed somewhat from the person parameter estimates of
the adaptive full test-scale. However, since the adaptive core test-scale was better able to
recover its own non-adaptive core and the non-adaptive full person parameter estimates
when compared to the adaptive full test-scale, the adaptive core test-scale slightly
outperformed the adaptive full test-scale of the Openness scale.
5.3.1.5. Agreeableness
The correlation between the person parameter estimates for the non-adaptive full
test-scale and the non-adaptive core test-scale of the Agreeableness scale was strong
at .97 (refer to Table 5.5). This means that the person parameter
estimates of the non-adaptive core and non-adaptive full test-scale versions shared
approximately 94.09% of the variance.
Table 5.5
Correlations between test-form person parameter estimates for the BTI
Agreeableness scale
Measure Adaptive Core Non-Adaptive Core Non-Adaptive Full
Adaptive Core ___
Non-Adaptive Core .93*** ___
Non-Adaptive Full .92*** .97*** ___
Adaptive Full .86*** .79*** .86***
Note. ***p < .001
Consequently, the shortened and optimised nature of the non-adaptive core test-
scale appeared to make little difference to the estimated person parameter values.
However, the adaptive core test-scale appeared to recover its non-adaptive person
parameter estimates better than the adaptive full test-scale. The adaptive core test-scale
person parameter estimates correlated .93 with the non-adaptive core test-scale person
parameter estimates; and the adaptive full test-scale person parameter estimates correlated
.86 with the non-adaptive full test-scale person parameter estimates.
A cross-plot of person parameter estimates, as estimated by the adaptive core and
non-adaptive full form test-scales for Agreeableness, is presented in Figure 5.3.1.5 below.
Figure 5.3.1.5 Cross Plot of Person Measures for the Adaptive Core and Non-Adaptive Full
Scales of Agreeableness
It can be noted from the figure that the person parameter estimates of the adaptive
core test-scale were approximately equivalent to the person parameters estimated by the
non-adaptive full test-scale (within the bounds of the 95% standard error confidence
intervals). This once again indicates that the adaptive core version of the test-scale was
able to effectively recover the person parameter estimates produced by the non-adaptive
full form of the scale.
The correlations between the person parameter estimates of the different test
scales translated to 86.49% shared variance between person parameter estimates for the
adaptive core and non-adaptive core test-scales compared to 73.96% shared variance
between the person parameter estimates for the adaptive full and non-adaptive full test-
scales.
Additionally, if the correlation between the person parameter estimates of the
adaptive core test-scale and the non-adaptive full test-scale is considered (.92) and
compared to the correlation between the person parameter estimates of the adaptive full
and non-adaptive full test-scales (.86), it became evident that the adaptive core test-scale
was better able to recover the person parameter values of the non-adaptive full test-scale
version. The difference between the adaptive full and adaptive core test-scales becomes
even more pronounced when the person parameter estimates of each test-scale were
correlated with the non-adaptive full test-scale and non-adaptive core test-scale respectively.
The person parameter estimates of the adaptive core test-scale correlated .92 with the
person parameter estimates of the non-adaptive full test-scale, and the person parameter
estimates of the adaptive full test-scale correlated .79 with the person parameter estimates
of the non-adaptive core test-scale. This translated to a shared variance between person
parameter estimates for the adaptive core and non-adaptive full test-scales of 84.64%
compared to the shared variance between person parameter estimates of the adaptive full
and non-adaptive core test-scales of approximately 62.41%.
When the person parameter estimates of the adaptive core test-scale were
compared directly to the person parameter estimates of the adaptive full test-scale the
correlation was .86 with an associated shared variance of 73.96%. These results indicated
that the person parameter estimates of the adaptive core test-scale version of the
Agreeableness scale differed substantially from the person parameter estimates of the
adaptive full test-scale version. However, since the adaptive core test-scale was better
able to recover its own non-adaptive person parameter estimates, as well as the person
parameter estimates of the non-adaptive full test-scale, in comparison to the adaptive full
test-scale, it can be argued that the adaptive core test-scale performed better than its adaptive
full test-scale counterpart.
The next section summarises the correlations for person parameter estimates
between the adaptive full and adaptive core test-scale versions. This section also presents
summary statistics on the performance of the adaptive full test-scale versions and how
this compares to the performance of the adaptive core test-scale versions. The
performance of either the test-scale versions was evaluated by looking at the person
parameter recovery statistics, item usage statistics, the mean standard error of person
parameter estimation, and the mean person parameter values.
5.3.2. Computer adaptive core test performance indices
Correlations between the person parameter estimates of the different test forms
have indicated that the adaptive core versions for each of the scales of the BTI recover
the person parameter estimates of the non-adaptive full test-scales as well as, if not
better than, the adaptive full test-scale versions do. This bodes well for the computer adaptive test
performance of the core adaptive test-scale versions. However, there are numerous other
computer adaptive performance indices that need to be evaluated to determine whether
the core adaptive versions function better than the full adaptive versions of the test-scales.
Table 5.6 summarises a number of comparison indices between the adaptive core and the
adaptive full test-scales of the BTI.
5.3.2.1. Recovery of the adaptive core and adaptive full person parameters
In the previous section, the correlations between person parameter estimates for
the different test-scale versions of the BTI were reported. Table 5.6 summarises the
correlations between person parameter estimates for the non-adaptive versions of both the
core and full test-scales (rNA) for ease of reference. It also summarises the correlations
between the person parameter estimates of the adaptive core and adaptive full test-scales
and the non-adaptive full test-scales (rFull).

Table 5.6
Performance indices of the adaptive core and adaptive full test-scales
Indices Extraversion Neuroticism Conscientiousness Openness Agreeableness
Core Full Core Full Core Full Core Full Core Full
r NA .93 .81 .96 .93 .95 .93 .96 .91 .93 .86
r Full .89 ___ .94 ___ .93 ___ .92 ___ .91 ___
Start Item bti11 bti25 bti40 bti42 bti103 bti103 bti153 bti153 bti158 bti166
Mn. Item 11.84 9.91 13.82 12.71 20.41 18.02 14.53 12.07 14.79 11.97
Mn Info. .48 .48 .47 .48 .46 .47 .47 .48 .47 .47
Mean SE .29 .29 .30 .30 .32 .30 .30 .29 .30 .29
Min SE .28 .27 .28 .28 .28 .28 .28 .27 .28 .28
Max SE .40 .30 .41 .37 .38 .33 .40 .37 .35 .34
SE SD .01 .01 .02 .01 .03 .01 .02 .01 .01 .01
SE Range .12 .03 .13 .09 .10 .05 .12 .10 .08 .06
Mean θ .47 .27 -.97 -.92 1.76 1.40 .98 .78 .99 .76
Min θ -1.71 -2.21 -3.36 -3.53 -2.20 -2.24 -1.54 -1.46 -2.18 -2.15
Max θ 3.10 2.35 2.31 2.41 3.62 3.61 3.47 3.52 3.62 3.61
θ SD .60 .54 .83 .78 1.03 .80 .70 .59 .73 .60
θ Range 4.80 4.56 5.68 5.93 5.82 5.85 5.01 4.99 5.80 5.76
Note. SE = standard error; SD = standard deviation; Mn. Item = mean number of items
administered; Mn Info. = mean item information; r NA = correlation with the corresponding
non-adaptive version; r Full = correlation with the non-adaptive full test-scale; θ = theta,
the person parameter estimate.

What can be noted in this summary of
correlations is that (a) the adaptive core test-scale versions better recover the non-adaptive
core person parameter estimates than the adaptive full test-scale versions do; and (b) that
for all the scales of the BTI, except for Conscientiousness, the adaptive core test-scales
recover the person parameter estimates of the non-adaptive full test scales better than the
adaptive full test-scale versions. For Conscientiousness, the adaptive core version equals
the adaptive full test-scale in recovering the person parameter estimates. The next
section reports on the summary of item usage statistics.
5.3.2.2. Summary of the item usage statistics
The computer adaptive test-scale versions of the BTI use a two-stage branching
technique where certain starting items are given to all the test-takers in a uniform manner.
These items and their usage can be viewed in Table 5.6. For both the Conscientiousness
and Openness core and full adaptive test-scales, bti103 and bti153 were administered as
starting items. For Extraversion, bti11 was administered for the adaptive core test-scale
and bti25 for the adaptive full test-scale. For Neuroticism, bti40 was administered for the
adaptive core test-scale and bti42 for the adaptive full test-scale. Finally, for
Agreeableness, bti158 was administered for the adaptive core test-scale and bti166 for the
adaptive full test-scale. These items had the highest information functions for each scale
of the BTI and were situated near the middle of the trait distribution for persons.
Therefore, these items were considered optimal for interim person parameter
estimation using the fixed two-stage branching technique.
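As a rough illustration of how such a uniform starting item can be chosen, the sketch below picks the item with the highest Fisher information at the centre of the trait distribution. This is a simplified dichotomous Rasch sketch with invented item identifiers and locations; the BTI items are polytomous and the real item parameters are not reproduced here.

```python
import math

def info(theta, b):
    """Fisher information of a dichotomous Rasch item at trait level theta:
    I(theta) = p(1 - p), where p is the endorsement probability."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def starting_item(bank, centre=0.0):
    """Uniform start item for two-stage branching: the item that is most
    informative at the centre of the person trait distribution."""
    return max(bank, key=lambda item: info(centre, item["b"]))

# Invented three-item bank; locations b are in logits.
bank = [{"id": "item_a", "b": -1.2},
        {"id": "item_b", "b": 0.1},
        {"id": "item_c", "b": 1.4}]
print(starting_item(bank)["id"])  # item_b: nearest the centre, highest info
```

In the same spirit, the bti items named above were those with the highest information near the middle of each scale's trait distribution.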
The mean number of items administered by the adaptive core and adaptive full
test-scale versions indicated that the adaptive full test-scales used on average fewer items
to estimate the latent trait than the adaptive core test-scale versions. The differences between the
average number of items used by the adaptive full and adaptive core test-scales were as
follows: Extraversion 1.93 items; Neuroticism 1.11 items; Conscientiousness 2.39 items;
Openness 2.46 items; and Agreeableness 2.82 items. On average the adaptive full test-
scales used 2.14 fewer items than the adaptive core test versions to accurately estimate
person parameters. Further item usage statistics for the adaptive full and adaptive core
test-scales can be viewed from Figure 5.1a to 5.5b in Appendix A.
5.3.2.2.1. Number of items administered for the adaptive core and full test
scales
For Extraversion the adaptive full test-scale version most frequently administered
9 items per test-taker whereas the adaptive core test-scale most frequently administered
11 items per test-taker. The most frequently administered number of items per test-taker
for the Neuroticism adaptive full test-scale was 10 items compared to 11 items for the
adaptive core test-scale. For Conscientiousness the most frequently administered number
of items per test-taker was 13 for the adaptive full test-scale version whereas the adaptive
core test-scale version most frequently administered 23 items. For Openness both the
adaptive full and adaptive core test-scale versions administered 11 items most frequently
per test-taker. Finally, for Agreeableness the adaptive full test-scale version administered
11 items most frequently per test-taker whereas the adaptive core test-scale version
administered 12 items most frequently per test-taker.
Although fewer items were available for administration in the adaptive core item
bank, more items were administered to estimate the latent trait for test-takers than by the
longer adaptive full test-scales. It can also be noted that the mean item information
functions were relatively equivalent between the adaptive core and full test-scale
versions. Because the adaptive core test-scale versions had fewer items to administer, with
approximately the same mean item information as the longer adaptive full test-scales,
more items were used by the adaptive core test-scales to estimate person parameters. This
is because some items that target the extremes of the person distributions have been
removed from use in the adaptive core test-scale versions (refer to Chapter 4) and thus
fewer items are available to target persons at the extremes of the person trait distribution.
This results in slightly higher item usage by the adaptive core test-scale versions and is
not unexpected.
The next section reports on the mean item information, standard error of person
parameter estimation and person parameter statistics for the adaptive full and adaptive
core test-scale versions. The efficiency of items administered with the non-computer
adaptive test-scale versions of the BTI was also compared.
5.3.2.2.2. Summary of the mean item information, standard error of person
parameter estimates and person parameter statistics
As mentioned in the previous section, the items have relatively similar mean
information function values regardless of whether the adaptive full or adaptive core test-
scale versions were administered. What can be noted in Table 5.6 is that the mean
information functions of the items differ marginally between the adaptive full and
adaptive core test-scale versions with the adaptive core test-scales evidencing slightly
lower mean item information statistics. This may have resulted in slightly higher item
usage statistics for the adaptive core test-scales. The item information functions relative
to the latent trait and the standard error of person parameter estimation can be seen in
Figures 5.6a to 5.10b in Appendix B. Extraversion is used as an example to illustrate how
the different versions of the test-scales functioned.
As the stopping rule for item administration was set to the standard error ≤ .30
criterion, items were administered until the standard error of the person parameter
estimate met this criterion as closely as possible. To demonstrate how the standard error criterion
stopping rule and the item information functions affect the number of items administered
refer to Extraversion in Figure 5.6a for the adaptive full test version and Figure 5.6b for
the adaptive core test versions. For the other scales, refer to Figures 5.7a to 5.15b in the
appendix of this document.
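To make these mechanics concrete, the following sketch simulates maximum-information item selection with the SE ≤ .30 stopping rule under a Rasch rating-scale model. The item locations and Andrich thresholds are invented for illustration and do not reproduce the BTI item bank; the standard error is taken as the reciprocal square root of the accumulated test information, the usual Rasch convention.

```python
import math

TAU = [-1.5, -0.5, 0.5, 1.5]  # invented Andrich thresholds (5 categories)

def category_probs(theta, b):
    """Category probabilities under the Rasch rating-scale model."""
    etas, run = [0.0], 0.0
    for tau in TAU:
        run += theta - b - tau
        etas.append(run)
    exps = [math.exp(e) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]

def item_info(theta, b):
    """Fisher information of a rating-scale item: variance of the item score."""
    probs = category_probs(theta, b)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))

def administer(theta, bank, se_criterion=0.30):
    """Give the most informative remaining item until the standard error,
    1/sqrt(total information), meets the criterion or the bank runs out."""
    remaining, total_info, n = list(bank), 0.0, 0
    while remaining:
        best = max(remaining, key=lambda b_: item_info(theta, b_))
        remaining.remove(best)
        total_info += item_info(theta, best)
        n += 1
        if 1.0 / math.sqrt(total_info) <= se_criterion:
            break
    return n, 1.0 / math.sqrt(total_info)

# Invented 25-item bank with locations spread from -3 to +3 logits.
bank = [-3.0 + 0.25 * k for k in range(25)]
n_items, se = administer(0.0, bank)
```

Under this sketch a well-targeted respondent reaches the criterion before the bank is exhausted, while respondents at the extremes of the trait continuum consume more items, mirroring the pattern reported for the BTI scales.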
For the adaptive full scale of Extraversion, it can be noted that the standard error criterion
is reached once approximately 9 items have been administered. It can also be noted that for
each item administered the highest attainable information (per item) is found near the center of
the trait distribution, as this is where the item parameters are most optimally targeted to person
parameters and is thus where the most information per item is available. At the extremes of the
trait distribution available item information ‘dwindles’, as fewer items that measure very low
or high levels of the trait were available for administration. Also, there were fewer persons
available at these extreme low and high trait levels for adequate item parameter estimation, thus
resulting in slightly less precise item parameters.

Figure 5.6a. Maximum attainable information contributed with each item administered for the
Extraversion full item bank.
This can be noted when the mean standard error of the person parameter estimates is
taken into account in Table 5.6. For example, if the trait level at the standard error of person
parameter estimates of .30 is contrasted between the adaptive full and adaptive core test-scale
versions for Extraversion it can be noted that item information, as a product of the range of the
trait, is slightly greater for the adaptive full version of the test-scales. This was because there
were more items available for administration for the adaptive full test-scale versions than for
the adaptive core test-scale versions, which allowed for improved trait-item targeting.

Figure 5.6b. Maximum attainable information contributed with each item administered for the
Extraversion core item bank.
However, the mean standard error for each scale, for both the adaptive full and adaptive
core test-scales, remained within acceptable limits and did not deviate greatly from one another
(refer to Table 5.6). Only Conscientiousness had a mean standard error of .32 which was
slightly above the required mean standard error cut-off of .30 but still within the recommended
standard error ≤ .33, which is considered a very stringent criterion. What was different between
the adaptive full and adaptive core test-scale versions was the range of the person parameter
estimates and mean standard error, where the adaptive core test-scales had a slightly greater
person parameter estimate range and standard error range than the adaptive full test-scale. This
can be attributed to the precision of measurement which is directly related to the number of
items available for administration in the item bank.
When the standard deviation indices were consulted in Table 5.6 for the person parameter
estimates and the mean standard error of person parameter estimates; the adaptive core test-
scale evidenced slightly higher standard deviation values than the adaptive full test-scale for
both these indices. This was because fewer items were available to target the extremes of the
measured trait for each scale, but it also indicates better relative item parameter spread.
When the item usage statistics across the trait continuum was consulted in Appendix C
from Figures 5.11a to 5.15b it can also be noted that more items were administered at higher
and lower trait levels of the trait continuum to reach the standard error criterion of ≤ .30.
Figure 5.11a shows that for the adaptive full test-scale version of Extraversion at most
10 items were needed to reach the standard error ≤ .30 criterion for persons whose trait
estimates fell between -1.00 and 1.00 logits. However, above and below these trait levels more and more
items were needed to reach the standard error criterion. For example, above 2.00 logits up to
30 items were needed to reach the standard error criterion, although the frequency of
administering more than 10 items to reach this criterion was relatively small.
Figure 5.11a. Number of items administered across the trait continuum
for the Extraversion full item bank.
In contrast, the adaptive core test-scale version used between 10 and 12 items most
frequently to estimate person parameters within the standard error criterion. Also, there was
more frequent administration of more than 12 items for persons with a trait level above 2.00
logits than for the adaptive full test-scale version. A similar pattern of item administration can
be seen for the other scales of the BTI between the adaptive core and adaptive full test-scale
versions. Therefore, a smaller item bank with fewer items targeted at the extremes of the trait
distribution results in proportionately higher item usage statistics.
This is most evident with the Conscientiousness scale. With reference to Chapter 4,
Conscientiousness was one of the only scales of the BTI where the available items, given their
difficulty values, were unable to target persons with high levels of the Conscientiousness trait
(although persons with high levels of the trait were present in the sample distribution). With
reference to the mean items administered for Conscientiousness in Table 5.6 for the adaptive
full and adaptive core test-scale versions, it can be noted that both test versions used more items
to reach the standard error criterion than the other scales. In most cases the adaptive core test-
scale version administered all 23 items in the shortened item bank to reach the standard error
criterion (see Figures 5.3a and 5.3b in Appendix A). Conscientiousness also had the highest
mean standard error of person parameter estimation for the adaptive core test-scale version
when compared to the other adaptive core test-scale versions for the other scales.

Figure 5.11b. Number of items administered across the trait continuum for the Extraversion
core item bank.
The maximum attainable information per item administered reflects these findings (refer
to Figures 5.6a to 5.10b in the appendix of this document), particularly how the maximum
information functions of the adaptive core test-scale version become smaller as a product of
the range of the trait measured when compared to the adaptive full test-scale version (refer to
Figure 5.8b and 5.8a in Appendix B). Finally, with reference to the item usage statistics in
Appendix A it can be noted that the administration of more items at the extremes of the trait
distribution was more frequent for the adaptive core test-scale version of Conscientiousness
than for any other adaptive core test-scale version for the other scales of the BTI (refer to Figure
5.13a and 5.13b). The adaptive core test-scale versions of all the other scales followed a similar
pattern to the Conscientiousness scale with more items being administered to reach the standard
error criterion. However, major efficiency gains were still made when the adaptive core
test-scale versions were compared to the non-adaptive full test-scale versions, which to some
degree justifies the slightly higher item usage statistics. The efficiency of the
items administered is reported in the next section.
5.3.2.3. Efficiency gains of adaptive item administration compared to non-adaptive
administration
Table 5.7 summarises the proportion of test items that were not administered on average
by the adaptive test-scale versions in comparison to the non-adaptive test-scale versions. What
is evident are the gains in efficiency made by both the adaptive core and adaptive full test-scale
versions when compared to their full form non-adaptive test-scale versions.
Table 5.7
Percentage of items not administered by the adaptive test versions
Scale Adaptive Core/Non-Adaptive Full Adaptive Full/Non-Adaptive Full Adaptive Core/Non-Adaptive Core
Extraversion 67% 72% 49%
Neuroticism 59% 63% 40%
Conscientiousness 50% 56% 11%
Openness 55% 62% 37%
Agreeableness 60% 68% 49%
Note. Percentages reflect the proportion of items that were not administered by the adaptive test
versions in comparison to the non-adaptive test versions
On average the adaptive core test-scale versions resulted in between 50% and 67%
greater efficiency of item usage when compared to the non-adaptive full form test-scales. This
amounts to an average item administration reduction of approximately 58%. The adaptive full
test-scales improved item administration efficiency by between 56% and 72% when compared
to the non-adaptive full test-scale versions. This resulted in an even higher item administration
reduction of approximately 64%. There are also gains in efficiency when the shortened
adaptive core test-scale versions are compared to their own non-adaptive core test-scale
versions, with efficiency gains ranging from 11% to 49%. On average the adaptive core test-
scales had a 37% item administration reduction when compared to the non-adaptive core test-
scales. This means that the adaptive core test-scales reduced average item administration by
about one third even relative to their already shortened non-adaptive core counterparts.
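The percentages in Table 5.7 follow directly from the mean number of items administered relative to the size of the relevant item bank. A small sketch for the Conscientiousness scale (the 23-item core bank is stated in the text; the 41-item full bank size is inferred from the reported percentages and should be treated as an assumption):

```python
# Mean items administered are taken from Table 5.6 (Conscientiousness).
mean_items = {"core": 20.41, "full": 18.02}
bank_size = {"core": 23, "full": 41}   # 41 is inferred, not stated in this chapter

def pct_not_administered(mean_used, bank_len):
    """Share of the item bank an average administration never touches."""
    return round(100 * (1 - mean_used / bank_len))

print(pct_not_administered(mean_items["core"], bank_size["core"]))  # 11
print(pct_not_administered(mean_items["full"], bank_size["full"]))  # 56
print(pct_not_administered(mean_items["core"], bank_size["full"]))  # 50
```

Under these assumed bank sizes the three values match the Conscientiousness row of Table 5.7 (11%, 56%, and 50%).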
5.4. Discussion
Against the background of previous literature on the comparison of computer adaptive
test-scale versions with their non-adaptive counterparts, results indicate that computer adaptive
tests can be more efficient than, and as accurate and precise as, their non-computer adaptive counterparts
(cf. Betz & Turner, 2011; Chien et al., 2009; Forbey & Ben-Porath, 2007; Gibbons et al., 2008;
Hobson, 2015; Hol et al., 2008; Pitkin & Vispoel, 2001; Reise & Henson, 2000; Smits et al.,
2011; Triantafillou et al., 2008). Direct comparison between the computer adaptive test-scale
versions of inventories and their non-adaptive counterparts is an essential step in the evaluation
of a computer adaptive test, as it can determine the metric equivalence and feasibility of such
adaptive tests (cf. Mead & Drasgow, 1993; Pomplun, Frey, & Becker, 2002; Vispoel, Wang,
& Bleiler, 1997; Vispoel et al., 1994; Wang & Kolen, 2006).
This study aimed to determine whether an optimised computer adaptive test could be
as efficient, accurate, and precise as its non-computer adaptive test-scale counterparts.
Consequently, the purpose of this study was to evaluate the BTI as a computer adaptive
test by comparing, through computer adaptive test simulation, the functioning of the computer
adaptive test-scales of the BTI to their non-adaptive fixed form test-scale counterparts. This
was accomplished through the generation of an optimised “core” item bank for each scale of
the BTI, which was simulated within a computer adaptive test framework. This simulation
made use of real test-taker response data garnered using the un-optimised “full” non-adaptive
BTI test-scale forms. Comparisons between the estimated person parameter values for the non-
adaptive core, adaptive core, adaptive full and non-adaptive full test-scale forms were made.
Item usage statistics, the standard error of person parameter estimates and the mean item
information functions were then compared between the adaptive core and adaptive full test-
scale versions.
The outcomes of the correlations between person parameter estimates for the various
test-scale forms are discussed, and then the computer adaptive performance indices of
the adaptive full and adaptive core test-scale versions are evaluated. The limitations of this
study are then discussed and recommendations are made for future research.
5.4.1. Correlations between person parameter estimates of the various adaptive and
non-adaptive test forms
Comparisons between the person parameter estimates for the various test forms lent
extensive support to the use of the adaptive core test-scale version. This is most notable when
the comparison between the person parameter estimates for the adaptive core and adaptive full
test-scale versions and the non-adaptive core and non-adaptive full test-scale versions were
evaluated. Whereas the shortened and optimised nature of the non-adaptive core item bank
made little difference to person parameter estimates when compared to the non-adaptive full
form test-scale versions, the adaptive nature of the test-scales altered the way person
parameters were estimated across the adaptive core and adaptive full test-scale versions.
On average the adaptive core test-scale versions of the BTI were able to recover the non-
adaptive full and non-adaptive core test-scale person parameter estimates better than the
adaptive full test-scale versions could. However, there were some differences between the
person parameter estimates of the adaptive core and adaptive full test-scale versions when they
were directly compared. These results indicated that the optimised core item bank was
relatively equivalent to the non-adaptive un-optimised item banks. However, person parameter
estimates diverged somewhat between the adaptive core and adaptive full test-scale versions,
which may indicate that the optimised core item banks, when used in a computer adaptive
manner, estimate person parameters in a slightly different way.
An even greater difference can be observed between the person parameter estimates of
the adaptive full test version and those of the non-adaptive core test version. Finally,
divergence between the person parameter estimates of the adaptive core and adaptive full
test-scale versions was noted for every scale of the BTI. There
were thus definite differences between the person parameter estimates of the adaptive core and
adaptive full test-scale versions. This may be due to a number of factors; however, it is
postulated that the relative imprecision of the un-optimised items in the adaptive full test-scales
is to blame for these differences. This emphasises the importance of ensuring that item
banks are sufficiently evaluated and calibrated before computer adaptive testing frameworks
are applied to them.
What needed to be established in this study was whether the differences in person
parameter estimates between the adaptive core and adaptive full test-scale versions favoured
the adaptive core test-scales. When the optimised nature of the items of the adaptive core
test-scale versions is taken into account, the differences between the adaptive core and
adaptive full test-scale versions indicate that the adaptive core test-scales were able to
estimate person parameters more accurately than the adaptive full test-scale versions of the BTI. This
conclusion was reached because the adaptive full item bank possesses numerous items that
evidenced differential item functioning and poor fit to the Rasch rating scale model (refer to
Chapter 4). Conversely, the adaptive core test version utilised an item bank that evidenced
almost no differential item functioning or poor fit to the Rasch rating scale model. This
difference is not noticeable when the person parameters of the non-adaptive versions of the
test-scales are compared because the correlations between these versions indicated more or less
equivalent person parameter estimates. However, once the item banks become adaptive, person
parameter estimates start to diverge between the optimised and un-optimised item banks. There
is thus evidence to suggest that the adaptive nature, and the level of item optimisation, of the
test-scales were the primary reasons that different person parameter estimates are obtained with
the adaptive versions of the test-scales.
Because the adaptive testing process estimates person parameter values using fewer items,
and because the administration of subsequent items depends on the interim person parameter
estimates garnered from previously administered items, computer adaptive item banks
are acutely sensitive to items that do not strictly measure a single construct (unidimensionality)
and that demonstrate differential item functioning for different groups of test-takers.
The linear non-adaptive versions of the test-scales therefore demonstrated approximate
equivalence because every item in the item bank was administered. However, once the test
functions adaptively, the administration of a number of suboptimal items substantially alters
the final person parameter estimates.
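To make this dependency concrete, the adaptive cycle can be sketched in a few lines of code. The snippet below is a minimal illustrative simulation, not the software configuration used in this study: it assumes a dichotomous Rasch model for brevity (the BTI scales used the polytomous rating scale model, whose items carry more information, which is why this sketch uses a laxer default standard error criterion than the study's .32), selects each next item by maximum information at the interim estimate, and stops once the standard error criterion is satisfied.

```python
import math
import random

def rasch_p(theta, b):
    """Probability of endorsement under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_info(theta, b):
    """Fisher information of a Rasch item at trait level theta."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)

def simulate_cat(true_theta, difficulties, se_stop=0.5, rng=random):
    """Administer items adaptively until the standard error criterion is met."""
    remaining = list(range(len(difficulties)))
    administered, responses = [], []
    theta = 0.0  # start the interim estimate at the centre of the trait continuum
    while remaining:
        # 1. Select the unused item with maximum information at the interim theta.
        nxt = max(remaining, key=lambda i: item_info(theta, difficulties[i]))
        remaining.remove(nxt)
        administered.append(difficulties[nxt])
        # 2. Simulate the test-taker's response from the (known) generating theta.
        responses.append(1 if rng.random() < rasch_p(true_theta, difficulties[nxt]) else 0)
        # 3. Re-estimate theta by maximum likelihood (Newton-Raphson steps),
        #    bounded so that extreme response patterns cannot diverge.
        for _ in range(25):
            ps = [rasch_p(theta, b) for b in administered]
            gradient = sum(x - p for x, p in zip(responses, ps))
            information = sum(p * (1.0 - p) for p in ps)
            theta = max(-4.0, min(4.0, theta + gradient / max(information, 1e-6)))
        # 4. Stop once the standard error of the interim estimate is small enough.
        information = sum(item_info(theta, b) for b in administered)
        if information > 0 and 1.0 / math.sqrt(information) <= se_stop:
            break
    return theta, len(administered)
```

Step 3 is the crux of the argument above: because every interim estimate feeds the next item selection, a poorly calibrated item early in the sequence steers all subsequent selections off course.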
Let us explain this process using a navigation metaphor. If one wishes to reach a
specific final destination, one requires accurate current location data to plot a course to the final
destination. If the interim person parameter estimates are akin to the current location of the
navigator; the items are akin to the quality of the global positioning system used, which is
indicative of the current parameter location data; and the destination is the final standing on
the latent trait measured; then each current location estimate must be highly accurate and
precise for the navigator to reach his or her final destination within a pre-specified acceptable
error. In non-adaptive testing each item gives accurate current location data – in the case of
the optimised item bank – or sometimes less accurate data – in the case of the un-optimised item bank.
If two navigators set off to a final location where one only uses the most accurate current
location data garnered from the best global positioning systems (akin to the optimised item
bank) and the other uses somewhat accurate current location data from less accurate global
positioning systems (akin to the un-optimised item bank), then the two would likely reach
slightly different destinations. Although the navigator who used the less accurate current
location data would arrive at a slightly different destination than the navigator who used only
the accurate estimates, enough accurate current location data was available for the second
navigator to come close to the destination of the first. This is akin to the non-adaptive test-scales where enough
optimised items existed in the un-optimised item bank to estimate final person parameter
locations closely to those of the optimised item bank. In summary, the addition of these well-performing
items compensates for the poorer functioning items, resulting in a person parameter
estimate that is still relatively accurate and precise.
However, if the two navigators are given fewer current location estimates with which to
find their final destination (akin to adaptive testing), and each current location depends on the
accuracy of the previous location (item selection based on interim person location parameters),
then the two navigators may reach vastly different destinations. This is akin to adaptive testing
where each item allows an interim person parameter value to be estimated and the choice of
the next item is dependent on the person parameter estimates garnered by the previous item.
Therefore, “adaptive navigation” requires only the most accurate current location data (interim
person parameter estimates) to reach the final destination (final person parameter estimates).
In a similar way, using the un-optimised item bank in an adaptive manner results in less
accurate final person parameter estimates than those produced by the adaptive optimised
test-scale versions.
5.4.2. Adaptive core and adaptive full performance indices
The performance indices of the adaptive core in comparison to the adaptive full test-scale
versions (refer to Table 5.6) indicate most notably that the adaptive full test version (a) used
marginally fewer items to estimate person parameter values for the test-takers; and (b) was able
to estimate person parameters with a slightly lower standard error. It was postulated that these
results were directly associated with the number of items in the item bank and how the item
location parameters of these items were targeted to the person location parameters of the
sample of persons (refer to Chapter 4 and Figures 4.3.5.3a – e).
The adaptive full test versions had many more items in their respective item banks and
also had numerous items that targeted the extremes of the trait continuum when compared to
the adaptive core test-scale versions. This difference in the number of items available and their
respective item location estimates may have resulted in slightly improved performance indices
for the adaptive full test-scales when compared to the adaptive core test scales. However, the
full item banks were not optimised and thus the accuracy of the final person parameter
estimates of the adaptive full test-scales can be called into question.
From an adaptive perspective, the poorest functioning scale was Conscientiousness, as it
had the lowest item usage efficiency for the adaptive core test and also the highest standard
error for this test form. However, the standard error of person parameter estimation for
Conscientiousness was still acceptable at .32. Unfortunately, the adaptive core test version for
Conscientiousness too frequently administered all 23 items of the optimised item bank in order
to reach this standard error. This indicates that the core Conscientiousness item bank may
not have enough items available for optimal person parameter targeting.
Therefore, the adaptive core test version of Conscientiousness administers more items
than its adaptive full counterpart to reach the standard error criterion of person parameter
estimates. This problem persists for the other adaptive core test-scale versions of the BTI,
albeit less so. It is therefore recommended that more items that target the upper and lower
extremes of the trait distribution be written for each adaptive core item bank for the BTI so that
the item efficiency of each scale is improved as the standard error criterion is reached.
However, on average the item usage statistics and the standard errors of person parameter
estimates differed only marginally between the adaptive core and adaptive full test-scales.
In addition, the adaptive full test-scales may not estimate person parameters as accurately as
the adaptive core test-scales because they do not make use of the optimised item banks.
In this way the adaptive core test-scales were more accurate albeit slightly less precise at
estimating final person parameter values for test-takers.
5.4.3. Item usage statistics
Major efficiency gains were made by the adaptive core and adaptive full test-scales when
compared to the non-adaptive full form test-scale versions. The difference in item efficiency
between the adaptive full and adaptive core test-scales was relatively small (6%) with both
versions only administering about a third of the items compared to the non-adaptive full form
test-scales. The adaptive full test-scales did however demonstrate much greater item efficiency
than the adaptive core test-scales when they are compared to their respective non-adaptive
counterparts (non-adaptive full and non-adaptive core). This result can be misleading, however,
because the adaptive full test-scales have proportionally many more items in their respective
item banks than the adaptive core test versions. In an absolute sense, item efficiency was
therefore approximately equivalent between the adaptive core and adaptive full test-scales.
5.4.4. Implications for computer adaptive testing of personality
What has become evident from this study is that the computer adaptive test-scales of
the BTI, whether optimised or un-optimised, recover the person parameter estimates of the
non-adaptive full form test-scales in an equivalent manner. A caveat to this outcome is the
divergence between the person parameter estimates obtained when the adaptive core and
adaptive full test-scale versions are compared directly. These outcomes underline the importance
of scale and item optimisation before computer adaptive testing is implemented. It also
emphasises that computer adaptive test-scales can be as accurate and precise at estimating
person parameter locations on the latent trait as the non-adaptive full form test-scales can. The
only difference is the major gain in efficiency achieved by the computer adaptive versions of
the BTI scales.
These findings have a direct impact on the future of personality testing. As personality
inventories are widely used for personnel selection, development and placement, shortening
inventories to improve efficiency is a salient goal. Additionally, perceived test applicability
for the individual is also of major importance. More often than not, individual test-takers have
to respond to numerous items that have little bearing on them personally. Alternatively, to boost
the reliability of inventories, test-takers may need to respond to numerous items that have
approximately identical content in tests developed using only classical test theory. These
characteristics of personality inventories based on classical test theory reduce test-taker
motivation and the overall face validity of the instrument being administered. Since
it has been demonstrated that the computer adaptive scales of the BTI are approximately
equivalent to the non-adaptive linear test-scales, computer adaptive testing can be
implemented to save time and reduce the burden on the test-taker while still producing person
parameter estimates that are accurate, precise and reliable.
5.4.5. Recommendations for future research
5.4.5.1. The impact of test-mode differences
Although this study demonstrated metric equivalence between the non-computer
adaptive and computer adaptive versions of the BTI scales, further validity considerations
need to be taken into account in future research.
Firstly, test-mode differences need to be investigated in future research endeavours.
Although the psychometric properties of the simulated computer adaptive tests were derived
from real respondent data, and the equivalence of the computer adaptive tests to their
non-computer adaptive counterparts could thereby be evaluated, this research only provides a
framework for studying possible test-mode differences. Within this study each test-taker
completed the non-adaptive full form
of the test for each scale of the BTI. Therefore, the mode of testing was the same for all
respondents and the respondents were not exposed to the adaptive nature of the computer
adaptive test version in real-time. It is therefore recommended that future studies compare the
non-adaptive full form BTI test scores with test scores obtained through ‘real-time’ computer
adaptive testing with the same sample of test-takers. Test-mode differences may not be as great
as for ability-based inventories, but the effects of test mode must be determined for
computer-illiterate individuals as well as for populations that are unfamiliar with psychometric
testing in general. These boundary cases are especially important in the South African context
where large proportions of the population do not have access to computers and for which
psychometric testing is novel.
5.4.5.2. Overexposure of items and content balancing
Computer adaptive testing relies heavily on a pre-calibrated item bank from which the
test draws items for administration. Oftentimes certain items are over-administered while
other items are under-administered. Usually, items that have item location parameters at
the upper and lower extremes of the latent trait are underused because these items are not well
targeted to persons, whereas items closer to the centre of the person distribution, which are well
targeted to person trait levels, are overused. Item overexposure is not as disadvantageous as
in tests of ability, where familiarity with an item may invalidate it, but it does have
repercussions for items that are underused. Generally, underused items become difficult to
evaluate because little response data accumulates for them. Although item under- and
overexposure can easily be remedied by altering the exposure rates of items, or by generating
‘testlets’ in which different items can be used in parallel-form adaptive item banks, item
exposure balancing needs to be conducted for real-world computer adaptive testing.
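One simple exposure-control strategy, often called the 'randomesque' approach, can be sketched as follows. This is a generic illustration with hypothetical parameter names, not a feature of the simulation reported here: instead of always administering the single most informative item, the algorithm draws at random from the k most informative unused items, which flattens exposure rates at a small cost in information.

```python
import math
import random

def randomesque_select(theta, difficulties, used, k=5, rng=random):
    """Select the next item at random from the k most informative unused items.

    difficulties: Rasch item location parameters for the bank.
    used: set of item indices already administered to this test-taker.
    """
    def info(i):
        # Fisher information of a dichotomous Rasch item at trait level theta.
        p = 1.0 / (1.0 + math.exp(-(theta - difficulties[i])))
        return p * (1.0 - p)
    candidates = sorted(
        (i for i in range(len(difficulties)) if i not in used),
        key=info,
        reverse=True,
    )
    return rng.choice(candidates[:k])
```

With k = 1 this reduces to ordinary maximum-information selection; larger values of k trade a little measurement precision for more even item usage across test-takers.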
Content balancing, on the other hand, refers to the degree of representation that items
from different constructs or sub-constructs enjoy during computer adaptive
administration (Lu et al., 2010). Content balancing is an important consideration when
developing and using computer adaptive tests. Firstly, a certain number of items need to be
used that measure different domains so that person measures on these domains can be
accurately estimated. It is of no use to have 20 items that measure only Conscientiousness and
no items that measure Extraversion in a personality test. Content balancing thus needs to be
done on the scale level of personality tests. However, since personality tests measure distinct
constructs at the scale level, computer adaptive tests of personality can ideally be independently
administered per scale to ensure content coverage of the five personality factors.
However, where content balancing becomes critical is at the subscale level. At this level
enough items have to be administered from a number of subdomains to accurately and precisely
measure and report at the total score level. For example, the Extraversion scale may report on
an Extraversion total score, but is this total score calculated using all the subdomains of
Extraversion (i.e., Ascendance, Liveliness, Positive-Affectivity, Gregariousness, and
Excitement Seeking) or only one or two? Furthermore, different test-takers may be
administered very different items from different subdomains, making comparisons between
the person parameter estimates of test-takers difficult.
Since general factor dominance was demonstrated for the BTI scales (refer to Chapter
3) and only the person parameters at the total score level (scale level) were compared, content
balancing was not used in this study. Additionally, not enough items were available for each
facet of the BTI to make content balancing feasible. However, content balancing is an
important feature that should be included in real-time computer adaptive testing, ensuring that
enough items are available for administration at the subscale level. Three content balancing
techniques are popular in the literature, namely the Constrained Model (Kingsbury & Zara,
1989), the Modified Multinomial Model (Chen, Ankenmann, & Spray, 1999) and the Modified
Constrained Model (Leung, Chang, & Hau, 2000). Generally, the use of the Modified
Multinomial Model and the Modified Constrained Model are recommended for modern content
balancing in computer adaptive attitudinal measures (Leung, Chang, & Hau, 2003).
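The core logic of the Constrained Model can be sketched briefly. The subdomain names below are hypothetical, borrowed from the Extraversion example above: before each item is selected, the procedure identifies the content area whose observed proportion of administered items lags furthest behind its target proportion, and the next item is then drawn from that area.

```python
def next_content_area(targets, counts):
    """Constrained content balancing in the spirit of Kingsbury and Zara (1989):
    return the content area whose administered proportion is furthest below target.

    targets: mapping of area -> desired proportion of the test (sums to 1.0).
    counts:  mapping of area -> number of items administered so far.
    """
    total = sum(counts.values())
    def deficit(area):
        observed = counts.get(area, 0) / total if total else 0.0
        return targets[area] - observed
    return max(targets, key=deficit)
```

Applied repeatedly during administration, this rule keeps the running proportions of subdomain items close to their targets without requiring any look-ahead over the remaining item bank.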
5.4.6. Conclusion and final comments
The results of this study therefore indicate that the adaptive core test-scale versions were
more accurate when estimating person parameters than the adaptive full test-scale versions.
Also, the adaptive core test versions, although using on average more items to estimate person
parameters, remained highly efficient when compared to the non-adaptive full-form test-scale
versions.
Conversely, the adaptive full test-scales were more precise when estimating final person
parameter location values and used on average fewer items to do so than the adaptive core test-
scales. In addition, item efficiency was much higher for the adaptive full test-scale versions
than the adaptive core test-scale versions when compared to their respective non-adaptive test
forms.
When these aspects are taken into account it should be argued that the more accurate, not
precise or necessarily efficient, test version be used in practice. This is because the purpose of
psychometric measurement is to accurately estimate test-takers’ standing on the latent construct
of interest, and this tenet should not be compromised. There are also other factors
that motivate the use of the adaptive core test version over the adaptive full test version. Firstly,
as stated before the adaptive core tests are more accurate at estimating person parameters than
the adaptive full test-scales. Secondly, the difference between the precision of measurement
for the adaptive core and adaptive full test-scales was marginal. Thirdly, the adaptive core
test-scales used on average only two more items than the adaptive full test-scales, which is a small,
if not negligible, difference. And finally, the adaptive core test-scales remained highly efficient
when compared to the non-adaptive full form test-scales.
It is also possible to write more items for the adaptive core item banks to increase
precision of measurement and bolster item usage efficiency. If these aspects are taken into
account, the adaptive core test versions can be considered superior to both the non-adaptive
full form test versions and adaptive full test versions for the scales of the BTI. Considering
these results, it can be confidently argued that the time for computer adaptive testing of
personality has truly come.
CHAPTER 6: DISCUSSION AND CONCLUSION
“Moreover, computerized psychological assessment introduces the possibility of more
advanced test administration procedures, which were impossible to implement in
conventional paper-and-pencil testing.” – Hol et al. (2008, p.12).
6.1. Introduction
In this chapter the discussions of the studies conducted in Chapters 3, 4, and 5 are
integrated. The background of the three studies is briefly discussed and the objectives of the
studies are revisited. The unique contributions of the studies and the findings for each objective
are then provided. The chapter concludes by making recommendations for practice, discussing
the limitations of the studies, and making suggestions for future research.
6.1.1. Aims and objectives of the three studies
The superordinate aim of this research study was to prepare and evaluate a personality
inventory for computer adaptive test application. Generally, this objective aimed to address the
apparent lack of progress made in the field of computer adaptive personality testing (refer to
Chapters 1 and 2 for a more detailed discussion). Two initial objectives were therefore important,
namely (1) evaluating the psychometric properties of a personality inventory to determine
whether computer adaptive test application was feasible and, (2) testing how a personality
inventory would function as a computer adaptive test.
To meet these objectives this research study was broken into three independent studies.
The first study (Chapter 3) investigated the dimensionality of the BTI, a personality inventory
developed in South Africa that is based on the five-factor taxonomy, whereas the second study
evaluated the psychometric properties of the BTI using item response theory to prepare the
scales of the BTI for computer adaptive testing (Chapter 4). The final study (Chapter 5)
simulated the BTI as a computer adaptive test within a computer adaptive testing framework
to determine whether it was comparable to its own non-computer adaptive version, which acted
as a benchmark for the computer adaptive test-scale versions. The steps taken in these three
studies are required when evaluating a test for computer adaptive test applications. The
rationale and objectives of each of these studies are discussed in the next section.
6.1.2. Study 1 objectives: The dimensionality of the BTI scales
Because computer adaptive tests select items measuring specific latent constructs to
administer to test-takers, and because these items are used to estimate the trait level of test-
takers, each set of items (measurement scales) used in computer adaptive testing must
demonstrate the measurement of a dominant single construct. This is referred to as
unidimensional measurement, which is a prerequisite for computer adaptive testing and testing
in general. If the items used in an item bank for computer adaptive testing do not primarily measure
a single ‘known’ construct, then item and person parameters derived from computer adaptive
testing may be inaccurate and imprecise. Therefore, the main objective of the first study was
to investigate the dimensionality of the BTI on the factor/scale and facet/subscale levels.
Ideally, each factor/scale of the BTI should demonstrate unidimensionality as the computer
adaptive test would measure at the factor/scale level and not the facet/subdimensional level.
The main objective of this study was therefore to determine whether each of the factors/scales
of the BTI demonstrated sufficient evidence for unidimensionality to justify fitting the scales
to an item response theory model to prepare the scales for computer adaptive testing
applications.
6.1.3. Study 2 objectives: Fitting the BTI scales to the Rasch model: Evaluation and
selection of a core item bank for computer adaptive testing
Computer adaptive tests ultimately estimate person parameters on the construct of
interest using an item response theory framework. It is therefore necessary to investigate how
well the items of a test fit an item response theory model. Poor fit to an item response theory model
would not bode well for computer adaptive testing and may compromise the precision and accuracy
of person trait estimation within a computer adaptive framework (refer to Chapter 4 for a more
in-depth discussion). The objective of the second study was therefore to evaluate how well the
items of the factors/scales of the BTI fit the Rasch rating scale model. A secondary objective
of this study was to generate item difficulty parameters to use within a computer adaptive
testing framework and remove items that may not function effectively within this framework.
6.1.4. Study 3 objectives: An evaluation of the simulated Basic Traits Inventory
computer adaptive test
In essence, the first two studies evaluated and prepared the scales of the BTI for computer
adaptive test application. Study 3 (Chapter 5) simulated the scales of the BTI as
computer adaptive tests to determine how equivalent the person parameter estimates of the
computer adaptive tests were to the person parameters estimated using the non-computer
adaptive versions of the test-scales. Ideally, person parameter estimates should be
approximately equivalent across the non-adaptive and adaptive versions of the BTI scales for
practical computer adaptive testing to be considered. Other important considerations should
also be taken into account such as the item efficiency of the computer adaptive test and the
precision with which the computer adaptive test estimates person parameters while making
item efficiency gains. Therefore, the main objective of this study was to evaluate the
performance of the computer adaptive versions of the scales of the BTI relative to their non-
adaptive full form counterparts.
With these objectives in mind, the results and outcomes of each of these studies are
discussed in the next three sections. First, an overview of the findings for each study is
presented. The limitations of the three studies are then discussed and some recommendations
for future research are made.
6.2. Discussion of Results for the Three Studies
As mentioned in the previous section, the main objectives of study 1 and study 2 were to
investigate the psychometric properties of the BTI to evaluate whether the BTI could be used
as a computer adaptive test. The third study evaluated how well the scales of the BTI performed
within a computer adaptive framework. Consequently, the results of studies 1 and 2 are first
discussed, followed by a discussion of the results of study 3.
6.2.1. Study 1 results: The dimensionality of the BTI scales
In the first study, three confirmatory factor analytic models were evaluated for each scale
of the BTI, namely: (1) a general factor model (where a common factor accounts for the
majority of the explained variance); (2) a group factor model (where sub-factors account for
the majority of the explained variance); and (3) a bifactor model (where the general factor
accounts for the largest proportion of the explained variance while sub-factors simultaneously
account for a smaller proportion of unique variance). This was done in order to determine whether a single
unidimensional construct (i.e., the general factor) or a number of constructs (i.e., the group
factors) accounted for the majority of the explained variance for each scale of the BTI. This is
important because if the general factor model is ‘best fitting’ then interpretation of a total score
at the scale level of the BTI is justified. However, if the group factor model demonstrates the
best fit, then the interpretation of a total scale score is not justified. If the latter were true, then
computer adaptive test-scales would have to be developed for each facet or subscale. Total
score interpretation at the scale level, on the other hand, would be required if a computer
adaptive test were developed to measure a single unidimensional construct at the scale level
(i.e., Extraversion, Neuroticism, Conscientiousness, Openness, and Agreeableness), as was
the aim of this study.
The results indicated the presence of strong general factors for each of the five BTI scales.
Only Extraversion demonstrated some limited evidence of multidimensionality, with the
Excitement Seeking sub-factor explaining some unique variance beyond the general
Extraversion factor. Although each scale demonstrated general factor dominance, in the sense
that a general factor accounted for the largest proportion of explained variance for each scale
of the BTI, each sub-factor also explained unique variance not attributed to the general factor.
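General factor dominance of this kind is often summarised with the explained common variance (ECV) index. The dominance criteria reported in this study may have been expressed differently, so the formula below is offered only as one standard way of quantifying the split between general and group factor variance in a bifactor solution:

```latex
\mathrm{ECV} \;=\;
\frac{\sum_{i} \lambda_{i,\text{gen}}^{2}}
     {\sum_{i} \lambda_{i,\text{gen}}^{2} \;+\; \sum_{g} \sum_{i \in g} \lambda_{i,g}^{2}}
```

where $\lambda_{i,\text{gen}}$ is item $i$'s standardised loading on the general factor and $\lambda_{i,g}$ its loading on group factor $g$; values approaching 1 indicate that the general factor accounts for nearly all of the common variance.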
The results therefore suggested that a common factor model was not strictly applicable
to the scales of the BTI. In other words, each sub-factor also explained some unique variance
beyond the general factor. Similarly, the group factor model also did not fit sufficiently well
to justify pure interpretation at the facet level. This result indicated that computer adaptive
testing at the subscale level would not be feasible or psychometrically plausible. Results thus
suggested that a bifactor model was the most applicable for each scale of the BTI. A bifactor
model posits that each scale of a test has a common factor (general factor) that accounts for the
largest proportion of the explained variance, as well as a number of group factors that account
for a smaller proportion of the unique variance beyond the common factor. This was not
unexpected, however, as personality constructs are not usually orthogonal, and the BTI is a
hierarchical personality inventory with well-defined factors (scales) and sub-factors (referred
to as facets). It was thus hypothesised that the bifactor model would be the best fitting model
because this model incorporates both a general factor and a number of group factors.
However, to justify the use of a computer adaptive test format for each scale of the BTI, the
bifactor model should demonstrate that the general factor still accounts for the largest
proportion of explained common variance, whereas the group factors (sub-factors) account for
a markedly smaller proportion of the unique variance. This was indeed the case. Even though
each scale demonstrated evidence of multidimensionality, in the sense that multiple factors
were measured (i.e., the general/common factor and four or five group factors), the use of
total scores for each factor remained justified.
Technically, however, the bifactor model also permits interpretation at the sub-factor
level, but the degree of general factor dominance, as well as the lack of sufficient items
at the subscale level for computer adaptive test applications rendered this option wholly
unfeasible. This was primarily because the general/common factor still accounted for most of
the explained variance in each scale of the BTI, while a smaller proportion of the unique
variance was accounted for by the sub-factors.
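The variance decomposition behind this argument can be sketched numerically. The example below computes the explained common variance (ECV), the share of common variance attributable to the general factor, for a hypothetical bifactor solution; the loading values are invented for illustration and are not BTI estimates.

```python
# Illustrative standardized loadings for an eight-item scale in a
# bifactor solution: each item loads on the general factor and on one
# group factor. All values are invented, not BTI estimates.
general = [0.70, 0.65, 0.72, 0.68, 0.66, 0.71, 0.60, 0.64]
group   = [0.30, 0.35, 0.28, 0.32, 0.40, 0.25, 0.45, 0.38]

def explained_common_variance(general_loadings, group_loadings):
    """ECV: share of the common variance attributable to the general factor."""
    g = sum(l ** 2 for l in general_loadings)
    s = sum(l ** 2 for l in group_loadings)
    return g / (g + s)

ecv = explained_common_variance(general, group)   # approx. 0.79 here
```

An ECV well above .50, as in this made-up example, is the kind of general factor dominance that supports scoring and adapting at the scale rather than the facet level.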
The Excitement Seeking subscale of the BTI, although accounting for a larger proportion
of the unique variance beyond the general Extraversion factor, still demonstrated relative
general factor dominance. It was thus decided that this scale would remain ‘intact’ when
evaluating the items of the BTI using a one-dimensional item response theory model (i.e., the
Rasch rating scale model) in study 2. If the Extraversion scale had shown substantially poor
fit for most items under this model, Excitement Seeking would have been dropped from the
computer adaptive test adaptation. However, after the removal of a limited number of poor
fitting items from the Extraversion scale, good fit to the Rasch model was achieved, and
Excitement Seeking was thus retained for computer adaptive test application. Scale fit to the Rasch
rating scale model is elaborated on in the next section.
6.2.2. Study 2 results: Fitting the BTI scales to the Rasch model
Once the dimensionality of the BTI scales was judged ‘sufficiently unidimensional’, the
items of the five scales of the BTI were fit to the one-dimensional Rasch rating scale model.
The objective of fitting the items of the scales of the BTI to the Rasch rating scale model was
twofold. Firstly, it was determined whether the fit of the data to the Rasch rating scale model
was sufficient to justify computer adaptive testing on the scale level. Secondly, item and person
parameters were developed for use within a computer adaptive testing framework for each of
the five scales of the BTI.
With regard to the first objective, each scale of the BTI was evaluated for fit using the
infit and outfit mean square statistics, person and item separation indices and reliability, rating
scale performance indices, and DIF across gender and ethnicity.
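The first two of these criteria can be illustrated with a small numerical sketch. The infit and outfit mean square statistics are computed below for the dichotomous case for brevity (the thesis analyses used the polytomous rating scale model); the observed responses, model expectations, and variances are all invented.

```python
# Hypothetical observed responses, model-expected scores, and model
# variances for one dichotomous item (the thesis used the polytomous
# rating scale model; the dichotomous case is shown for brevity).
observed = [1, 0, 1, 1, 0, 1]
expected = [0.8, 0.3, 0.6, 0.9, 0.2, 0.5]
variance = [p * (1 - p) for p in expected]   # Bernoulli variance under the model

def outfit_mnsq(obs, exp, var):
    """Outfit: unweighted mean of squared standardized residuals,
    sensitive to unexpected responses far from a person's level."""
    return sum(((o - e) ** 2) / v for o, e, v in zip(obs, exp, var)) / len(obs)

def infit_mnsq(obs, exp, var):
    """Infit: information-weighted mean square, sensitive to misfit
    on well-targeted items. Values near 1.0 indicate good fit."""
    return sum((o - e) ** 2 for o, e in zip(obs, exp)) / sum(var)
```

Mean squares near 1.0 indicate that responses vary about as much as the model predicts; items falling well outside the conventional acceptance range are candidates for removal.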
The majority of the items of the BTI demonstrated good fit to the Rasch rating scale
model, although some items did not fit the model well enough to justify their inclusion. Person
and item separation indices were, for the most part, satisfactory, with item and person
reliability indicating similarly satisfactory results. Some items demonstrated uniform or
non-uniform DIF by gender or ethnicity, and some demonstrated DIF for gender and ethnicity
jointly. Poor fitting items and items that demonstrated DIF were removed, and the
revised item pool for each scale demonstrated satisfactory fit to the Rasch rating scale model.
In total, the items of the scales of the BTI were reduced to 23 items for Extraversion,
Neuroticism, Conscientiousness, and Openness to Experience, and 29 items for Agreeableness.
Item and person parameters were thus generated for use in a computer adaptive testing
framework based on the reduced scales of the BTI. However, to demonstrate that the reduced
scales still measure the same constructs and estimate person parameters equivalently
to the full scales, cross-plots of the person parameter estimates of the full scales (i.e., the
scales where no items were removed) against those of the reduced scales (i.e., the scales
where items were removed due to DIF and misfit) were generated. Results indicated that there
was sufficient equivalence between the person parameter estimates of the full test and the
reduced test to justify the use of the reduced scales and their associated person and item
parameters for computer adaptive testing applications.
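The equivalence of the two sets of person parameter estimates can be summarised numerically as well as graphically. A minimal sketch, with invented theta estimates for five hypothetical test-takers, computes their Pearson correlation:

```python
# Hypothetical person-parameter (theta) estimates for five test-takers
# from the full and reduced scales; values are invented for illustration.
theta_full    = [-1.2, -0.4, 0.1, 0.8, 1.5]
theta_reduced = [-1.1, -0.5, 0.2, 0.7, 1.6]

def pearson_r(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson_r(theta_full, theta_reduced)   # close to 1 -> near-equivalent orderings
```

A correlation close to unity, as in this made-up example, indicates that the reduced scales rank test-takers essentially as the full scales do.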
6.2.3. Study 3 results: An evaluation of the computer adaptive BTI
The final study in this research evaluated the performance of a computer adaptive
test of personality by contrasting it with the full non-adaptive version of the test. Because studies
1 and 2 indicated that the items of the scales of the BTI meet the psychometric requirements
for computer adaptive test applications, the reduced scales of the BTI (scales that have had
poor fitting items and items that demonstrate DIF removed) were used within the computer
adaptive testing framework. However, it was also determined whether the reduced BTI scales
estimated person parameters differently (i.e., person parameter estimates and item efficiency)
when compared to the adaptive version of the full non-reduced scales. This was done in order
to contrast the two adaptive forms of the test with one another.
To reduce the impact of different computer adaptive testing procedures all item selection
rules, person parameter estimation functions, and stopping rules were held constant for each
scale and test version of the BTI. Results indicated that the computer adaptive versions of the
scales (both the full adaptive and core adaptive scales) outperformed the non-adaptive full and
core scales. Although fewer items were administered with a robust standard error for both the
adaptive full and adaptive core scales of the BTI, the greatest gains in item administration
efficiency were made between the non-adaptive full and adaptive full scales of the BTI.
Although the adaptive core scales still made item gains when compared to their non-adaptive
core counterparts, these gains were not as dramatic as those made between the adaptive full and
non-adaptive full scales, simply because of the reduced item pool in both cases.
Generally, the adaptive versions of the BTI scales made efficiency gains of 50% or more for
both the adaptive full and adaptive core versions of the test when compared to their non-
adaptive counterparts. This was a major reduction in item administration while each scale still
maintained a robust standard error of person parameter estimation of ≤ .30. Additionally, the
adaptive core and adaptive full versions of the scales of the BTI were able to recover the person
parameter estimates of the non-adaptive full versions of the scales. These results indicate that
the computer adaptive versions of the scales of the BTI functioned effectively within a
computer adaptive testing framework.
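The adaptive procedure described above, maximum-information item selection, person parameter estimation after each response, and a standard-error stopping rule of ≤ .30, can be sketched in simplified form. The sketch below uses the Rasch rating scale model with five categories and EAP estimation; the item locations, thresholds, and the simulee's responses are all invented, and this is not the software or parameterisation used in the thesis.

```python
import math

# Hedged sketch of one adaptive administration under the Rasch rating
# scale model with five response categories. Item locations, thresholds,
# and the simulee's responses are invented for illustration.
THRESHOLDS = [-1.5, -0.5, 0.5, 1.5]             # shared category thresholds (tau)
ITEMS = [i * 0.25 - 2.0 for i in range(17)]     # item locations from -2 to +2
GRID = [i * 0.1 - 4.0 for i in range(81)]       # theta grid for EAP estimation

def category_probs(theta, b):
    """Rating scale model probabilities for the five categories 0..4."""
    logits = [0.0]
    for tau in THRESHOLDS:
        logits.append(logits[-1] + (theta - b - tau))
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def item_information(theta, b):
    """Fisher information of one item: variance of the item score at theta."""
    probs = category_probs(theta, b)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))

def eap(responses, locations):
    """EAP theta estimate and posterior SD under a N(0, 1) prior."""
    weights = []
    for t in GRID:
        w = math.exp(-t * t / 2.0)
        for x, b in zip(responses, locations):
            w *= category_probs(t, b)[x]
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(simulee_responses, se_stop=0.30):
    """Select the most informative remaining item until the standard
    error of the theta estimate reaches se_stop or the pool runs out."""
    theta, se, used, answers = 0.0, float("inf"), [], []
    while se > se_stop and len(used) < len(ITEMS):
        remaining = [i for i in range(len(ITEMS)) if i not in used]
        nxt = max(remaining, key=lambda i: item_information(theta, ITEMS[i]))
        used.append(nxt)
        answers.append(simulee_responses[nxt])
        theta, se = eap(answers, [ITEMS[i] for i in used])
    return theta, se, len(used)

# Real-data style simulation: reuse a fixed response string (all middle
# categories here) instead of generating responses from a model.
theta_hat, se_final, n_administered = run_cat([2] * 17)
```

Because well-targeted items carry the most information, the adaptive loop typically reaches the stopping criterion before the pool is exhausted, which is the source of the item-administration savings reported above.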
6.3. Limitations and suggestions for future research
Although the three studies indicated that the BTI can be used in a computer adaptive
testing framework while maintaining satisfactory precision of measurement, several opportunities
for future research can be earmarked. Firstly, the item pools used for computer adaptive testing
tend to be large. These pools contain numerous items to ensure (1) test security, where
‘testlets’ or alternate item pools are used in each computer adaptive testing process for different
groups of test-takers, and (2) coverage of the continuum of the latent constructs from
very low to very high levels of the trait. In classical testing, a single test with
a fixed number of items is considered sufficient. However, in computer adaptive testing item
pools used in the adaptive process should be much larger than the scales used in classical test
theory. This is because the computer adaptive test may rely very heavily on a single cluster of
best performing items and exclude a number of functional items from administration (refer to
the discussion on item under and overexposure in Chapter 5).
Ensuring that test-takers receive relevant and varied test items is important
for face validity and test security. Methodologically, the psychometrician would like to
generate data on most of the items in the item bank so that their psychometric properties can
be further investigated or that test items can be revised over time. This can be achieved by
making use of content balancing procedures (refer to the discussion on content balancing in
Chapter 5). In this study only a fixed set of items from the original fixed form test were
available. Additionally, the computer adaptive test-scales reported on the scale level only.
Because of these considerations content balancing was not included in this study.
However, in future studies reporting on the practical testing of the computer
adaptive scales, content balancing will need to be applied. It was also recommended that fixed
branching mechanisms – where a certain number of items are administered to all test-takers in
a non-adaptive manner, after which the test adaptively selects from the items not used – be
implemented. In this way data on items can be obtained if the adaptive test does not administer
a certain cluster of items sufficiently. If the BTI scales are to be used in a computer adaptive
manner for practical testing, then more items will need to be written to bolster the item pool
over time. With enough items ‘testlet item pools’ can be created where the larger item pool is
split into a number of smaller independent item pools in order to maximise item exposure.
These item pools can then be administered in an alternate manner thus improving test-security
and ensuring that enough data is obtained for each item within each item pool.
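The testlet idea above amounts to partitioning one large pool into smaller alternate pools that are rotated across testing sessions. A minimal sketch, with a hypothetical 60-item pool and three alternate pools (both numbers invented):

```python
import random

# Sketch: split a hypothetical 60-item pool into three alternate
# 'testlet' pools that are rotated across sessions to spread exposure.
random.seed(0)
pool = list(range(60))          # placeholder item identifiers
random.shuffle(pool)
testlets = [sorted(pool[i::3]) for i in range(3)]

def pool_for_session(session_number):
    """Rotate through the alternate pools from one session to the next."""
    return testlets[session_number % len(testlets)]
```

Rotating disjoint pools in this way improves test security and ensures that response data accumulate for every item, not only for the cluster the adaptive algorithm favours.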
It was further recommended that future research include the creation of items that
target trait levels at the extreme higher and lower ends of the trait-continuum for each BTI
scale. Although the endorsability of items was well-targeted to the trait level of the sample
under investigation, smaller groups of individuals at the extremes of the distribution did not
have items targeted as well to their particular trait levels. Generating items that target
individuals with extreme trait levels will further improve the precision of measurement and
may also further reduce the number of items that need to be administered during assessment.
It was also recommended that future research concentrate on the test-mode effects
of real-time computer adaptive testing. Although computer adaptive test simulation
approximates the computer adaptive process, the real influence of the mode of testing on the
test-taker needs to be determined. These testing-mode effects can have implications for test-
takers who are not familiar with this type of testing and may or may not affect test validity.
6.4. Implications for practice
The major outcome of this study was to demonstrate that personality can be successfully
measured using computer adaptive testing procedures. The conservative nature of
psychometric testing has resulted in a multitude of tests using classical test theory techniques,
but with few making progress into the computer adaptive testing domain. It is important to
emphasise, however, that classical test theory techniques of test construction are just as relevant
today as they have been in the past, but innovation and progress are needed in the psychometric
testing domain to ensure that testing remains relevant and up-to-date.
Computer adaptive tests reduce testing time; increase the relevance of items to test-takers
because the item difficulty is matched to person ability, or traitedness; increase test security;
and allow for quicker test-result reporting. In the case of the BTI, up to 50% of the items of
each scale were not administered while the scales approximated personality traits with accuracy
and precision. The reduction in testing time and the practical and cost benefits of this
improvement in efficiency already justifies the research and development of computer adaptive
tests of personality. With the continued popularity of personality testing and the burden that
tests with many items place on the test-taker, computer adaptive testing can greatly improve
the testing experience for the test-taker and the test-administrator. However, test-developers
need to be aware of the increased sophistication, capital cost and complexity of computer
adaptive testing systems.
Another implication for practice is the improved test-security associated with the item
pools of computer adaptive tests. No two computer adaptive tests are the same as each test-
taker’s relative ‘traitedness’ informs item selection and administration. In this way it becomes
difficult for test-takers to fake on a computer adaptive test or to learn which items are associated
with specific constructs. Test security is further improved in the sense that ‘testlets’, or
multiple alternate item pools, can be used as well.
Finally, the greatest strength of using computer adaptive testing in the personality domain
is that item invariance is a prerequisite for computer adaptive testing. In this way different sets
of items can be used to rank test-takers on the same trait continuum. This greatly improves the
flexibility of the testing process for test administrators.
6.5. Conclusion
The main objective of this study was to prepare and evaluate a test of personality for
computer adaptive testing applications. The results demonstrated that a computer adaptive test
of personality can function as effectively as a non-computer adaptive test of personality while
improving the relevance and efficiency of the test. In conclusion, the computer adaptive testing
of personality is an area of great promise with numerous applications in practice.
References
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43,
561-573. doi: 10.1007/BF02293814
Andrich, D., & Hagquist, C. (2014). Real and artificial differential item functioning in
polytomous items. Educational and Psychological Measurement, 1-23. doi:
10.1177/0013164414534258
Apple, M. T., & Neff, P. (2012). Using Rasch measurement to validate the Big Five Factor
Marker Questionnaire for a Japanese university population. Journal of Applied
Measurement, 18, 276-296.
Attali, Y., & Powers, D. (2008). Effect of immediate feedback and revision on psychometric
properties of open-ended GRE subject test items. Princeton, NJ: Educational Testing
Service.
Babcock, B., & Weiss, D. J. (2009). Termination criteria in computerized adaptive tests:
Variable-length CATs are not biased. In, D. J. Weiss (Ed.), Proceedings of the 2009
GMAC conference on computerized adaptive testing. Retrieved from
http://www.psych.umn.edu/psylabs/CATCentral/.
Baghaei, P. (2008). Local dependency and Rasch measures. Rasch Measurement
Transactions, 21, 1105-1106.
Ben-Porath, Y. S., & Butcher, J. N. (1986). Computers in personality assessment: A
brief past, an ebullient present, and an expanding future. Computers in Human
Behavior, 2, 167-182. doi: 10.1016/0747-5632(86)90001-4
Betz, N. E., & Turner, B. M. (2011). Using item response theory and adaptive testing in
online career assessment. Journal of Career Assessment, 19, 274-286. doi:
10.1177/1069072710395534
Betz, N. E., & Weiss, D. J. (1976). Psychological effects of immediate knowledge of
results and adaptive ability testing: Research report 76-4. Arlington, VA: Office
of Naval Research.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological
Bulletin, 107, 238-246. doi: 10.1037/0033-2909.107.2.238
Bjorner, J. B., Kosinski, M., & Ware, J. E. (2005). Computerized adaptive testing and
item banking. In, P. Fayers, and R. Hays (Eds.), Assessing quality of life in clinical
trials (2nd ed.). Los Angeles, CA: Oxford University Press.
Bland, M. J., & Altman, D. G. (1995). Multiple significance tests: The Bonferroni
method. British Medical Journal, 310, 170. doi: 10.1136/bmj.310.6973.170
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a
microcomputer environment. Applied Psychological Measurement, 6, 431-444.
doi: 10.1177/014662168200600405
Bollen, K. A. (1989). Structural equations with latent variables. New York: John Wiley
& Sons.
Bond, T. G. (2003). Validity and assessment: a Rasch measurement perspective.
Metodologia de las Ciencias del Comportamiento, 5, 179-194.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in
the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Boone, W. J., Staver, J. R., & Yale, M. S. (2014). Rasch analysis in the human sciences.
Dordrecht, Netherlands: Springer.
Browne, M. W. & Cudeck, R. (1993). Alternative ways of assessing model fit. In, K. A.
Bollen and J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Beverly
Hills, CA: Sage
Brown, J. M., & Weiss, D. J. (1977). An adaptive testing strategy for achievement test
batteries: Research report number 77-6. Minneapolis, MN: University of Minnesota.
Carter, J. E., & Wilkinson, L. (1984). A latent trait analysis of the MMPI. Multivariate
Behavioral Research, 19, 385-407. doi: 10.1207/s15327906mbr1904_2
Chen, S., Ankenmann, R. D., & Spray, J. A. (1999, April). The relationship between
item exposure rate and test overlap rate in computerized adaptive testing. Paper
presented at the annual meeting of the National Council on Measurement in
Education. Montreal, Canada.
Chen, F. F., West, S., & Sousa, K. (2006). A comparison of bifactor and second-order models
of quality of life. Multivariate Behavioral Research, 41, 189–225.
doi:10.1207/s15327906mbr4102_5
Chien, T. W., Wu, H. M., Wang, W. C., Castillo, R. V., & Chou, W. (2009). Reduction
in patient burdens with graphical computerized adaptive testing on the ADL scale:
Tool development and simulation. Health and Quality of Life Outcomes, 7, 39-44.
doi: 10.1186/1477-7525-7-39
Choi, S. W. (2009). Firestar: Computerized adaptive testing simulation program for
polytomous item response theory models. Applied Psychological Measurement,
33, 644-645. doi: 10.1177/0146621608329892
Choi, S. W., Grady, M. W., & Dodd, B. G. (2010a). A new stopping rule for computer
adaptive testing. Educational and Psychological Measurement, 70, 1-17. doi:
10.1177/0013164410387338.
Choi, S. W., Reise, S. P., Pilkonis, P. A., Hays, R. D., & Cella, D. (2010b). Efficiency
of computer adaptive short forms compared to full-length measures of depressive
symptoms. Quality of Life Research, 19, 125-136. doi: 10.1007/s11136-009-9560-
5
Choi, S. W., & Swartz, R. J. (2009). Comparison of CAT item selection criteria for
polytomous items. Applied Psychological Measurement, 33, 419-440. doi:
10.1177/0146621608327801.
Claassen, N.C.W., Meyer, H. M., & van Tonder, M. (1992). Manual for the General
Scholastic Aptitude Test (GSAT) Senior: Computer Adaptive Test. Pretoria, South
Africa: Human Sciences Research Council.
Clifton, S. (2014). Dimensionality of the Neuroticism Basic Traits Inventory scale.
Unpublished Masters Dissertation. Johannesburg, South Africa: University of
Johannesburg.
Costa, P. T., & McCrae, R. R. (1995). Domains and facets: Hierarchical personality
assessment using the revised NEO Personality Inventory. Journal of Personality
Assessment, 64, 21-50. doi: 10.1207/s15327752jpa6401_2
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory.
Orlando, FL: Holt, Rinehart and Wilson.
Davis, L. L., & Dodd, B. G. (2005). Strategies for controlling item exposure in computerized
adaptive testing with the partial credit model. Pearson Research Reports.
de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: The
Guilford Press.
de Beer, M. (2005). Development of the Learning Potential Computerised Adaptive
Test (LPCAT). South African Journal of Psychology, 35(4), 717-747. doi:
10.1177/00812463050350040
de Bruin, G. P. (2014). Measurement invariance of the Basic Traits Inventory in South
African job-applicants: A cross-cultural test of the Five-Factor model of personality.
Article submitted for review to Personality and Individual Differences.
de Bruin, G. P., & Rudnick, H. (2007). Examining the cheats: The role of conscientiousness
and excitement seeking in academic dishonesty. South African Journal of Psychology,
37, 153-164.
de Klerk, G. (2008). Classical test theory (CTT). In M. Born, C.D. Foxcroft & R. Butter
(Eds.), Readings in Testing and Assessment. International Test Commission.
DiBattista, D., & Gosse, L. (2006). Test anxiety and the Immediate Feedback
Assessment Technique. Journal of Experimental Education, 74, 311-327. doi:
10.3200/JEXE.74.4.311-328
Digman, J. M. (1989). Five robust trait dimensions: development, stability, and utility.
Journal of Personality, 57, 195–214. doi: 10.1111/j.1467-6494. 1989.tb00480.x
Digman, J. M. (1990). Personality structure: Emergence of the five-factor model.
Annual Review of Psychology, 41, 417-440. doi:
10.1146/annurev.ps.41.020190.002221
Dodd, B. G., de Ayala, R. J., & Koch, W. R. (1995). Computerized adaptive testing with
polytomous items. Applied Psychological Measurement, 19, 5-22. doi:
10.1177/014662169501900103
Eggen, J. H. M. (2012). Computerized adaptive testing item selection in computerized
adaptive learning systems. In, J. H. M. Eggen and B. P. Veldkamp (Eds.),
Psychometrics in practice at RCEC. Enschede, Netherlands: University of
Twente.
Eggen, T. J. H. M., & Verschoor, A. J. (2006). Optimal testing with easy or difficult
items in computerized adaptive testing. Applied Psychological Measurement, 30,
379-393. doi: 10.1177/0146621606288890
Eisenhart, C. (1986). Laws of error II: The Gaussian distribution. In, S. Kotz and N. L.
Johnson (Eds.), Encyclopaedia of statistical sciences (Vol 4) (pp.547-562).
Toronto, Canada: Wiley.
Embretson, S. E., & Hershberger, S. L. (1999). Summary and future of psychometric
methods in testing. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules
of measurement: What every psychologist and educator should know. Mahwah,
NJ: Lawrence Erlbaum and Associates.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists.
Mahwah, NJ: Lawrence Erlbaum Publishers.
Engelhard, G. (2013). Invariant measurement: Using Rasch models in the social,
behavioural, and health sciences. New York, NY: Routledge.
Fisher, W. P. (2008a). Other historical and philosophical perspectives on invariance in
measurement. Measurement, 6, 190-212. doi: 10.1080/15366360802265961
Fisher, W. P. (2008b). A social history of the econometric origins of some widely used
psychometric models. Manuscript submitted for publication to The European
Journal of the History of Economic Thought.
Fliege, H., Becker, J., Walter, O. B., Bjorner, J. B., Klapp, B. F., & Rose, M. (2005).
Development of a computer-adaptive test for depression (D-CAT). Quality of Life
Research, 14, 2277-2291. doi: 10.1007/s11136-005-6651-9
Flora, D. B., & Curran, P. J. (2004). The empirical evaluation of alternative methods of
estimation for confirmatory factor analysis with ordinal data. Psychological
Methods, 9, 466-491. doi: 10.1037/1082-989X.9.4.466
Forbey, J. D., & Ben-Porath, Y. S. (2007). Computerized adaptive personality testing: A
review and illustration with the MMPI-2 computerized adaptive version.
Psychological Assessment, 19, 14-24. doi: 10.1037/1040-3590.19.1.14
Forbey, J. D., Ben-Porath, Y. S., & Arbisi, P. A. (2012). The MMPI-2 computerized
adaptive version (MMPI-2-CA) in a VA medical outpatient facility. Psychological
Assessment, 24, 628-639. doi: 10.1037/a0026509
Forbey, J. D., Handel, R. W., & Ben-Porath, Y. S. (2000). A real data simulation of
computerized adaptive administration of the MMPI-A. Computers in Human Behavior,
16, 83-96. doi: 10.1037/1040-3590.4.1.26
Frey, A., & Seitz, N. N. (2009). Multidimensional adaptive testing in educational and
psychological measurement: Current state and future challenges. Studies in Educational
Evaluation, 35, 89-94. doi: 10.1016/j.stueduc.2009.10.007
Gershon, R. C. (2004). Computer adaptive testing. In E. V. Smith & R. M. Smith (Eds.),
Introduction to Rasch measurement (pp. 601 – 629). Maple Grove, MN: JAM
Press.
Georgiadou, E., Triantafillou, E., & Economides, A. A. (2006). Evaluation parameters
for computer-adaptive testing. British Journal of Educational Technology, 37,
261-278. doi: 10.1111/j.1467-8535.2005.00525.x.
Gibbons, R. D., Weiss, D. J., Kupfer, D. J., Frank, E., Fagiolini, A., Grochocinski, V. J.,
Bhaumik, D. K., Stover, A., Bock, R. D., & Immekus, J. C. (2008). Using
computerized adaptive testing to reduce the burden of mental health assessment.
Psychiatric Services, 59, 361-368. doi: 10.1176/appi.ps.59.4.361
Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure.
Psychological Assessment, 4, 26-42. doi: 10.1037/1040-3590.4.1.26
Grobler, S. (2014). The impact of language on personality assessment with the Basic Traits
Inventory. Unpublished doctoral thesis. Pretoria, South Africa: University of South
Africa.
Gu, L., & Reckase, M. D. (2007). Designing optimal item pools for computerized adaptive
tests with Sympson-Hetter exposure control. In, D. J. Weiss (Ed.), Proceedings of the
2007 GMAC conference on computerized adaptive testing. Retrieved from
http://www.psych.umn.edu/psylabs/CATCentral/
Hagquist, C., & Andrich, D. (2004). Detection of differential item functioning using analysis
of variance. Paper presented at the Second International Conference on Measurement in
Health. Perth, Australia: Murdoch University.
Haley, S. M., Ni, P., Hambleton, R. K., Slavin, M. D., & Jette, A. M. (2006). Computer
adaptive testing improved accuracy and precision scores over random item selection in
a physical functioning item bank. Journal of Clinical Epidemiology, 59, 1174-1182.
doi: 10.1016/j.jclinepi.2006.02.010
Hambleton, R. K., & Jones, R. W. (1993). A comparison of classical test theory and
item response theory and their applications to test development. Educational
Measurement: Issues and Practice, NCME Instructional Module, 38-46.
Harvey, R. J., & Hammer, A. L. (1999). Item response theory. The Counseling
Psychologist, 27, 353-383. doi: 10.1177/0011000099273004
Hart, D. L., Cook, K. F., Mioduski, J. E., Teal, C. R., & Crane, P. K. (2006). Simulated
computerized adaptive test for patients with shoulder impairments was efficient
and produced valid measures of function. Journal of Clinical Epidemiology, 59,
290-298. doi: 10.1016/j.jclinepi.2005.08.006
Higgins, D. M., Peterson, J. B., Lee, A. G. M., & Pihl, R. O. (2007). Prefrontal
cognitive ability, intelligence, Big Five personality, and the prediction of advanced
academic and workplace performance. Journal of Personality and Social
Psychology, 93, 298-319. doi: 10.1037/0022-3514.93.2.298
Hobson, E. G. (2015). Using the Rasch model in a computer adaptive testing application to
enhance the measurement quality of emotional intelligence. Unpublished doctoral
thesis. Johannesburg, South Africa: University of Johannesburg.
Hogan, T. (2014). Using a computer-adaptive test simulation to investigate test coordinators’
perceptions of a high-stakes computer-based testing programme. Published doctoral
dissertation. Georgia State University, Georgia.
Hogan, R., & Hogan, J. (2007). Hogan Personality Inventory manual. Tulsa, OK:
Hogan Assessment Systems.
Hol, A. M., Vorst, H. C. M., & Mellenbergh, G. J. (2005). A randomized experiment to
compare conventional, computerized, and computerized adaptive administration of
ordinal polytomous attitude items. Applied Psychological Measurement, 29, 159-
183. doi: 10.1177/0146621604271268
Hol, A. M., Vorst, H. C. M., & Mellenbergh, G. J. (2008). Computerized adaptive
testing of personality traits. Journal of Psychology, 216, 12-21. doi: 10.1027/0044-
3409.216.1.12
Holzinger, K. J., & Swineford, F. (1939). A study in factor analysis: The stability of a
bifactor solution. Supplementary Educational Monographs, no.48. Chicago, IL:
University of Chicago Press.
Hsu, C. L., Zhao, Y., & Wang, W. C. (2013). Exploiting computerized adaptive testing
for self-directed learning. Education in the Asia-Pacific Region: Issues, Concerns,
and Prospects, 18, 257-280. doi: 10.1007%2F978-94-007-4507-0_14
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation
Modeling: A Multidisciplinary Journal, 6, 1-55. doi: 10.1080/10705519909540118
IBM Corp. (2013). IBM SPSS statistics for Windows version 22.0. Armonk, NY: IBM
Corp.
Kaplan, R. M., & Saccuzzo, D. P. (2013). Psychological testing: Principles,
applications and issues (8th ed.). Belmont, CA: Wadsworth Cengage-Learning.
Kenny, D. A., & McCoach, D. B. (2003). Effect of the number of variables on measures
of fit in structural equation modeling. Structural Equation Modeling, 10, 333-351.
doi: 10.1207/S15328007SEM1003_1
Kersten, P., & Kayes, N. M. (2011). Outcome measurement and the use of Rasch
analysis: A statistics-free introduction. New Zealand Journal of Physiotherapy, 39,
92-99.
Kingsbury, G. G. (2009). Adaptive item calibration: A process for estimating item
parameters within a computerized adaptive test. In D. J. Weiss (Ed.), Proceedings
of the 2009 GMAC conference on computerized adaptive testing.
Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for
computerized adaptive tests. Applied Measurement in Education, 2, 359-375. doi:
10.1207/s15324818ame0204_6
Kline, T. J. B. (2005). Psychological testing: A practical approach to design and
evaluation. Thousand Oaks, CA: Sage Publications.
Koch, W. R., & Dodd, B. G. (1995). An investigation of procedures for computerized
adaptive testing using the successive intervals Rasch model. Educational and
Psychological Measurement, 55, 976-990. doi: 10.1177/0013164495055006006
Kreitzberg, C. B., Stocking, M. L., & Swanson, L. (1978). Computerized adaptive
testing: Principles and directions. Computers & Education, 2, 319-329. doi:
10.1016/0360-1315(78)90007-6
Lai, J. S., Cella, D., Chang, C. H., Bode, R. K., & Heinemann, A. W. (2003). Item
banking to improve, shorten and computerize self-reported fatigue: An illustration
of steps to create a core item bank from the FACIT-Fatigue Scale. Quality of Life
Research, 12, 485-501. doi: 10.1023%2FA%3A1025014509626
Leung, C-K., Chang, H-H., & Hua, K-T. (2000, April). Content balancing in stratified
computerized adaptive designs. Paper presented at the annual meeting of the
American Educational Research Association. New Orleans, LA.
Leung, C-K., Chang, H-H., & Hua, K-T. (2003). Computerized adaptive testing: A
comparison of three content balancing methods. The Journal of Technology,
Learning and Assessment, 5, 1-16. Available from http://www.jtla.org
Li, Y., Jiao, H., & Lissitz, R.W. (2012). Applying multidimensional IRT models in
validating test dimensionality: An example of K-12 large-scale science
assessment. Journal of Applied Testing Technology, 13, 1-27. Available online at
http://www.jattjournal.com/index.php/atp/article/view/48367
Linacre, J. M. (2014). Winsteps® Rasch measurement computer program. Beaverton,
Oregon: Winsteps.com
Linacre, J. M. (2010). When to stop removing items and persons in Rasch misfit analysis.
Rasch Measurement Transactions, 23, 1241.
Linacre, J. M. (2002a). Optimizing rating scale category effectiveness. Journal of Applied
Measurement, 3, 85-106.
Linacre, J. M. (2002b). What do infit and outfit, mean-square and standardized mean? Rasch
Measurement Transactions, 16, 878.
Linacre, J. M. (2000). Computer-adaptive testing: A methodology whose time has
come. In S. Chae, U. Kang, E. Jeon & J. M. Linacre (Eds), Development of
Computerized Middle School Achievement Test. Seoul, South Korea: Komesa
Press.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading,
MA: Addison-Wesley.
Ma, S-C., Chien, T-W., Wang, H-H., Li, Y-C., & Yui, M. S. (2014). Applying
computerized adaptive testing to the negative acts questionnaire-revised: Rasch
analysis of workplace bullying. Journal of Medical Internet Research, 16, e50.
doi: 10.2196/jmir.2819.
Macdonald, P., & Paunonen, S. V. (2002). A Monte Carlo comparison of item and
person statistics based on item response theory versus classical test theory.
Educational and Psychological Measurement, 62, 921-943. doi:
10.1177/0013164402238082
Marais, I., & Andrich, D. (2008). Effects of varying magnitude and patterns of response
dependence in the unidimensional Rasch model. Journal of Applied Measurement,
9, 105-124.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47,
149-174. doi: 10.1007/BF02296272
Maydeu-Olivares, A. (2001). Multidimensional item response theory modeling of
binary data: Large sample properties of NOHARM estimates. Journal of
Educational and Behavioural Statistics, 26, 115-132. doi:
10.3102/10769986026001051
McCrae, R. R. (2002). NEO-PI-R data from 36 cultures: further intercultural
comparisons. In R. R. McCrae, & J. Allik (Eds.), The Five-Factor Model of
Personality Across Cultures (pp. 105–125). New York, NY: Kluwer Academy.
McCrae, R. R., & Allik, J. (2002). The Five-Factor Model of personality across
cultures. New York, NY: Kluwer Academy.
McCrae, R. R., & Costa, P. T. (2010). NEO inventories: Professional manual. Lutz, FL:
Psychological Assessment Resources Inc.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence
Erlbaum and Associates.
Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil
cognitive ability tests: A meta-analysis. Psychological Bulletin, 114, 449-458. doi:
10.1037/0033-2909.114.3.449
Meijer, R. R., & Nering, M. L. (1999). Computerized adaptive testing: Overview and
introduction. Applied Psychological Measurement, 23, 187-194. doi:
10.1177/01466219922031310
Mellenbergh, G. J., & Vijn, P. (1981). The Rasch model as a loglinear model. Applied
Psychological Measurement, 5, 369-376.
Metzer, S. A., de Bruin, G. P., & Adams, B. G. (2014). Examining the construct validity of
the Basic Traits Inventory and the Ten-Item Personality Inventory in the South African
context. SA Journal of Industrial Psychology, 40(1), 1–9. doi:10.4102/sajip.v40i1.1005
Monseur, C., Baye, A., Lafontaine, D., & Quittre, V. (2011). PISA test format assessments
and the local independence assumption. IERI Monograph Series: Issues and
Methodologies in Large-Scale Assessments, 4, 131-155.
Morgan, B., & de Bruin, K. (2010). The relationship between the big five personality traits
and burnout in South African university students. South African Journal of Psychology,
40, 182-191.
Muthén, B. (1984). A general structural equation model with dichotomous, ordered
categorical, and continuous latent variable indicators. Psychometrika, 49, 115-132. doi:
10.1007/BF02294210
Ortner, T. M. (2008). Effects of changed item order: A cautionary note to practitioners
on jumping to computerized adaptive testing for personality assessment.
International Journal of Selection and Assessment, 16, 249-257. doi:
10.1111/j.1468-2389.2008.00431.x
Ortner, T. M., & Caspers, J. (2011). Consequences of test anxiety on adaptive versus
fixed item testing. European Journal of Psychological Assessment, 27, 157-163.
doi: 10.1027/1015-5759/a000062
Osborne, J. (2008). Best practices in quantitative methods: An introduction to Rasch
measurement. Sage Research Methods Online, 50-70. doi:
10.4135/9781412995627
Ostini, R., & Nering, M. L. (2006). Polytomous item response theory models:
Quantitative applications in the social sciences. London, England: Sage
Publications.
Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context
of adaptive mental testing. Journal of the American Statistical Association, 70,
351-356. doi: 10.1080/01621459.1975.10479871
Pallant, J. F., & Tennant, A. (2007). An introduction to the Rasch measurement model:
An example using the Hospital Anxiety and Depression Scale (HADS). British
Journal of Clinical Psychology, 46, 1-18. doi: 10.1348/014466506X96931
Paunonen, S. V. (1998). Hierarchical organisation of personality and prediction of behavior.
Journal of Personality and Social Psychology, 74, 538-556. doi: 10.1037/0022-
3514.74.2.538
Penfield, R. D. (2006). Applying Bayesian item selection approaches to adaptive tests using
polytomous items. Applied Measurement in Education,19, 1–20. doi:
10.1207/s15324818ame1901_1
Pitkin, A. K., & Vispoel, W. P. (2001). Differences between self-adapted and computerized
adaptive tests: A meta-analysis. Journal of Educational Measurement, 38, 235-247.
doi: 10.1111/j.1745-3984.2001.tb01125.x
Pomplun, M., Frey, S., & Becker, D. F. (2002). The score equivalence of paper and
computerized versions of a seeded test of reading comprehension. Educational and
Psychological Measurement, 62, 337-354.
Ramsay, L. J., Taylor, N., de Bruin, G. P., & Meiring, D. (2008). The big five personality
factors at work: A South African validation study. In J. Deller (Ed.), Research
contributions to personality at work (pp. 99-112). Munich, Germany: Rainer
Hampp Verlag.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests.
Chicago, IL: University of Chicago Press.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In J.
Neyman (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical
Statistics and Probability. Berkeley, CA: University of California Press.
Raykov, T., & Marcoulides, G. A. (2006). A first course in structural equation modeling
(2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates Inc.
R Core Team (2013). R: A language and environment for statistical computing. Vienna,
Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
Reckase, M. D. (2009). Multidimensional item response theory: Statistics for social and
behavioral sciences. New York, NY: Springer.
Reise, S. P., Bonifay, W. E., & Haviland, M. G. (2013). Scoring and modeling psychological
measures in the presence of multidimensionality. Journal of Personality Assessment,
95, 129-140. doi: 10.1080/00223891.2012.725437
Reise, S. P., & Henson, J. M. (2000). Computerization and adaptive administration of
the NEO PI-R. Assessment, 7, 347-364. doi: 10.1177/107319110000700404
Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations:
Exploring the extent to which multidimensional data yield univocal scale scores.
Journal of Personality Assessment, 92, 544-559. doi:
10.1080/00223891.2010.496477
Reise, S. P., Moore, T. M., & Maydeu-Olivares, A. (2011). Target rotations and assessing the
impact of model violations on the parameters of unidimensional item response theory
models. Forthcoming in Educational and Psychological Measurement.
Reise, S. P., Morizot, J., & Hays, R. D. (2007). The role of the bifactor model in resolving
dimensionality issues in health outcomes measures. Quality of Life Research, 16, 19-31.
doi: 10.1007/s11136-007-9183-7
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item
response theory: Two approaches for exploring measurement invariance. Psychological
Bulletin, 114, 552-566. doi: 10.1037/0033-2909.114.3.552
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of
Statistical Software, 48, 1-36. Available online at
www.jstatsoft.org/article/view/v048i02/v48i02.pdf
Rothstein, M. G., & Goffin, R. D. (2006). The use of personality measures in personnel
selection: What does current research support? Human Resource Management
Review, 16, 155-180. doi: 10.1016/j.hrmr.2006.03.004
Rudner, L. M. (2014). An on-line, interactive, computer adaptive testing tutorial 11/98.
Retrieved from http://echo.edres.org:8080/scripts/cat/catdemo.htm
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores.
Psychometrika Monograph Supplement, No. 17.
Schmidt, K. M., & Embretson, S. E. (2003). Item response theory and measuring abilities. In
J. A. Schinka, W. F. Velicer, & I. B. Weiner (Eds.), Handbook of psychology volume 2:
Research methods in psychology (pp. 429-445). Hoboken, NJ: Wiley & Sons.
Segall, D. O. (2005). Computerized adaptive testing. In K. Kempf-Leonard (Ed.),
Encyclopedia of social measurement. Amsterdam, Netherlands: Elsevier.
Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331-354. doi:
10.1007/bf02294343
Segall, D. O., & Moreno, K. E. (1999). Development of the computerized adaptive testing
version of the Armed Services Vocational Aptitude Battery. In F. Drasgow & J. B.
Olson-Buchanan (Eds.), Innovations in computerized assessment. New York, NY:
Lawrence Erlbaum.
Sick, J. (2011). Rasch measurement in language education part 6: Rasch measurement and
factor analysis. SHIKEN: JALT Testing & Evaluation SIG Newsletter, 15, 15-17.
Sim, S. M., & Rasiah, R. I. (2006). Relationship between item difficulty and
discrimination indices in true/false-type multiple choice questions of a para-
clinical multidisciplinary paper. Annals of the Academy of Medicine, Singapore,
35, 67-71.
Simms, L. J., Goldberg, L. R., Roberts, J. E., Watson, D., Welte, J., & Rotterman, J. H.
(2011). Computerized adaptive assessment of personality disorder: Introducing the
CAT-PD project. Journal of Personality Assessment, 93, 380-389. doi:
10.1080/00223891.2011.577475
Smith, E. V., & Smith, R. M. (2004). Introduction to Rasch measurement: Theory,
models and applications. Maple Grove, MN: JAM Press.
Smits, N., Cuijpers, P., & van Straten, A. (2011). Applying computerized adaptive
testing to the CES-D scale: A simulation study. Psychiatry Research, 188, 147-
155. doi: 10.1016/j.psychres.2010.12.001
Stark, S., Chernyshenko, O. S., Drasgow, F., & White, L. A. (2012). Adaptive testing
with multidimensional pairwise preference items: Improving the efficiency of
personality and other non-cognitive assessments. Organizational Research
Methods, 15, 463-487. doi: 10.1177/1094428112444611
Statistics South Africa (2012). Census 2011: Census in brief. Report No. 03-01-41. Pretoria,
South Africa: Statistics South Africa.
Steiger, J. H., & Lind, J. C. (1980, June). Statistically based tests for the number of common
factors. Paper presented at the annual Spring meeting of the Psychometric Society,
Iowa City, IA.
Streiner, D. L. (2010). Measure for measure: New developments in measurement and item
response theory. Canadian Journal of Psychiatry, 55, 180-186.
Taylor, N. (2004). The construction of a South African five-factor personality inventory.
Unpublished master’s dissertation. Rand Afrikaans University, Johannesburg,
South Africa.
Taylor, N. (2008). Construct, item, and response bias across cultures in personality
measurement. Unpublished doctoral thesis. Johannesburg, South Africa: University of
Johannesburg.
Taylor, N., & de Bruin, G. P. (2006). Basic Traits Inventory: Technical manual.
Johannesburg, South Africa: Jopie van Rooyen & Partners.
Taylor, N., & de Bruin, G. P. (2012). The Basic Traits Inventory. In S. Laher & K. Cockcroft
(Eds.), Psychological assessment in South Africa: Research and applications.
Johannesburg, South Africa: LittleWhiteBakkie Publishers.
Taylor, N., & de Bruin, G. P. (2013). Basic Traits Inventory: Technical manual (3rd ed.).
Randburg, South Africa: JvR Psychometrics.
ten Holt, J. C., van Duijn, M. A. J., & Boomsma, A. (2010). Scale construction and
evaluation in practice: A review of factor analysis versus item response theory
applications. Psychological Test and Assessment Modeling, 52, 272-297.
Tennant, A., & Pallant, J. F. (2007). DIF matters: A practical approach to test
differential item functioning makes a difference. Rasch Measurement
Transactions, 20, 1082-1084.
Teresi, J. A., Ramirez, M., Lai, J-S., & Silver, S. (2008). Occurrences and sources of
differential item functioning (DIF) in patient-reported outcome measures:
Description of DIF methods, and review of measures of depression, quality of life
and general health. Psychology Science Quarterly, 50, 538-612.
Thissen, D., Reeve, B. B., Bjorner, J. B., & Chang, C-H. (2007). Methodological issues for
building item banks and computerized adaptive scales. Quality of Life Research, 16,
109-119. doi: 10.1007/s11136-007-9169-5
Thompson, T., & Way, D. (2007). Investigating CAT designs to achieve comparability with a
paper test. In D. J. Weiss (Ed.), Proceedings of the 2007 GMAC Conference on
Computerized Adaptive Testing. Retrieved from
http://www.psych.umn.edu/psylabs/catcentral/pdf%20files/cat07tthompson.pdf
Thompson, N. A., & Weiss, D. J. (2011). A framework for the development of computerized
adaptive tests. Practical Assessment, Research & Evaluation, 16. Available online:
http://pareonline.net/getvn.asp?v=16&n=1.
Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 33, 529-
554.
Traub, R. E. (1997). Classical test theory in historical perspective. Educational
Measurement: Issues and Practice, 16, 8-16. doi: 10.1111/j.1745-
3992.1997.tb00603.x
Triantafillou, E., Georgiadou, E., & Economides, A. A. (2008). The design and
evaluation of a computerized adaptive test on mobile devices. Computers &
Education, 50, 1319-1330. doi: 10.1016/j.compedu.2006.12.005
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor
analysis. Psychometrika, 38, 1-10. doi: 10.1007/BF02291170
van der Linden, W. J. (1998). Bayesian item selection criteria for adaptive testing.
Psychometrika, 63, 201-216. doi: 10.1007/BF02294775
van der Linden, W. J., & Pashley, P. J. (2000). Item selection and ability estimation in
adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized
adaptive testing: Theory and practice. Amsterdam, Netherlands: Kluwer Academic
Publishers.
Veldkamp, B. P. (2003). Item selection in polytomous CAT. In H. Yanai, A. Okada, K.
Shigemasu, Y. Kano, & J. J. Meulman (Eds.), New developments in psychometrics.
Tokyo, Japan: Springer-Verlag.
Veldkamp, B. P., & van der Linden, W. J. (2000). Designing item pools for computerized
adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized
adaptive testing: Theory and practice (pp. 149-162). London, UK: Kluwer Academic
Publishers.
Verhelst, N. D., & Verstralen, H. H. F. M. (2008). Some considerations on the partial credit
model. Psicologica, 29, 229-254.
Vispoel, W. P., Rocklin, T. R., & Wang, T. (1994). Individual differences and test
administration procedures: A comparison of fixed-item, computerized-adaptive, and
self-adapted testing. Applied Measurement in Education, 7, 53-79. doi:
10.1207/s15324818ame0701_5
Vispoel, W. P., Wang, T., & Bleiler, T. (1997). Computerized adaptive and fixed-item
testing of music listening skill: A comparison of efficiency, precision, and concurrent
validity. Journal of Educational Measurement, 34, 43-63. doi: 10.1111/j.1745-
3984.1997.tb00506.x
Vogt, L., & Laher, S. (2009). The five factor model of personality and
individualism/collectivism in South Africa: An exploratory study. Psychology in
Society, 37, 39-54.
Walker, J., Böhnke, J.R., Cerny, T., & Strasser, F. (2010). Development of symptom
assessments utilising item response theory and computer adaptive testing – A practical
method based on a systematic review. Critical Reviews in Oncology/Hematology, 73,
47-67. doi: 10.1016/j.critrevonc.2009.03.007
Walter, O. B., Becker, J., Bjorner, J. B., Fliege, H., Klapp, B. F., & Rose, M. (2007).
Development and evaluation of a computer adaptive test for ‘Anxiety’ (Anxiety-CAT).
Quality of Life Research, 16, 143-155. doi: 10.1007/s11136-007-9191-7
Wang, X., Bo-Pan, W., & Harris, V. (1999). Computerized adaptive testing simulations using
real test-taker responses. Law School Admission Council Computerized Testing
Report. LSAC Research Report Series.
Wang, W-C., Chen, P-H., & Cheng, Y-Y. (2004). Improving measurement precision of test
batteries using multidimensional item response theory models. Psychological Methods,
9, 116-136. doi: 10.1037/1082-989x.9.1.116
Wang, T., & Kolen, M. J. (2001). Evaluating comparability in computerized adaptive
testing: Issues, criteria and an example. Journal of Educational Measurement, 38,
19-49. doi: 10.1111/j.1745-3984.2001.tb01115.x
Wang, H. W., & Shin, C. D. (2010). Comparability of computerized adaptive and paper
and pencil tests. Test, Measurement and Research Services Bulletin, 13, 1-7.
Available online at
http://images.pearsonassessments.com/images/tmrs/tmrs_rg/Bulletin_13.pdf
Ware, J. E., Jr., Bjorner, J. B., & Kosinski, M. (2000). Practical implications of item
response theory and computerized adaptive testing: A brief summary of ongoing
studies of widely used headache impact scales. Medical Care, 38, 1173-1182.
Wauters, K., Desmet, P., & Van den Noortgate, W. (2010). Adaptive item-based
learning environments based on item response theory: Possibilities and challenges.
Journal of Computer Assisted Learning, 26, 549-562. doi: 10.1111/j.1365-
2729.2010.00368.x
Weiss, D. J. (2004). Computerized adaptive testing for effective and efficient
measurement in counseling and education. Measurement and Evaluation in
Counseling and Development, 37, 70-84.
Weiss, D. J. (2011). Better data from better measurements using computerized adaptive
testing. Journal of Methods and Measurement in the Social Sciences, 2, 1-27.
Weiss, D. J. (2013). Item banking, test development, and test delivery. In, K. F.
Geisinger, B. A. Bracken, J. F. Carlson, J. C. Hansen, N. R. Kuncel, S. P. Reise, &
M. C. Rodriguez (Eds). APA handbook of testing and assessment in psychology,
Vol. 1: Test theory and testing and assessment in industrial and organizational
psychology, (pp.185-200). Washington, DC: American Psychological Association.
Wise, S. L., & Kingsbury, G. G. (2000). Practical issues in developing and maintaining a
computerized adaptive testing program. Psicológica, 21, 135-155.
Wright, B. D., & Douglas, G. A. (1986). A rating scale model for objective
measurement. MESA Psychometric Laboratory Memorandum No. 35.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch
Measurement Transactions, 8, 370.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.
Wright, B. D., & Stone, M. H. (1999). Measurement essentials (2nd ed.). Wilmington, DE:
Wide Range, Inc.
Yu, C. H., Popp, S. O., DiGangi, S., & Jannasch-Pennell, A. (2007). Assessing
unidimensionality: A comparison of Rasch modeling, parallel analysis, and TETRAD.
Practical Assessment, Research & Evaluation, 12. Available online:
http://pareonline.net/getvn.asp?v=12&n=14
Zhou, X. (2012). Designing p-optimal item pools in computerized adaptive tests with
polytomous items. Unpublished doctoral dissertation. Michigan State University.
Zickar, M. J., & Broadfoot, A. A. (2009). Partial revival of a dead horse? Comparing
classical test theory and item response theory. In C. E. Lance & R. J. Vandenberg
(Eds.), Statistical and methodological myths and urban legends: Doctrine, verity
and fable in the organizational and social sciences. New York, NY: Routledge
Publishers.
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach’s α, Revelle’s β, and
McDonald’s ωH: Their relations with each other and two alternative conceptualizations
of reliability. Psychometrika, 70, 123-133. doi: 10.1007/s11336-003-0974-7
Zwick, R. (2009). The investigation of differential item functioning in adaptive tests. In W.
J. van der Linden & C. A. W. Glas (Eds.), Statistics for the social and behavioral
sciences: Elements of adaptive testing (pp. 331-352). New York, NY: Springer.
Appendix A: Item usage statistics for the adaptive full and adaptive core test versions
Figure 5.1a. Item administration of the Extraversion full adaptive scale.
[Bar chart: Number of Persons (0-1,400) by Number of Items Administered (5-35).]
Figure 5.1b. Item administration of the Extraversion core adaptive scale.
[Bar chart: Number of Persons (0-1,400) by Number of Items Administered (5-23).]
Figure 5.2a. Item administration of the Neuroticism full adaptive scale.
[Bar chart: Number of Persons (0-1,400) by Number of Items Administered (5-32).]
Figure 5.2b. Item administration of the Neuroticism core adaptive scale.
[Bar chart: Number of Persons (0-1,400) by Number of Items Administered (5-23).]
Figure 5.3a. Item administration of the Conscientiousness full adaptive scale.
[Bar chart: Number of Persons (0-300) by Number of Items Administered (5-41).]
Figure 5.3b. Item administration of the Conscientiousness core adaptive scale.
[Bar chart: Number of Persons (0-1,400) by Number of Items Administered (5-23).]
Figure 5.4a. Item administration of the Openness full adaptive scale.
[Bar chart: Number of Persons (0-1,000) by Number of Items Administered (5-32).]
Figure 5.4b. Item administration of the Openness core adaptive scale.
[Bar chart: Number of Persons (0-600) by Number of Items Administered (5-23).]
Figure 5.5a. Item administration of the Agreeableness full adaptive scale.
[Bar chart: Number of Persons (0-1,200) by Number of Items Administered (5-37).]
Figure 5.5b. Item administration of the Agreeableness core adaptive scale.
[Bar chart: Number of Persons (0-800) by Number of Items Administered (5-29).]
Appendix B: Maximum attainable information with successive item administration
Figure 5.6a. Maximum attainable information contributed with each
item administered for the Extraversion full item bank.
Figure 5.6b. Maximum attainable information contributed with each
item administered for the Extraversion core item bank.
Figure 5.7a. Maximum attainable information contributed with each
item administered for the Neuroticism full item bank.
Figure 5.7b. Maximum attainable information contributed with each
item administered for the Neuroticism core item bank.
Figure 5.8a. Maximum attainable information contributed with each
item administered for the Conscientiousness full item bank.
Figure 5.8b. Maximum attainable information contributed with each
item administered for the Conscientiousness core item bank.
Figure 5.9a. Maximum attainable information contributed with each
item administered for the Openness full item bank.
Figure 5.9b. Maximum attainable information contributed with each
item administered for the Openness core item bank.
Figure 5.10a. Maximum attainable information contributed with each
item administered for the Agreeableness full item bank.
Figure 5.10b. Maximum attainable information contributed with each
item administered for the Agreeableness core item bank.
Appendix C: Number of items administered across the trait continuum
Figure 5.11a. Number of items administered across the trait
continuum for the Extraversion full item bank.
Figure 5.11b. Number of items administered across the trait
continuum for the Extraversion core item bank.
Figure 5.12a. Number of items administered across the trait
continuum for the Neuroticism full item bank.
Figure 5.12b. Number of items administered across the trait
continuum for the Neuroticism core item bank.
Figure 5.13a. Number of items administered across the trait
continuum for the Conscientiousness full item bank.
Figure 5.13b. Number of items administered across the trait
continuum for the Conscientiousness core item bank.
Figure 5.14a. Number of items administered across the trait
continuum for the Openness full item bank.
Figure 5.14b. Number of items administered across the trait
continuum for the Openness core item bank.