ITEM RESPONSE THEORY:

APPLICATIONS OF MODERN TEST THEORY

IN SKIN CANCER RESEARCH

Ngadiman Djaja

B. Psy (Hons), M.Ed (Research, Assessment & Evaluation)

Submitted in fulfilment of the requirements for the degree of

Doctor of Philosophy

CENTRE FOR RESEARCH EXCELLENCE IN SUN & HEALTH

Institute of Health and Biomedical Innovation

School of Public Health and Social Work | Faculty of Health

QUEENSLAND UNIVERSITY OF TECHNOLOGY

2017

Item Response Theory: Applications of modern test theory in skin cancer research i

Keywords

Computer adaptive test, item response theory, Rasch model, risk factors, skin cancer,

sun-exposure behaviours, sun-protection behaviours.

ii Chapter 1: Introduction

Abstract

The overall objective of this PhD study was to assess the feasibility of applying Item

Response Theory (IRT) to self-reported skin cancer risk questionnaires. This process

was divided into separate studies described in the five articles presented in this

dissertation.

The first study used secondary data from the Queensland University of Technology’s

Skin Awareness study. The objective of the first study was to determine how well the

ten-item skin self-examination attitude scale fit the requirements of a Rasch rating

scale model. The Rasch rating scale model is the most common model used to

analyse Likert-scale questions. The skin self-examination attitude scale showed good

internal reliability; eight out of ten items fit the Rasch model, and thus

possessed unidimensional measurement characteristics. The skin self-examination

attitude scale can be improved in the future by adding items that measure a strong

positive attitude towards skin self-examination. This study was published in Health

and Quality of Life Outcomes:

(http://hqlo.biomedcentral.com/articles/10.1186/s12955-014-0189-x).

The second research study, “Changes in self-reported sun-protection behaviours due

to concern about vitamin D status,” investigated changes in skin cancer prevention

behaviour people may undertake due to concern about vitamin D. Any decrease in

skin cancer prevention behaviours may increase future skin cancer risk. The study

used secondary data from the AusD study. The study conducted a cross-sectional

survey across the four seasons (2009-10) and latitudes ranging from 19-43°S. The

survey assessed vitamin D attitudes and changes in sun protection behaviours arising

from concerns about low vitamin D levels. Rasch partial credit models were used to

illustrate the potential effect of changing sun-protection behaviours due to concern

about vitamin D. This study was published in Photochemistry and Photobiology:

(http://onlinelibrary.wiley.com/doi/10.1111/php.12582/full)

The third study, “Advantages of Mobile Computer-Adaptive Testing (CAT) to

Quickly Estimate Skin Cancer Risk,” used secondary data from the QSkin study.

This study was devoted to the application of item response theory for computer

adaptive testing to reduce response burden in skin cancer risk assessment. The study

compared the efficiency of non-adaptive testing and computer adaptive testing

facilitated by the partial credit model derived calibration of the QSkin skin cancer

risk questionnaire. Computer adaptive testing yielded a smaller standard error of the

estimated measure than non-adaptive testing and was substantially more efficient

without loss of precision, reducing response burden by 48%, 66%, and 66% for

dichotomous, rating scale, and partial credit models, respectively. This study was

published in the Journal of Medical Internet Research

(http://www.jmir.org/2016/1/e22/).

The fourth study, “Diagnostic Discrimination of a Skin Cancer Risk Scale,” used

secondary data from the QSkin skin cancer risk questionnaire. The study objective

was to calibrate existing skin cancer-related questionnaires using a partial credit

model and examine their predictive discrimination of non-melanoma skin cancer

prospectively. Diagnostic discrimination showed area under the curve statistics of

.753 (p < .001), .530 (p < .001), and .487 (p = .093) for the phenotype, sun exposure

behaviours, and sun protection behaviours subscales, respectively. A full paper was

presented at the International Outcome Measurement Conference in Chicago, April

2015.

The fifth study, “Development and Psychometric Evaluation of Skin Cancer Risk

Scale Utilising Item Response Theory,” aimed to develop a skin cancer risk scale

with strong measurement qualities utilising a modern test theory approach. The study

combined the best questions from existing skin cancer questionnaires used in Studies

2, 3, and 4, then calibrated them using a partial credit item response theory model to

create a scale measuring underlying skin cancer risk. The study found that 50 items

within three skin cancer risk subscales had good psychometric properties (validity

and reliability). A draft manuscript is presented in Chapter 6.

To the best of our knowledge, this work is the first study that comprehensively uses

an item response theory approach to analyse data collected from skin cancer-related

questionnaires and then develop a comprehensive questionnaire measuring skin

cancer risk. Overall, this dissertation explored current and item response theory-based

approaches to skin cancer risk-related measurement and provided empirical

evidence regarding the benefits of integrating item response theory modelling. The

five studies presented in this thesis demonstrate the advantages of item response

theory in various applications within skin cancer research, including providing a first

set of items that could form part of a future skin cancer risk measurement item bank.

The development of the SunAus scale extended item response theory application into

the skin cancer field. The successful implementation of the scale in discriminating a

person’s risk based on their phenotype provides a good model for future studies. The

scale improves on previous measures through greater content coverage and

precision and, if used with computer adaptive testing, will be more economical

and less burdensome for respondents when predicting risk. More research is

required to establish the benefits of modern test theory to measure skin cancer risk

and behaviours more accurately. This thesis makes a significant contribution to

knowledge generation, public health practice, and policy-related issues in the field of

skin cancer prevention by working towards more efficient and precise measurement

tools for future use.

A Note Regarding Format

This dissertation is a thesis by publication. It contains five publications that have

either been published or are under blind peer review by refereed journals; therefore,

the wording of the articles is as published. The logical flow of the thesis is

maintained by introducing these articles where they fit most appropriately into the

thesis structure. The thesis uses the AMA numbered referencing style, with each

publication chapter containing its own reference list, and the references for Chapters

1 and 7 contained in the main reference list at the end of the document. The articles

have been reconfigured to Word to provide consistent formatting throughout the

thesis. Moreover, tables and figures have been numbered continuously throughout

the thesis, for consistency.

Table of Contents

Keywords

Abstract

A Note Regarding Format

Table of Contents

List of Figures

List of Tables

List of Abbreviations

Definition of Key Terms

List of Publications and Presentations

Statement of Original Authorship

Acknowledgements

Chapter 1: Introduction

    Background

    Brief literature review

        1.2.1 Skin cancer-related measures

        1.2.2 Brief overview of Test Theory

        1.2.3 Differences between Classical Test Theory and Item Response Theory

        1.2.4 Test Development using Classical Test Theory and Item Response Theory

        1.2.5 1-Parameter Logistic (1-PL) Model or The Rasch Model

        1.2.6 2-Parameter Logistic (2-PL) Model

        1.2.7 3-Parameter Logistic (3-PL) Model

        1.2.8 Partial Credit Model

        1.2.9 Rating Scale Model

    Choosing a model

    Limitations of item response theory

    Purpose of this doctoral work

    Research questions

    Significance of the thesis

    Thesis outline

Chapter 2: Evaluation of Skin Self-examination Attitude Scale Using an Item Response Theory Model Approach

    Abstract

    Introduction

    Methods

    Results

    Discussion

    Conclusion

    References

Chapter 3: Self-reported Changes in Sun-Protection Behaviours at Different Latitudes in Australia

    Abstract

    Introduction

    Material and methods

    Results

    Discussion

    Conclusion

    References

Chapter 4: Estimating Skin Cancer Risk: Evaluating Mobile Computer Adaptive Testing

    Abstract

    Introduction

    Methods

    Results

    Discussion

    Conclusions

    References

Chapter 5: Diagnostic Discrimination of the Skin Cancer Risk (SCR) Scale: Application of Item Response Theory

    Abstract

    Introduction

    Methods

    Results

    Discussion

    References

Chapter 6: Development and Psychometric Evaluation of Item Banks for the Assessment of Skin Cancer Risk Using Item Response Theory

    Abstract

    Introduction

    Methods

    Results

    Discussion

    Conclusions

    References

Chapter 7: Discussion

    Summary of the Main Findings

    Discussion of the Main Findings

        7.2.1 Item Response Theory as a tool for evaluating psychometric properties of a questionnaire.

        7.2.2 Item Response Theory as a tool for developing a new questionnaire.

        7.2.3 Use of Computer Adaptive Test to reduce participants’ burden.

    The Assessment of Skin Cancer Risk.

    Methodological Considerations and Future Studies

        7.4.1 Limitations of the research

        7.4.2 Suggestions for Future Implementation

    Conclusion

References

Appendices

List of Figures

Figure 1.1-1: Illustration of IRT

Figure 1.1-2: Uniform differential item functioning

Figure 1.1-3: Non-uniform differential item functioning

Figure 1.1-4: Item Characteristic Curve from 2 dichotomous items

Figure 1.1-5: Category response function of a polytomous item

Figure 1.1-6: Item information functions for 8 items

Figure 1.1-7: Standard error of measurement

Figure 1.1-8: Illustration of computer adaptive testing

Figure 1.1-9: Sample of test linking

Figure 1.1-10: One-Parameter Item Characteristic Curves for Four Typical Items

Figure 1.1-11: 2-Parameter Item Characteristic Curves for Four Typical Items

Figure 1.1-12: 3-Parameter Item Characteristic Curves for Four Typical Items

Figure 1.1-13: Overview of current doctoral work

Figure 2.1: Wright map/item person map of Skin Self-Examination Attitude Scale with the mean theta of person on the left and mean theta of items on the right.

Figure 3.1: Item Information Functions from two items plotted along the latent trait logits of skin cancer predisposition

Figure 3.2: Supplement 2. The distribution of the skin cancer predisposition (scores converted to T score)

Figure 4.1: Sample selection flowchart

Figure 4.2: Study simulation and CAT flowchart.

Figure 4.3: Determining a cut-off point

Figure 4.4: Generated with 3 Rasch Models

Figure 4.5: Efficiency and precision of CAT and compared to using 10, 20 or 30 items in static NAT format.

Figure 4.6: A graphical CAT report shown after each response (top) and the more item length, the less standard errors in CAT process (bottom)

Figure 5.1: Items person map of PH subscale.

Figure 5.2: Items person map of SE subscale.

Figure 5.3: Items person map of SP subscale.

Figure 5.4: Example of most probable response for a person with skin cancer risk in PH scale of 0.5 logits.

Figure 5.5: Category probability curves of item PH3

Figure 5.6: ROC curve

Figure 6.1: Recruitment of participants.

Figure 6.2: Steps in data analysis

Figure 6.3: The Wright map for phenotype subscale.

Figure 6.4: Distribution of Standard Error Measurement for each domain: a) phenotype, b) sun exposure, c) sun protection

Figure 6.5: ROC Curves of outcome variables

Figure 6.6: A1. Sun exposure behaviours scale item map

Figure 6.7: A2. Sun protection behaviours scale item map

Figure 7.1: Roadmap for an International Skin Cancer Risk Item Bank

List of Tables

Table 1.1: Studies examining the psychometrics properties of skin cancer related measures

Table 1.2: Differences between Classical and Item Response Theories

Table 1.3: Steps in test development

Table 1.4: Taxonomy of IRT Models

Table 2.1: Item total correlation, fit statistics and item difficulty for the 10-item Skin Self-Examination Attitude Scale

Table 2.2: DIF statistics for the 8-item skin self-examination attitude scale

Table 3.1: Differences in vitamin D attitudes and sun protection behaviours by location

Table 3.2: Vitamin D-related attitudes and self-reported changes in sun protection behaviours

Table 3.3: Multivariable logistic regression models of associations between vitamin D-related attitudes and changes made during the last summer to the way people protected themselves from the sun so they can get enough vitamin D

Table 3.4: Item location and fit statistics of sun protection behaviour items calibrated within a skin cancer predisposition model.

Table 3.5: Supplement 1. Demographic and Phenotypic characteristics of the participants (n=1,002)

Table 4.1: 10, 20, or 30 items in static NAT format.

Table 4.2: Precision of CAT.

Table 4.3: Efficiency of CAT

Table 5.1: Item parameter estimations and fit statistics of skin cancer risk (SCR) scale

Table 5.2: Skin cancer risk score for each subscale in validation sample

Table 6.1: Overview of items measured on the SunAus Scale

Table 6.2: Characteristics of Study Participants (N=1,177) and 2011 Australia census – Queensland (QLD) State only [43]

Table 6.3: Item parameter estimations and fit statistics of phenotype scale

Table 6.4: Item parameter estimations and fit statistics of sun exposure behaviours scale

Table 6.5: Item parameter estimations and fit statistics of sun protection behaviours scale

Table 6.7: Supplement 1: Table for conversion of phenotype scale summed item scores to Rasch measures

Table 6.8: Supplement 2: Table for conversion of sun exposure behavior scale summed item scores to Rasch measures

Table 6.9: Supplement 3: Table for conversion of sun protection behavior scale summed item scores to Rasch measures.

List of Abbreviations

1PL One-Parameter Logistic Model

2PL Two-Parameter Logistic Model

3PL Three-Parameter Logistic Model

CAT Computer Adaptive Test

CTT Classical Test Theory

DIF Differential Item Functioning

IRT Item Response Theory

PCM Partial Credit Model

RSM Rating Scale Model

ROC Receiver Operating Characteristic

SEM Standard Error of Measurement

Definition of Key Terms

1PL-IRT An item response model that estimates one item parameter -

item difficulty.

2PL-IRT An item response model that estimates two item parameters -

item difficulty and item discrimination.

3PL-IRT An item response model that estimates three item parameters -

item difficulty, item discrimination, and pseudo guessing.

Ability The quality of being able to do something.

a parameter Known as item discrimination (slope) parameter in Item

Response Theory.

b parameter Known as item difficulty parameter in item response theory.

Bias The effect of any factor that the researcher did not expect to

influence the dependent variable.

c parameter Known as pseudo guessing parameter in item response theory.

Calibration The procedure of estimating a person’s ability or item difficulty

by converting raw scores to logits on an objective measurement

scale.

Classical test theory A model in which any observed test score (O) is

envisioned as the composite of two hypothetical components: a

true score (T) and a random error component (E).

Construct A single latent trait, characteristic, attribute, or dimension

assumed to be underlying a set of items.

Dichotomous Data that have only two values, such as right/wrong, pass/fail,

yes/no, agree/disagree, male/female.

Differential item functioning The loss of invariance of item estimates across testing

occasions. Differential item functioning is evidence of item

bias.

Item Individual question or statement that measures a single content

area.

Item bank A collection of items.

Item characteristic curve A curve that describes the probability of response on an item

given a certain ability level.

Item response theory Mathematical models of how examinees at different ability

levels for a given trait should respond to a test item.

Likert scale A series of statements or questions that is used to measure

people’s attitudes, behaviours, values, and opinions.

Measurement error Inaccuracy resulting from a flaw in measuring instruments.

Measurement precision The accuracy of any measurement.

Objective measurement The repetition of a unit amount that maintains its size, within an

allowable range of error, no matter which instrument intended

to measure the variable of interest is used, and no matter which

relevant person or thing is measured.

Partial credit model An item response theory model for polytomous data, which

allows the number of ordered item categories and/or their

threshold values to vary from item to item.

Polytomous An item having more than two response categories. For

example, a five-point Likert type scale.

Reliability A measure of the consistency of an instrument’s score over

time.

Scale Consists of multiple items that measure a single domain, such

as anxiety.

Standard error of measurement Describes an expected observed score fluctuation due to error in

the measurement tool. Standard deviation of error about an

estimated score. In classical test theory, the standard error of

measurement is the same for all score levels; in item response

theory it can vary from score to score, and therefore can be used

as a termination criterion in computer adaptive testing when the

number of items is allowed to vary until a threshold standard

error of measurement is reached.

Theta (θ) Unobservable construct (or latent variable) being measured by a

scale. It is estimated from the responses people give to test

items that have been previously calibrated by an item response

theory model.

Threshold The level at which the likelihood of failure to agree with or

endorse a given response category below the threshold turns to

the likelihood of agreeing with or endorsing the category above

the threshold.

Trait An unobservable latent dimension, such as stress, well-being, or

pain, which is thought to give rise to a set of observed item

responses. In item response theory, the latent trait being

measured by a scale is denoted as theta (θ).

Unidimensional A basic concept in scientific measurement that only one

attribute of an object be measured at a time. The item response

theory model requires a single construct to underlie the items

that form a hierarchical continuum.

Validity Refers to the degree to which evidence and theory support the

interpretations of test scores entailed by the proposed use of the

tests.

List of Publications and Presentations

THE FOLLOWING PAPERS HAVE BEEN PUBLISHED DURING MY

CANDIDATURE

Publications included in the thesis:

Djaja N, Youl P, Aitken J, Janda M. Evaluation of a skin self-examination attitude

scale using an item response theory model approach. Health and Quality of Life

Outcomes. 2014; 12(1): 189. doi: 10.1186/s12955-014-0189-x

Djaja N, Janda M, Lucas RM, Harrison SL, van der Mei I, Ebeling PR, Neale RE,

Whiteman DC, Nowak M, Kimlin MG. Self-Reported Changes in Sun-Protection

Behaviours at Different Latitudes in Australia. Photochem Photobiol. 2016; 92: 495–

502. doi:10.1111/php.12582

Djaja N, Janda M, Olsen CM, Whiteman DC, Chien TW. Estimating Skin Cancer

Risk: Evaluating Mobile Computer-Adaptive Testing. J Med Internet Res. 2016;

18(1): e22. doi: 10.2196/jmir.4736.

Djaja N, Janda M, Olsen CM, Whiteman DC. Diagnostic Discrimination of the Skin

Cancer Risk (SCR) scale: Application of Item Response Theory. International

Outcome Measurement Conference. Chicago. 2015.

Djaja N, Youl P, Whiteman DC, White K, Kimlin M, Janda M. "Development and

Psychometrics Evaluation of Skin Cancer Risk Scale Utilising Item Response

Theory". (Draft). 2016.

Relevant publications (with QUT affiliation) not included in the thesis:

1. Kimlin JA, Black AA, Djaja N, Wood JM. Development and validation of a

vision and night driving questionnaire. Ophthalmic and Physiological Optics. 2016;

36(4), 465-476. doi:10.1111/opo.12307

Conference publications during candidature:

1. Djaja N, Youl P, Aitken J, Janda, M. An Item Response Theory analysis of The

Skin Self-Examination Awareness Scale: An application of modern measurement

theory in skin cancer prevention research. Paper presented at the meeting of the

Pacific-Rim Objective Measurement Symposium, Kaohsiung, Taiwan. August 2013.

2. Djaja N, Youl P, Aitken J, Janda M. An Item Response Theory analysis of The

Skin Self-Examination Awareness Scale: An application of modern measurement

theory in skin cancer prevention research. Poster session presented at the meeting of

the Global Controversies and Advances in Skin Cancer. Brisbane, Australia. 2013.

3. Djaja N, Janda M, Olsen CM, Whiteman DC. Assessing the measurement quality

of a Skin Cancer Risk questionnaire using a Rasch modelling approach. Paper

presented at the meeting of the Pacific-Rim Objective Measurement Symposium,

Guangzhou, China. August 2014.

4. Djaja N, Janda M, Olsen CM, Whiteman DC. Assessing the measurement quality

of a Skin Cancer Risk questionnaire using a Rasch modelling approach. Paper

presented at the meeting of the International Objective Measurement Conference,

Chicago, IL. April 2015.

5. Djaja N, Janda M, Olsen CM, Whiteman DC. Should we continue to measure skin

cancer risk factors using outdated methods? Paper presented at the meeting of the

3rd International Conference on UV and Skin Cancer Prevention, Melbourne,

Australia. December 2015.

Awards and grants during candidature:

1. Travel Grant Awards

a) Applied Psychological Measurement (USD 1,000)

2. Scholarships

a) Centre of Research Excellence in Sun and Health (CRESH) Scholarship.

b) QUT Tuition Fee Waiver Scholarship, Queensland University of Technology,

Australia.

c) Top-Up Scholarship, Queensland University of Technology, Australia.

QUT Verified Signature

Acknowledgements

I would like to take this opportunity to express my thanks to those who assisted me

with various aspects of conducting the research and writing of this thesis.

Special thanks to Professor Michael Kimlin, Director of Centre for Research

Excellence in Sun and Health (CRESH) and other CRESH investigators for their

support throughout my scholarship for my PhD. Thanks also to Applied

Psychological Measurement Inc. and Journal of Computer Adaptive Testing

(Minneapolis, U.S.A.) for the student travel grant that enabled me to present some of

my study at a conference.

I am grateful to my co-authors: (1) Dr Catherine Olsen and Professor David C

Whiteman from the QIMR Berghofer Medical Research Institute; (2) Associate

Professor Pip Youl and Professor Joanne Aitken from Cancer Council Queensland;

(3) AusD investigators Professor Robyn M Lucas, Dr Simone L Harrison, Professor

Ingrid van der Mei, Professor Peter R Ebeling, Professor Dr Rachel E Neale, Dr

Madeline Nowak, and Professor Michael Kimlin; and (4) Dr Tsar-Wei Chien from

the Chimei Medical Centre.

Special thanks also to Director of Assessment Research Centre at The Hong Kong

Institute of Education, Professor Wang Wen Chung and Dr Tsar-Wei Chien from the

Chimei Medical Centre, Taiwan, for the internship opportunity to learn computer

adaptive testing.

I would also like to thank Dr Martin Reese who proofread this thesis during my

candidature. My thanks also to professional editor, Kylie Morris, who provided

copyediting and proofreading services of the non-published portions of this thesis,

according to university-endorsed guidelines and the Australian Standards for editing

research theses.

I want to thank my dear friends and colleagues at the iHop Research group, Linda

Finch, Anna Finnane, Benjamin Singh, Caitlin Horsham, Jena Buchan, Kelly

Prosser, Melissa Creed, Saira Sanjida, and Professor Sandi Hayes; and CRESH

students Lindsay Brandon, Shanchita Khan, and Huong Tran Cam Dang.

Last but not least, I especially wish to thank my supervisors, Professor Monika

Janda, Professor Michael Kimlin, and Professor David Whiteman. It is a privilege

simply to be associated with them, to learn from the best in the field. Words cannot

express my gratitude to them.

This thesis is dedicated to my partner and family in Indonesia.

Chapter 1: Introduction

BACKGROUND

Skin cancers include three common types: melanomas,1,2 basal cell carcinomas (BCC),3-5 and

squamous cell carcinomas (SCC). A large amount of research focuses on skin cancer

prevention, better understanding the risk factors for skin cancer, and improving early

detection and treatment of skin cancers. In the past five years, there has also been increasing

interest in vitamin D, a steroid hormone that requires exposure of the skin to the sun for

synthesis,6-10 which is hypothesised to be inversely associated with several types of

cancer.11,12 Australia is one of the countries that has led the research in both fields; this is

likely due to the high rates of skin cancer in Australia,13 and the unexpectedly high rates of

vitamin D deficiency, despite high ambient ultraviolet radiation.14,15

Melanoma, BCC, and SCC incidence rates all vary according to geographic locality, with the

highest rates in the northern parts of Australia,16 such as Queensland. On average, two out of

every three Australians will be treated for one of these three types of skin cancer at some

stage during their lives and it is estimated that approximately 80 percent of all new cancers

diagnosed in Australia are skin cancers.13 The diagnosis and treatment of non-melanoma skin

cancer (the collective term for BCC and SCC) was estimated to have cost the Australian

community $703 million (95% CI, $674.6–$731.4 million) in 2015, with an estimated

940,000 people receiving treatment for skin cancers each year.17

Many self-reported measures (questionnaires, surveys, rating scales, or interviews) have been

developed to assess melanoma risk or its components.18-20 Various terms are used in the

scientific community that have similar meaning to self-reported measure, such as

questionnaire,21-25 assessment,26-28 test,29-32 scale,33,34 survey,35-38 instrument,39 or inventory.40

All of these terms are generally used to refer to any procedure that aims to obtain self-

reported data in education, psychology, public health, and other fields by people giving

answers to questions in a paper-pencil form. In this thesis, these terms are used

interchangeably when referring to any procedure aiming to obtain self-reported data from

participants.

The questionnaires used in skin cancer-related studies are commonly designed for people to

self-complete. Broadly similar questionnaires that aim to assess risk, attitudinal, or

behavioural dimensions have been used in different geographical and cultural contexts and

among differing sub-populations.41-44 However, the questionnaires often appear not to have

been developed according to current psychometric standards, including assessment of

objectivity (clear meaning, understood the same way by different people or subgroups of the

population); validity, reliability, stability, or sensitivity to change; error of measurement;

norms; or score comparability (for those measuring the same construct).45 The few

questionnaires that reported a limited set of psychometric properties were developed using

traditional test theory (classical test theory).25,33,46,47 Only one study48 was found that used

a more modern method, called item response theory (IRT), to assess questionnaire quality

during measurement development. IRT is a group of mathematical models suitable for the

assessment of the quality of a questionnaire that aims to capture constructs such as people’s

attitudes, intentions, or self-reported behaviours.49 Attitudes, intentions, or self-reported

behaviours are also called latent traits, as they cannot be directly observed.

Figure 1.1-1 below helps to demonstrate how IRT works. Assume the line demarked with two

arrows on either side represents the skin cancer risk continuum, and that three items are used

to measure skin cancer risk. Items are placed in order of difficulty/severity. Easier items are

on the left and harder on the right. “Easy items” are those that are most likely to be answered

in the affirmative even by people with low skin cancer risk, and “hard items” are those that are most

likely to be answered in the affirmative only by people with high skin cancer risk. The concept of easy-hard

items is similar to the concept of relative risk or risk ratio in epidemiology; a “hard” item will

have a larger risk ratio compared to an “easy” item. Consider, for example, a polytomous item with

five category options measuring frequency of sunscreen usage: never, rarely, sometimes,

often and always. A person who chooses never will have a higher risk ratio compared to a

person who selects always as their answer. In IRT, the answers that people give to questions

will place them along the unobservable latent trait continuum depending on their

characteristics. For example, Figure 1.1-1 displays a person with low skin cancer risk called

Alex. Hypothetically, he should have dark hair, and be an indoor worker who has not had

sunburn in the last five years. In contrast, Bob, a person with high skin cancer risk, is

hypothesised to have ancestors from Ireland, light skin, blue eyes, and many freckles.

This illustrates one of the main benefits of IRT: for each item, and indeed each answer

category of each item, one should know where on the underlying skin cancer risk continuum it

measures.


Figure 1.1-1: Illustration of IRT

As mentioned above, Queensland has the highest incidence of melanoma in the world,13,50-52

and is ideally placed to study the natural history, risk factors, and treatment patterns of skin

cancer. Research conducted in this area of high incidence, and therefore interest in skin

cancer more broadly, provides an ideal opportunity for initial item validation and assessment

of the psychometric properties of new measures and comparison with other skin cancer-

related measures previously used in Queensland, Australia, and worldwide.

BRIEF LITERATURE REVIEW

1.2.1 Skin cancer-related measures

Sun exposure, such as sunbathing, sun bed use, and other types of exposure to ultraviolet

radiation (UVR), as well as resulting sunburn are the major preventable risk factors for skin

cancer.12,53-56 Many studies have been conducted worldwide in an attempt to gather detailed

information regarding what contributes to skin cancer risk or protection, including sun

exposure behaviours, sun protection behaviours, and knowledge and concern about vitamin

D.8,10,13,57-59

Self-reported measures have been used frequently in those studies as the preferred method to

obtain skin cancer risk factor information, as they are considered more convenient and less

burdensome than other methods (sun diary, sunscreen swabbing, direct observation, or UV

personal dosimeter). This is evidenced in Table 1.1, which summarises the studies conducted

during the last 10 years that assessed psychometric properties of skin cancer-related

measures. These studies24,46,47,53 attempted to develop measures to assess skin cancer risk,

solar UVR exposure, or sun protection behaviour using classical psychometric theory. The

results in Table 1.1 show that only one study applied an IRT approach to assess the

psychometric properties of their measures.48

However, previous research has at least two major limitations. First, the majority of studies

were not comprehensive, and only focused on a few aspects of phenotype, sun exposure, or

sun protection behaviours, often due to time or budget constraints.39 Second, time and effort

spent on measure design and development was often limited, and little methodological

research has been conducted to ascertain the appropriate comprehensive validity and

reliability testing of the questions used in skin cancer research studies.60

Bränström et al60 and Horsburgh-McLeod et al48 are two examples of the few studies33,53,61-63

that have focused more strongly on psychometric properties in their work. Bränström et

al60 investigated the stability (test-retest reliability) of measuring behaviours and attitudes

related to sun exposure. They found that items assessing people’s skin type and tendency to

burn showed moderate stability (Kw =0.67 to 0.81), while items assessing self-efficacy and

risk perception with regards to sunbathing were less stable (Kw = 0.40 to 0.73). Horsburgh-

McLeod et al48 applied IRT to a suntan attitude scale. They examined attitudes toward sun-

tanning among 6,200 New Zealand adults (15-69 years) using seven items to be answered on

five-point Likert-type scales. In this study, they used Rasch rating scale models to assess the

construct validity of a scale on attitudes towards sun tanning. Based on their results, the scale

had acceptable fit with the Rasch model (infit and outfit statistics within 0.6-1.4),64 with the

exception of one item (infit = 1.96; outfit = 2.20), which fell outside the acceptable range.

Although IRT has not been used extensively during questionnaire development for the

assessment of skin cancer risk, it is widely used in other areas. Several large-scale projects are

currently underway that have developed large databases of IRT-tested item banks. Examples

include the Patient-Reported Outcome Measurement Information System,65 Programme for

International Student Assessment,66 and Trends in International Mathematics and Science

Study.67 IRT is increasingly used in the development of scales in other areas of health

research, for example: (i) psychological scales such as: the Postpartum Depression Screening

Scale,68 the Hamilton Depression Rating Scale,69 Positive and Negative Syndrome Scale,70

and NEO five factor inventory;71 (ii) selection tests such as Test of English as a Foreign

Language,72 Graduate Record Examination,73 Graduate Management Admission Test;74,75 and

(iii) licensure examinations such as the national nurse licensure examinations76 and National

Board of Medical Examiners tests.77


Table 1.1: Studies examining the psychometric properties of skin cancer-related measures

| No | Author | Measured variable(s) | Methods | No of items | Psychometric analysis performed |
|---|---|---|---|---|---|
| 1 | Oh et al., 2004 (ref. 221) | Sun protection | Self-report | 43 items | Inter-rater agreement (Kappa): 0.76 to 0.97 |
| 2 | Tripp et al., 2003 (ref. 33) | Sun protection | Self-report | 42 items | Confirmatory factor analysis: sunscreen-use behavioural scale (CFI = 0.94; GFI = 0.93); sun-avoidance scale (CFI = 0.91; GFI = 0.98) |
| 3 | Glanz et al., 2009 (ref. 46) | Sunscreen use | Self-report; diary; swabbing | N/A | Inter-rater agreement (Kappa): children 0.40; lifeguards 0.34; parents 0.27 |
| 4 | Jennings et al., 2012 (ref. 53) | Sun exposure; sun protective practices | Self-report | 15 items | Test-retest reliability (Kappa): 0.35 to 1.00; construct validation (logistic regression); internal consistency (Cronbach alpha): 0.77 to 0.80 |
| 5 | Bränström et al., 2002 (ref. 60) | Behaviours and attitudes toward sun exposure | Self-report | 33 items | Test-retest reliability (Kappa and Pearson): 0.81, 0.88 and 0.71 |
| 6 | Dusza, Oliveria, Geller, Marghoob, & Halpern, 2005 (ref. 62) | Sun exposure; sun protective practices | Self-report | 10 items | Kappa: 0.52 to 0.73 |
| 7 | Horsburgh-McLeod et al., 2010 (ref. 48) | Attitude toward suntan | Self-report | 7 items | Internal consistency (Cronbach alpha): 0.77; validity (Spearman & Pearson); Rasch; 2-PL IRT models; 3-PL IRT models |
| 8 | Morze et al., 2012 (ref. 63) | Phenotypic characteristics; sun exposure | Self-report | 37 items | Intraclass correlation coefficient; Kappa: 0.87 |
| 9 | O'Riordan, Glanz, Gies, & Elliott, 2008 (ref. 217) | Sun exposure; sun protective practices | Self-report; sunscreen swabbing; direct observation; diary; polysulphone dosimeters | N/A | Kappa: 0.21 to 0.72; ANOVA |
| 10 | Cargill et al., 2013 (ref. 165) | Sun exposure; skin pigmentation | Sun diary; UV dosimeter | N/A | Correlation |
| 11 | Thieden, Philipsen, & Wulf, 2006 (ref. 246) | Sun exposure | Sun diary; UV dosimeter | N/A | Mann-Whitney U; Wilcoxon |
| 12 | Hedges & Scriven, 2010 (ref. 47) | Attitude and behaviour to sun protection | Interview; observation | N/A | Chi-square |
| 13 | Humayun et al., 2012 (ref. 24) | Sun exposure | Questionnaire (interviewer administered and self-administered); UV dosimeter | N/A | Correlation |
| 14 | Detert, Hedlund, Anderson, Rodvall, Festin, Whiteman, Falk (ref. 61) | Sun exposure and protection index (SEPI); Readiness to Alter Sun Protective Behaviour questionnaire (RASP-B) | Questionnaire | SEPI: 8 + 5 items; RASP-B: 12 items | Cronbach alpha (0.69–0.73) |

1.2.2 Brief overview of Test Theory

The history of classical test theory (CTT) began in the early 20th century, when in

1904 Charles Spearman demonstrated how to obtain an index of reliability to correct

a correlation coefficient for attenuation due to measurement error.78 This formula

later became known as the Spearman–Brown prophecy formula.79 The CTT model has

been used for around a century, as the model is simple and many researchers know

the basic terms and procedures, which makes classical test theory easy to apply and

interpret.

CTT is based on the proposition that the observed score is composed of a true score

(called a latent variable) and a measurement error; it postulates that if the

measurement error is zero then the observed score should be equal to the true score.79

CTT attempts to estimate the true score by reducing the measurement error of a test

as a whole. CTT contributes to the science of test development by providing a

framework to assess some aspects of the quality of measurement (e.g. test reliability).
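
The decomposition described above can be written compactly as follows. The symbols X (observed score), T (true score), and E (measurement error) are conventional notation assumed here for illustration; they are not defined in the text above.

```latex
X = T + E, \qquad \text{so that } E = 0 \;\Rightarrow\; X = T
```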

Although CTT has been used extensively in test development and assessment of

quality of measurement, the method has shortcomings; CTT has weak assumptions

regarding its framework.80 In CTT, the individual ability parameter is dependent on a

given question tested under a given situation, and the item difficulty parameter also

depends on a specific group of participants assumed to be a representative sample of

a given population. The inherent characteristics of sample dependence and item

dependence in CTT make it impossible to predict an individual’s response to an

item unless that item has been previously administered to similar individuals.81 Other

limitations include: (a) an error estimate that is assumed to be constant (common)

across all raw scores; and (b) due to the focus on the overall test score, no model

(theory) that allows the prediction of probability of success or failure (or probability

to endorse) of a given item by an individual with a given ability (latent construct)

estimate.82 These limitations make CTT unsuitable for advanced

psychometric applications such as computer adaptive testing or test equating (more details

regarding applications of these methods are given on pages 19-21), which require

detailed information about each individual item’s performance.

IRT was first proposed by Georg Rasch in Denmark and Birnbaum in the United

States83 to overcome the limitations of CTT. It provides an alternative to CTT by

proposing a different approach in constructing new tests, modelling existing


scales/tests, interpreting the results of an assessment, and evaluating the quality of

measurement.80,84 IRT is a family of mathematical models that describe the

probability of a person answering a certain question as a function of a person’s

position on the latent trait plus one or more parameters (1 parameter to 3 parameter

logistic models are described on pages 25 to 28) characterising that particular item.85

The history of IRT can be traced back to the work of Binet (1905) and Thurstone

(1925).86 For a brief history of IRT and a review of the reason for the transition from

CTT, see Baker,87 Bock,86 and Hambleton.82
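
As a minimal sketch of the family of models described above, the function below computes the probability of endorsing an item from the person's position on the latent trait and up to three item parameters. The parameter names (theta, a, b, c) follow common IRT notation and are assumptions here; some formulations also include a scaling constant, which is omitted in this sketch.

```python
import math

def irf_3pl(theta, b, a=1.0, c=0.0):
    """Probability of endorsing an item under a 3-parameter logistic model.

    theta: person's position on the latent trait (logits)
    b: item difficulty, a: item discrimination, c: lower asymptote
    All names are illustrative assumptions, not taken from the thesis.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Compare a 1-parameter item with a 3-parameter item at three trait levels.
for theta in (-2.0, 0.0, 2.0):
    one_pl = irf_3pl(theta, b=0.0)                  # a = 1, c = 0
    three_pl = irf_3pl(theta, b=0.0, a=2.0, c=0.2)  # steeper, non-zero floor
    print(theta, round(one_pl, 3), round(three_pl, 3))
```

With the defaults (a = 1, c = 0) the function reduces to the one-parameter logistic form; setting only c = 0 gives the two-parameter form.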

1.2.3 Differences between Classical Test Theory and Item Response

Theory

Classical test theory focuses on the analysis of the total score, keeping all items in a

predetermined order to retain the reliability of a whole test, and uses the frequency of

correct responses to indicate item difficulty (see page 7 for detail).79 Compared with

classical psychometrics, IRT has several advantages.88,89 Below are some of the key

differences between classical test theory and IRT:

1. Model: In CTT, the correlation between the number of items answered

and the underlying construct is assumed to be linear. IRT assumes the

model is nonlinear. As a consequence, the mathematical equations

describing the association between a respondent’s underlying level on a

latent trait and the probability of a particular item response follow a

nonlinear monotonic function.90 The correspondence between the

predicted responses to an item and the latent trait is known as the item

characteristic curve; more detail on the item characteristic curve is given

on pages 12-15. The number of item parameters considered in an IRT

model depends on the model being used, as explained on pages 25-30.

2. Level of analysis: The advantages of IRT over CTT are specifically at the

item level.91 CTT usually focusses on the test as a whole. Internal

reliability indices (such as Cronbach alpha) are most frequently reported as

an indicator of test quality overall. Cronbach alpha indices can also be

used to indicate which items can safely be deleted without compromising

the reliability index, although this is rarely used in practice. In IRT, a

greater number of item characteristics are considered, including whether:

1) items fit with a particular model (1 parameter logistic model, 2


parameter logistic model, 3 parameter logistic model, partial credit model

(PCM), rating scale model (RSM) or other); and 2) items advantage or

disadvantage certain groups of people (differential item functioning/DIF).

DIF analysis refers to differences in the way a test item functions across

different subgroups of participants (e.g. male and female) that are matched

(equal) on the attribute measured by the test.92 Consideration of DIF is

important to establish the adequacy of each question for use in diverse

populations.93 In the context of skin cancer and sun protection behaviour,

persons of different ages, education levels, and genders who have equal

levels of sun protection should be equally likely to endorse a particular

category of a specific sun protection item. For example, males and females

who are equal in their levels of sun protection should be equally likely to

respond ‘‘yes” to the item: “Do you routinely apply sunscreen, including

moisturisers or makeup with a sun protective factor, regardless of whether

or not you are going out in the sun?”. However, the literature shows that

males and females differ in their sun protection behaviour, especially in

sunscreen use,94 and differential item functioning may be expected on

items similar to this example. During item development, care must

therefore be taken to construct well performing items, which allow people

with the same risk (in item response theory usually termed “ability” and

symbolised with the Greek letter theta, θ) to not differ according to their gender

or other characteristics not associated with the latent construct. Ideally, the

probabilities of endorsing a specified question response should be

independent of subgroup membership.95,96 To illustrate this, Figure 1.1-2

shows the item characteristic curve (explained in more detail on pages 12-

14) of two subgroups of participants. The blue line represents the reference

group (outdoor workers) and the red line represents the focal group (indoor

workers). The blue line always remains above the red line, which means

that outdoor workers always have a higher probability of endorsing

this particular item than indoor workers, regardless of

their skin cancer risk. When items function similarly across demographic

groups (do not exhibit differential item functioning), direct comparison of

group scores is justified. If an item is differentially more difficult to


endorse for an identifiable subgroup, the item may be measuring

something different from the intended construct, at least in one of the

groups. As a result, DIF statistics are used to identify potential sources of

item bias.97-99 Subsequent review by subject matter experts and bias

committees is required to determine and resolve the source of attitude or

behaviour differences. Items with high DIF either need to be changed,

dropped or split for each group.

There are two types of differential item functioning: the first is called

“uniform differential item functioning”,92 where differential item

functioning is consistent across the range of the domain being measured.

Figure 1.1-2 shows a uniform DIF where the reference group is favoured

at all levels.

Figure 1.1-2: Uniform differential item functioning

The second type of DIF is “non-uniform differential item functioning”,92

where its impact can vary at different levels of the construct being

measured. As shown in Figure 1.1-3, in a non-uniform DIF curve, the focal

group is favoured at low theta (skin cancer risk construct); however, the

reference group is favoured at high theta. In this example, males (the reference group) have a lower

probability of endorsing the item than females (the focal group) at low theta, but

a higher probability of endorsing it than

females at high theta.


Figure 1.1-3: Non-uniform differential item functioning
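
The two patterns of differential item functioning can be illustrated numerically. This is a sketch under assumed conditions: a two-parameter logistic curve is used for simplicity, and every parameter value is hypothetical rather than taken from Figures 1.1-2 and 1.1-3.

```python
import math

def p_endorse(theta, a, b):
    """Two-parameter logistic probability of endorsing an item.

    a = discrimination, b = difficulty (both hypothetical here).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Latent trait grid from -3 to 3 logits in steps of 0.5.
thetas = [x / 2 for x in range(-6, 7)]

# Uniform DIF: equal discrimination, but the item is easier (lower b)
# for the reference group, so its curve is higher at every theta.
uniform_gap = [p_endorse(t, 1.0, -0.5) - p_endorse(t, 1.0, 0.5) for t in thetas]

# Non-uniform DIF: equal difficulty but different discrimination,
# so the two curves cross at theta = b and the favoured group changes.
nonuniform_gap = [p_endorse(t, 2.0, 0.0) - p_endorse(t, 0.7, 0.0) for t in thetas]

print("uniform: reference higher everywhere:", all(g > 0 for g in uniform_gap))
print("non-uniform: gap at low theta:", round(nonuniform_gap[0], 3))
print("non-uniform: gap at high theta:", round(nonuniform_gap[-1], 3))
```

Each gap is the reference-group probability minus the focal-group probability at the same trait level. A gap that keeps one sign corresponds to uniform DIF, and a gap that changes sign corresponds to non-uniform DIF.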

3. Model assumptions: According to Hambleton & Jones,80 the CTT model

rests on a group of weak theoretical assumptions, which makes it easy to apply in

many test constructions and test utilisations.100,101 The assumptions in the

CTT are that: (a) true score and error scores are uncorrelated, (b) the

average error score in the population of examinees is zero, and (c) error

scores on parallel tests are uncorrelated.80,102 In contrast, IRT models are

referred to as strong models,103 as the underlying assumptions are strict,

and therefore less likely to be met by test data. Most applications of IRT

assume unidimensionality of the latent construct being measured, and all

models require local independence of each item.104 Unidimensionality

means that only one underlying construct is measured by the items in a

scale. Local independence means that the items are not highly correlated

with each other once the latent trait has been controlled for. In other

words, local independence is obtained when the complete latent trait space

is specified in the model.84,105 If the assumption of unidimensionality

holds, then only the underlying latent trait is influencing item responses

and local independence is obtained.
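
Local independence is often stated formally as a factorisation of the joint response probability. The notation below (U_i for the response to item i, u_i for its observed value, n for the number of items) is assumed for illustration and does not come from the text above.

```latex
P(U_1 = u_1, \ldots, U_n = u_n \mid \theta) = \prod_{i=1}^{n} P(U_i = u_i \mid \theta)
```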

4. Item-ability (response) relationship: The relationship between item and

ability (response) in CTT is not specified; however, in item response

models, this relationship must follow a specific item response function. In

CTT, the relationship between item and ability (item characteristics) might

change depending on the population to which a questionnaire is administered.80,106


Borrowing an example from educational assessment, if a high-ability

subpopulation (e.g. high-achieving students) answered a test, all items

would appear to be easy. On the other hand, when a low-ability

subpopulation (e.g. low-achieving students) is considered, the same set of

items would be classed as “difficult”. This limitation makes it difficult to

assess individuals’ abilities by using different test-forms. The terms “easy

item”, “hard item” and “item difficulty”, which are used frequently

throughout this thesis, stem from the educational assessment literature,

where IRT was developed, and are commonly used.107 “Easy items” are

items that are most likely to be answered correctly by most participants and

therefore are most useful in identifying people who have lower

abilities. Meanwhile, “difficult items” are most likely to be answered correctly

by a small number of high performing individuals and are useful in

identifying people who have high abilities. In health-related studies, easy

items are the items that are most likely to be answered in the affirmative by most

participants, and hard items are the items most likely to be answered in the affirmative only by people

with certain characteristics, for example, those with high skin cancer

risk.

5. The central concern in IRT is the relationship between the latent construct

(trait) being measured and the probabilities of respondents endorsing each

of the item’s response categories. In order to show this relationship, an

item response function can be drawn, called an item characteristic

curve.

Figure 1.1-4 below illustrates the item characteristic curves for two

dichotomous items measuring sun protection behaviours. This example

uses the 3-parameter logistic (3-PL) model item response function

discussed further on page 28. The horizontal axis is the underlying

construct being measured on a logit scale; in this example, it is assumed

that this is sun protection behaviour. A positive score means better sun

protection behaviour. In educational assessment (the origin of IRT), the vertical

axis represents the probability of answering the item correctly; however, in

other fields such as psychology and health, it represents the probability of

endorsing an item. Each plot (item characteristic curve) represents the

model’s prediction of the probability of answering “Yes” to an item about

sun protection behaviour. The figure shows that the probability of

answering yes to item 1 (use sunscreen) is higher than to item 2 (wear a

hat). Item 1 can be said to be the easier item to be endorsed, as the

probability of answering yes to the question is already high for people who

are low in their overall sun protection behaviour. In contrast, item 2 (wear

a hat) can be said to be the harder item to be endorsed, as only people with

high sun protection behaviours are likely to say yes to that item.

Figure 1.1-4: Item Characteristic Curve from 2 dichotomous items

It can also be seen that item 1 (use sunscreen) is more informative for

people with poor (low) sun protective behaviour, and item 2 (wear a hat) is

the more informative item for people with good (high) sun protective

behaviour. This information about items’ measurement location allows the

administration of items with maximum information only for particular

groups of people, and is commonly applied in computer adaptive testing,

discussed on pages 19-20 and in Study 3 (page 82).

For polytomous items such as Likert scales, the item characteristic curves

are more complex. In polytomous IRT models, the response function is

called a category response function. In Figure 1.1-5, each curved line

represents the model’s estimate of the probability of performing each

category of a given activity according to overall sun protection behaviour.

The horizontal axis is the scaled score of sun protection behaviour on a T

scale (a standardised scale with a mean of 50 and a standard deviation of 10). A


higher score means better sun protection behaviour. The vertical axis is the

probability of endorsing a category, ranging from 0 to 1.

Figure 1.1-5: Category response function of a polytomous item (item shown: “How often did you apply sunscreen during the past year?”)

Item-ability relationships can also be observed by plotting item

information functions, as shown in Figure 1.1-6, where each curved line

represents the item information function for one item. Each item provides

different amounts of information at different trait (θ) levels. For example, item 2

(pink line) provides the most information for people with θ around 1; in

contrast, item 5 provides the most information for people with θ around 2.

It is useful to include a range of items in a scale to ensure coverage of a

wide range of θ. CTT cannot provide such detailed information about the optimal

measurement location for each item.77

Figure 1.1-6: Item information functions for 8 items (horizontal axis: sun protection behaviour, from low to high)

6. Ability: In CTT, ability (θ) or test scores are often reported on the test-

score scale (or a transformed test score scale such as the T scale or stanines)

and are usually calculated by adding up the scores from each item to give an

overall total or average. Every person must answer the same items and complete

all of them in the same order to allow comparison of scores.79,80 In

contrast, in IRT, ability scores are reported as theta (θ) scores ranging

from −∞ to +∞ (or on a transformed scale). Theta scores can still be

compared, even when people answered different items, as long as those items

were calibrated on the same scale and their item information functions are

known.49,84,108,109 This leads to the next point: invariance of item and

person statistics.

7. Invariance of item and person statistics (only applies to the Rasch

model): In CTT, item and person parameters are sample dependent.80,84

This means the item difficulties are dependent on the ability of the sample

answering the questions, and the score of a person also depends on the

number of items and item difficulties of items answered by that person.

For example, as the formula to calculate item difficulty is the number of

persons answering the item correctly divided by total number of

participants,102 the same items will have different item difficulties if

administered to a group of people who like to sunbathe compared to sun


avoiders. In Rasch models, the item and person parameters are sample

independent (a person’s position on the latent trait can be estimated from any items

with known item response functions, and item characteristics are

population-independent within a linear transformation),80,84,101 if the test

data fits the model. This means that the person’s ability (score) is not

dependent on the particular item being answered (administered). A person

can answer any combination of items from the item bank and should still

receive an equivalent score.

8. Item statistics: p (item difficulty) and r (item total correlation) are usually

reported in CTT.80,102 Item difficulty is the proportion of people endorsing

an item, or the prevalence of exposure, and is dependent on the calibration

sample (see point 7 above). In item response models, b (item difficulty

parameter), a (item discrimination parameter), and c (pseudo-guessing

parameter), plus the corresponding item information functions are

reported,87,104 (as described on page 28).

9. Sample size requirements: CTT usually requires a sample of 200 to 500

to allow adequate item parameter estimation;80 however, due to the greater

number of parameters, item response models require larger samples

(generally over 500) depending on the model being used.80 More complex

models will require even larger samples to allow robust item parameter

estimation.104 This is one of the drawbacks of IRT; recruiting a large

sample can be challenging, especially in clinical research.110

10. Assumption of equal response category distance: In CTT, the distances

between successive response categories are assumed to be equal, while in

IRT the distances between successive response categories are derived from

the data.111 For instance, the distance from “Strongly Disagree” to

“Disagree” may not be the same as the distance from “Agree” to

“Strongly Agree”. Consider a four-point scale on which an individual is

asked to indicate their lifetime sunburn frequency with 1 (never), 2 (seldom), 3

(sometimes) and 4 (often) as the possible answer categories. An individual

who was never exposed long enough to get sunburn would likely indicate a

very low frequency of sunburn (e.g., a response of 1). As the individual’s

frequency and duration of sun exposure increase past a threshold value,


he/she will likely endorse the next highest category. IRT (especially rating

scale models) does not assume that these steps (thresholds) are equally

spaced. In other words, a relatively small increase in sun exposure duration

and frequency may be enough for a person to cross the threshold from

‘never’ to ‘seldom’, but a much larger increase may

be required for a person to move from ‘sometimes’ to ‘often’. IRT

specifically assesses these distances,112-114 (example displayed in Figure

1.1-5).

11. Standard Error of Measurement: In CTT, which assumes that the

standard error of measurement (SEM = s√(1 − r), where s is the standard deviation of the test scores and r is the test reliability) is constant across all

levels of ability (latent trait),106,115 people with low, medium, or high

ability are assigned the same standard error of measurement. In contrast, as

shown in Figure 1.1-7, the SEM in IRT is conditional (dependent) on a

person’s ability. The conditional SEM is the reciprocal of the square root of the test

information function, and estimates the amount of error in theta estimation

for each level of theta. Summed across the items, the conditional SEM

provides a useful index of the amount of measurement error from a test.115

Usually the conditional SEM will be high at both ends of a test score and

low in the middle area; this is because the conditional SEM will increase

in parallel to the decrease of test information. Usually both ends of the

continuum of the item information function have the least information.

Assume Figure 1.1-7 measures skin cancer risk and people can score

between -3 to 3 theta. IRT shows that the standard error of measurement of

the scale is much greater between -3 and -1 and also between +2 and +3,

but is small between -1 and +2. The test information function in Figure

1.1-7 shows that the highest level of information (or certainty in the

estimate) is for people who score between -1 and +2. The

conditional SEM provides a good summary overview for a questionnaire

developer, clearly indicating where additional items require development,

and also allows the creation of a tailored test with a pre-specified

acceptable measurement error.80,115


Figure 1.1-7: Standard error of measurement

12. Number of response categories: In CTT, responses to several Likert-like

items can only be summed providing that all of the items use the same

Likert scale (e.g., a four-point Likert scale and a six-point Likert scale

cannot be mixed to calculate an average).79,102,116 However in IRT, certain

models, as described on page 29, allow the use of different response

categories in the same scale.106 This is important, as a questionnaire

sometimes consists of items measuring the same underlying construct on

different Likert scales.

13. Detection of redundant items using item statistics: IRT models can

detect redundant items from their item parameters,83 in contrast with item

statistics of CTT, which only provide information regarding how strong

the correlation between the item and total score is, or how well the item

discriminates between people with low or high levels of the construct

being measured.79 Items that have the same item parameter location (b

parameter) on the latent trait continuum can be interpreted as measuring

the same level of latent construct and seen as redundant. In developing an

item bank, these redundant items can be useful as alternative items in

parallel forms or to enhance the items available for CAT.


14. Test administration: Most currently used health-assessment

questionnaires are based on CTT.117 Therefore, all items or questions must

be administered to every person tested in order to retain the validity of the

scale. Missing items or responses will be an issue when determining the

total score of an individual.118 An advantage of IRT over CTT is the ability

to administer the test using computer adaptive testing (CAT). This mode of

test administration enables shortening of the test or use of a tailored test

(where individuals may receive different items targeted to their specific

health risk level).119,120 CAT-based item administration results in shorter

assessments without the trade-off of losing measurement precision.121,122

However, a fully functioning CAT requires a large, calibrated item

bank,123 and is considered costly to develop.

Another important component related to test administration using CAT

methods is the stopping rule. One of the most common stopping rules used is

the minimum SEM required.124 As discussed earlier, in IRT the SEM can

vary across different trait levels.125 Figure 1.1-8 illustrates the use of a

conditional SEM as one of the stopping rules when using CAT for

diagnostic purposes. The red line represents the cut-off score, the black

squares represent the participant’s latent trait estimate, and black vertical

lines represent the confidence intervals around these measurements. For

example, the first item in a CAT measuring skin cancer risk is the

following: “Thinking about ALL of the times when you were outside in the

sun during the past year, how often did you apply sunscreen?” Assuming

that this person answers the first question by selecting the “Always”

category, this will result in their estimated skin cancer risk being

calculated as below average. As the person answers more and more

questions, the confidence interval around the point estimate for this

person’s skin cancer risk gets smaller as the risk estimate becomes more

precise. Around item 30, the confidence interval is clearly below the cut-

off score, but further questions are still being presented to cover all

content. In CAT, other stopping rules could be selected, for example,

desired confidence interval, minimum length or maximum length of

testing, or a time limit.126 In many operational CAT programs,


a combination of two or more stopping (test termination) rules is used,

usually a minimum standard error criterion and maximum test length. The

maximum test length serves to ensure that the entire item bank is not

administered.124

Figure 1.1-8: Illustration of computer adaptive testing

15. Test linking and equating: If two or more questionnaires measure the

same concepts, it can be advantageous to link or equate them.127,128 The

objective of test linking is to establish a common reporting metric between

two or more tests that allows for the prediction of success (or endorsement, for

non-cognitive items) on construct-linked items.129,130 Examples of recent

linking studies in patient reported outcome research are: 1. Linking

between the physical and mental health scores on the Veterans RAND 12-

Item Health Survey and the Patient Reported Outcomes Measurement

Information System Global Health scores,131 and 2. Linking between NIH

Patient Reported Outcomes Measurement Information System Physical

Function item bank and the Short Form-36 physical function ten-item PF

scale.132 Only traditional equating methods (linear linking and

equipercentile equating) can be performed under a CTT approach. This

can be a limitation, as these methods require that the same person

completes both tests.133 In contrast, non-linear linking and equating based on item response

models does not require the person to complete both tests; it only requires

some common items (as anchor items for calibration) in each of the tests to

be completed.128 Although this method is frequently used in educational

assessment, to date no study in skin cancer research has applied this

method.

Figure 1.1-9 demonstrates how test linking with a single-group design in

skin cancer research could be performed; three scales that measure aspects

of the same underlying latent construct (e.g. sun exposure behaviours) are

completed by participants.128 Using IRT approaches, these scales can be

linked and placed on a common scale, as long as they have some

common items. This means once these linking functions are established,

scores from one scale can be converted to another.

Figure 1.1-9: Sample of test linking
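
The anchor-item idea described for Figure 1.1-9 can be illustrated with a minimal sketch. It assumes the Rasch model, under which two separately calibrated scales differ only by a shift in origin, and it uses the mean/mean method on the anchor items. The difficulty values and function names are hypothetical and are not taken from the studies in this thesis.

```python
from statistics import mean

def linking_constant(anchor_b_ref, anchor_b_new):
    """Shift that maps the new scale onto the reference scale (Rasch, mean/mean)."""
    return mean(anchor_b_ref) - mean(anchor_b_new)

def to_reference_scale(values_new, shift):
    """Apply the shift to item difficulties or person thetas from the new scale."""
    return [v + shift for v in values_new]

# Hypothetical difficulties (logits) of the same three anchor items,
# estimated separately in the calibration of scale A and of scale B.
anchor_b_scale_a = [-0.50, 0.20, 1.10]
anchor_b_scale_b = [-0.90, -0.20, 0.70]

shift = linking_constant(anchor_b_scale_a, anchor_b_scale_b)
unique_b_scale_b = [-1.40, 0.30]          # items that appear only in scale B
linked_b = to_reference_scale(unique_b_scale_b, shift)
print(round(shift, 2), [round(b, 2) for b in linked_b])
```

Once the shift is known, any difficulty or theta estimate from scale B can be expressed on the scale A metric. Models with a discrimination parameter would also need a slope adjustment, which this sketch omits.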


Table 1.2 below provides a summary of methodological distinctions between CTT

and IRT.80

Table 1.2: Differences between Classical and Item Response Theories

Area | Classical test theory | Item response theory
1. Model | Linear | Non-linear
2. Level | Test | Item
3. Assumptions | Weak (i.e., easy to meet with test data) | Strong (i.e., difficult to meet with test data)
4. Item-ability relationship | Not specified | Item characteristic function
5. Ability | Test scores or estimated true scores are reported on the test-score scale (or a transformed test-score scale) | Ability scores are reported on the scale −∞ to +∞ (or a transformed scale)
6. Invariance of item and person statistics | No: item and person parameters are sample dependent | Yes: item and person parameters are sample independent, if the model fits the test data
7. Item statistics | p (item difficulty) and the item discrimination index (item-total correlation) | b, a, and c (for the three-parameter model) plus corresponding item information functions*
8. Sample size (for item parameter estimation) | 200 to 500 (in general) | Depends on the IRT model, but larger samples (i.e., over 500) are generally needed
9. Assumptions of item/response category distance | Equivalent | Not equivalent; true interval-level measurement
10. Standard error of measurement (SEM) | Constant across ability | Conditional on person ability
11. Number of response categories | Same response categories | Possible to use different response categories in the same scale
12. Detection of redundant items using item statistics | Not possible | Possible
13. Test/scale administration | All items must be administered | Possible to administer a shorter or tailored test (computer adaptive test)
14. Test linking and equating | Traditional equating methods: linear linking and equipercentile equating | Non-linear equating: IRT linking and IRT equating

*b = item difficulty, a = item discrimination, c = pseudo-guessing


Given this summary it could be beneficial to apply an IRT approach for the

evaluation of psychometric properties of questionnaires used in skin cancer research.

In addition, there would be merit to developing a scale measuring skin cancer risk

(phenotype, sun exposure behaviours, and sun protection behaviours) using modern

test theory. By using IRT models, a precise measurement with small standard errors

could be achieved, requiring fewer items to be completed, thus significantly reducing

the participants’ burden and the potential sample size required in future studies.134-136
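
How a tailored test can reach a pre-specified precision with few items is sketched below. This is a minimal illustration under stated assumptions, not the CAT used in Study 3: it assumes a Rasch-calibrated bank, selects the unused item whose difficulty is closest to the current estimate (where Rasch information is highest), estimates theta by expected a posteriori (EAP) with a standard normal prior, and stops on a target SEM or a maximum length. The bank, the simulated respondent, and the threshold values are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Probability of endorsing an item of difficulty b (Rasch model)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap(responses, grid):
    """EAP theta estimate and posterior SD, standard normal prior (an assumption)."""
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)                  # unnormalised N(0, 1) prior
        for b, x in responses:
            p = rasch_p(t, b)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    est = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - est) ** 2 * w for t, w in zip(grid, weights)) / total
    return est, math.sqrt(var)

def run_cat(bank, answer, target_sem=0.6, max_items=20):
    """Administer items until the SEM target or the maximum length is reached."""
    grid = [g / 10.0 for g in range(-40, 41)]
    remaining, responses = list(bank), []
    theta, sem = 0.0, 1.0                           # prior mean and SD before any item
    while remaining and len(responses) < max_items and sem > target_sem:
        # Rasch information p(1 - p) peaks where b is closest to theta.
        b = min(remaining, key=lambda d: abs(d - theta))
        remaining.remove(b)
        responses.append((b, answer(b)))
        theta, sem = eap(responses, grid)
    return theta, sem, len(responses)

# Hypothetical bank of 41 difficulties from -3 to +3 logits, and a deterministic
# respondent who endorses every item easier than a true level of 1.0.
bank = [round(-3 + 0.15 * k, 2) for k in range(41)]
theta_hat, sem_hat, n_used = run_cat(bank, lambda b: 1 if b < 1.0 else 0)
print(round(theta_hat, 2), round(sem_hat, 2), n_used)
```

The two stopping conditions in the loop correspond to a minimum standard error criterion combined with a maximum test length.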

1.2.4 Test Development using Classical Test Theory and Item

Response Theory

Test development under CTT and under IRT shares some steps and differs in others.

Table 1.3 displays the typical steps in test development. Important differences

between test development using classical and item response measurement theories

occur at steps 3, 5, and 9.80

Table 1.3: Steps in test development

Steps in test development

Step 1 Preparation of test specifications

Step 2 Preparation of the item pool

Step 3* Field testing the items

Step 4 Revision of the test items

Step 5* Test development

Step 6 Pilot testing

Step 7 Final test development

Step 8 Test administration (for norming and technical data)

Step 9* Technical analyses (e.g., compiling norms, standard setting, and equating scores)

Step 10 Preparation of administrative instructions and technical manual

Step 11 Printing and distribution of tests and manuals

In step 3, test developers applying CTT are concerned about the representativeness of

the overall population for whom the test is intended.137,138 They use simple

mathematical techniques, a moderate sample size and heterogeneous samples to


achieve higher estimates of item discrimination indices as measured by biserial or

point-biserial correlation coefficients.79 In contrast, the test developer applying IRT

requires complex mathematical techniques and large sample sizes.

In step 5, under CTT, items are selected for the test based on two indices: item difficulty

(prevalence of exposure) and item discrimination. An item with too high or too low

an item difficulty is considered a poor item and must be removed from the test to

maximise discrimination among all test takers. Meanwhile, in IRT, items are

selected based on goodness-of-fit criteria to detect those that do not fit the specified

response model. Test developers can determine the contribution of each test item to

the test information function independently of other items in the test. This means

that test developers can create multiple forms of a test to maximise test information

targeted at specific regions of the latent construct.81,109,139 Items at either extreme of the

latent construct may still be valuable, even if they are relevant to only a few people,

because they help to discriminate among those people.

In step 9, a test developer using CTT will typically compile norms based on specific

demographic information, such as gender and age. A person’s score must be

compared against the performance results of a selected group of participants who

have already taken the test.79,140 In contrast, in IRT, a person’s score is usually

interpreted with regard to their level of proficiency (in achievement tests) or severity (in

health-related tests) and the cut scores corresponding to those levels. This process is

known as standard setting.141,142

In summary, the major differences in test development using CTT and IRT are in

item calibration, selection, and scoring processes.80

Popular models in item response theory

More than ten IRT models143 have been defined, as shown in Table 1.4;

however, only a few models are frequently implemented in applied research. Those

models are the Rasch model/1-parameter logistic model, 2-parameter logistic model,

3-parameter logistic model, rating scale model, and partial credit model. Each of

those frequently used models is discussed in the following paragraphs.


Table 1.4: Taxonomy of IRT Models

Dichotomous Data | Polytomous Data
1-Parameter Logistic Model / Rasch model (1-PLM) | Rating Scale Model (RSM)
2-Parameter Logistic Model / Birnbaum model (2-PLM) | Partial Credit Model (PCM)
3-Parameter Logistic Model (3-PLM) | Generalised Partial Credit Model (G-PCM)
4-Parameter Logistic Model (4-PLM) | One-Parameter Logistic Model for polytomous items (OPLM-po)
One-Parameter Logistic Model with Imputed Slopes (OPLM) | Rating Scale version of the Graded Response Model (RS-GRM)
1-Parameter Normal Ogive Model (1-PNOM) | Graded Response Model (GRM)
2-Parameter Normal Ogive Model (2-PNOM) | Model of Monotone Homogeneity for polytomous items (MHM-po)
3-Parameter Normal Ogive Model (3-PNOM) | Weak Double Monotonicity Model (WEAK DMM)
Model of Monotone Homogeneity for Dichotomous Data (MHM-di) | Strong Double Monotonicity Model (STRONG DMM)
Model of Double Monotonicity for Dichotomous Data (DMM-di) | Isotonic Ordinal Probabilistic Model (ISOP)

1.2.5 1-Parameter Logistic (1-PL) Model or The Rasch Model

The simplest, and one of the most widely used IRT models is the 1-parameter logistic

model (1-PL model). It is also called the Rasch model. Although the Rasch model

was derived from an initial Poisson model and differs conceptually from the one-

parameter logistic model, for most practical purposes these models are identical. In

the 1950s, the Danish mathematician Georg Rasch developed the model for reading

tests and for intelligence and achievement tests. It is called a one-parameter

model because it is concerned with only

a single item parameter, the item difficulty (b) parameter. Under the Rasch model,

guessing is assumed to be negligible and discrimination is assumed to be constant across items. It predicts the

probability of a response to an item based on the interaction between item difficulty

and individual ability.144 The 1-PL model can be expressed by Equation 1 below.144

Equation 1

$$P_i(\theta) = \frac{e^{(\theta - b_i)}}{1 + e^{(\theta - b_i)}}$$

where $P_i(\theta)$ is the probability of an individual with a given ability theta ($\theta$) correctly answering (or endorsing) a particular item $i$ with difficulty level $b_i$ (b parameter), and $e$ is the base of the natural logarithm. This probability represents the interaction between the person’s ability and the item difficulty. An item (question) difficulty (threshold) represents the position in logits that the item occupies on the linear skin cancer risk scale.145 It can be displayed visually as an item characteristic curve (also called an item response function or trace line), as explained previously. Figure 1.1-10 presents an example of item characteristic curves for the 1-PL model. The four curves in the figure represent four items with different difficulties. It can be seen that, for a person with a given ability in the range -3 to +3, the probability of getting the answer correct is determined only by each item’s difficulty or location on the latent construct continuum (b1 = -2, b2 = -1, b3 = 1, and b4 = 2).

Figure 1.1-10: One-Parameter Item Characteristic Curves for Four Typical Items
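
As a minimal sketch, Equation 1 can be evaluated directly for the four item difficulties listed for Figure 1.1-10. The function name and the theta values chosen for printing are illustrative assumptions.

```python
import math

def rasch_probability(theta, b):
    """Equation 1: probability of endorsing an item of difficulty b at ability theta."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

difficulties = [-2.0, -1.0, 1.0, 2.0]      # the four items of Figure 1.1-10
for theta in (-3.0, 0.0, 3.0):
    row = [round(rasch_probability(theta, b), 2) for b in difficulties]
    print(theta, row)
```

At any fixed theta the probabilities fall as difficulty rises, and each item has a probability of exactly 0.5 where theta equals its difficulty, which is why the curves in a 1-PL plot are parallel and differ only in location.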

1.2.6 2-Parameter Logistic (2-PL) Model

An extension of the 1-PL model is the two-parameter logistic (2-PL) model.

This model was originally developed by Lord146 based on the normal ogive function

(the curve of a cumulative normal distribution function). Birnbaum147 then

proposed a logistic function as a simpler alternative, because calculation using

the normal ogive function was considered too computationally demanding for the

computers of that era (1960s).

In addition to item difficulty (b parameter), this model incorporates the item

discrimination parameter (usually denoted a) as a second item parameter. Item

discrimination (a parameter) is represented by the slope of the item characteristics


curve. The usual range for item discrimination parameters is (0, 2). High values of the a parameter result in very “steep” item characteristic curves, and such items are more discriminating than items with flatter curves. The equation for the 2-PL model144 is

Equation 2

$$P_i(\theta) = \frac{e^{D a_i (\theta - b_i)}}{1 + e^{D a_i (\theta - b_i)}}$$

where $P_i(\theta)$ and $b_i$ are defined as in Equation 1. $D$ is a constant scaling factor that makes the logistic function as close as possible to the normal ogive function: the difference in $P_i(\theta)$ between the two-parameter normal ogive function and the 2-parameter logistic function is less than 0.01 when $D = 1.7$. The additional element in the two-parameter model is $a_i$, the item discrimination parameter (slope). The item characteristic curves in Figure 1.1-11 facilitate understanding of the 2-PL model.

Figure 1.1-11: 2-Parameter Item Characteristic Curves for Four Typical Items

Figure 1.1-11 shows that, unlike the item characteristic curves of the 1-PL model, which all have the same slope (usually fixed to 1; see Figure 1.1-10), the curves in the 2-PL model have different slopes.

Figure 1.1-11 presents four items with different item discriminations. For item 1, b1 = 1.0 and a1 = 1.0; for item 2, b2 = 1.0 and a2 = 0.5; for item 3, b3 = -1.0 and a3 = 1.5; for item 4, b4 = 0 and a4 = 1.3. Items 1 and 2 have the same item difficulty (b = 1) but differ in item discrimination (a1 = 1.0 and a2 = 0.5); thus, item 1 has higher discrimination than item 2. In the 2-PL model, the probability that a person with a given ability level gives the right answer is determined by the item difficulty (b parameter) and the item discrimination (a parameter) simultaneously.
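
The role of the scaling constant D can be checked numerically. The sketch below evaluates the 2-PL logistic function and a two-parameter normal ogive for item 1 of Figure 1.1-11 over a grid of theta values and records the largest gap between them. The function names and the grid are illustrative assumptions; the normal ogive is written with the standard normal cumulative distribution function via `math.erf`.

```python
import math

D = 1.7  # scaling constant from Equation 2

def two_pl(theta, a, b):
    """2-PL logistic item response function with the D scaling constant."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def normal_ogive(theta, a, b):
    """2-parameter normal ogive: standard normal CDF evaluated at a * (theta - b)."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))

# Item 1 of Figure 1.1-11 (a = 1.0, b = 1.0), evaluated on a fine grid of theta values.
grid = [t / 100.0 for t in range(-600, 601)]
max_gap = max(abs(two_pl(t, 1.0, 1.0) - normal_ogive(t, 1.0, 1.0)) for t in grid)
print(max_gap < 0.01)
```

The same functions also show the effect of the a parameter: at a theta above the item difficulty, the item with the larger a has the higher probability of endorsement.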

1.2.7 3-Parameter Logistic (3-PL) Model

The three-parameter logistic (3-PL) model extends the 2-PL model by adding a third item parameter, a guessing parameter (also known as the pseudo-chance parameter, usually denoted c). The c parameter is the probability of endorsing the item for a person with “zero symptoms”; it is the lower asymptote of the item characteristic curve as θ approaches −∞ on the horizontal axis. The 3-PL model144 can be expressed by the following equation.

Equation 3

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{e^{D a_i (\theta - b_i)}}{1 + e^{D a_i (\theta - b_i)}}$$

where $P_i(\theta)$, $b_i$, $a_i$ and $D$ are defined as for the 2-PL model and $c_i$ (pseudo-chance) is the third item parameter. Figure 1.1-12 shows example item characteristic curves for the 3-PL model. For item 1, b1 = 1.0, a1 = 1.8 and c1 = 0.1; for item 2, b2 = -1.5, a2 = 1.8 and c2 = 0.5; for item 3, b3 = 1.0, a3 = 1.8 and c3 = 0.3; for item 4, b4 = 2.0, a4 = 1.8 and c4 = 0.1. Items 1 and 4 have the same item discrimination and pseudo-chance parameters (a = 1.8 and c = 0.1) but differ in item difficulty (b1 = 1.0 and b4 = 2.0); meanwhile, items 1 and 3 have the same item difficulty (b = 1.0) and item discrimination (a = 1.8) but differ in the pseudo-chance parameter (c1 = 0.1 and c3 = 0.3). It can be seen that the probability of getting these two items right by guessing is different for a person with “zero ability”.


Figure 1.1-12: 3-Parameter Item Characteristic Curves for Four Typical Items
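
The effect of the c parameter can be seen by evaluating Equation 3 at a very low trait level. The sketch below uses the parameters given for items 1 and 3 of Figure 1.1-12; the function name and the choice of θ = -6 as a stand-in for a very low trait level are illustrative assumptions.

```python
import math

def three_pl(theta, a, b, c, D=1.7):
    """Equation 3: the 2-PL curve rescaled to start from the lower asymptote c."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (1.0 - c) * logistic

# Items 1 and 3 of Figure 1.1-12 share a and b but differ in c.
item1 = dict(a=1.8, b=1.0, c=0.1)
item3 = dict(a=1.8, b=1.0, c=0.3)

low = -6.0   # a very low trait level, standing in for theta approaching minus infinity
print(round(three_pl(low, **item1), 3), round(three_pl(low, **item3), 3))
```

At the item difficulty itself the probability is halfway between c and 1, rather than 0.5 as in the 1-PL and 2-PL models.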

1.2.8 Partial Credit Model

The partial credit model (PCM) [148] is an extension of the dichotomous Rasch

model to polytomous data (items with more than two response categories). To

illustrate this, consider an example item used in Study 5: “What was your natural hair

colour at the age 18 years?” It would be expected that people with the highest level of

skin cancer risk would get a score of 3 and people with the lowest level of skin cancer

risk would get a score of 0. The scoring for this item is: 3 for red hair, 2 for blonde

hair, 1 for brown hair, and 0 for black hair. The general equation for the partial credit

model is [113]:

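Equation 4

$$P_{ig}(\theta) = \frac{e^{\sum_{g=0}^{l} (\theta - b_{ig})}}{\sum_{h=0}^{m} e^{\sum_{g=0}^{h} (\theta - b_{ig})}}$$

where $P_{ig}$ is the probability of responding in a specific item category under the partial credit model, $b_{ig}$ is the location parameter of category boundary $g$ for item $i$, $l$ indexes the category being modelled, and $m + 1$ is the number of response categories.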
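
A minimal sketch of how category probabilities follow from Equation 4 is given below. It uses the common convention that the term for category 0 is zero, so that m category boundaries give m + 1 categories. The boundary values and the function name are hypothetical and are not estimates from Study 5.

```python
import math

def pcm_probabilities(theta, boundaries):
    """Category probabilities for one item under the partial credit model.

    boundaries holds the category boundary locations b_i1 ... b_im; the term
    for category 0 is defined as zero, so m boundaries give m + 1 categories.
    """
    cumulative = [0.0]                               # sum for category 0
    for b in boundaries:
        cumulative.append(cumulative[-1] + (theta - b))
    numerators = [math.exp(s) for s in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical boundaries for a four-category item scored 0 to 3.
boundaries = [-1.0, 0.5, 2.0]
for theta in (-2.0, 0.0, 3.0):
    print(theta, [round(p, 2) for p in pcm_probabilities(theta, boundaries)])
```

The probabilities sum to 1 at every theta, the most likely category moves from 0 to 3 as theta increases, and two adjacent categories are equally likely where theta equals the boundary between them.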


1.2.9 Rating Scale Model

The rating scale model (RSM)148 is a restricted version of the partial credit model,

where the distances between adjacent step difficulties are the same across all items.

The general equation for the rating scale model 113 is:

Equation 5

$$P_{ig}(\theta) = \frac{e^{\sum_{g=0}^{l} [\theta - (b_i + \tau_g)]}}{\sum_{h=0}^{m} e^{\sum_{g=0}^{h} [\theta - (b_i + \tau_g)]}}$$

where h = 0, 1, …, g, …, m, and g represents the specific category being modelled from among the m + 1 categories; $b_i$ is the item location parameter estimated for each individual item in a scale; and $\tau_g$ are the threshold parameters that define the boundaries between the categories of the rating scale. The $\tau_g$ are estimated once for the entire set of items. Likert-type questionnaires are commonly scored using the RSM. All items in the Skin Self-Examination Attitude Scale have five ordered response categories: strongly disagree, disagree, neither agree nor disagree, agree, and strongly agree. The lowest category is scored 0 and the highest category is scored 4 (the number of categories minus 1). For example, for the item from Study 1 taken from the Skin Self-Examination Attitude Scale,149 “Checking my skin regularly is a priority for me”, a participant who answers strongly agree receives a score of 4. RSMs assume that the distances between response categories are the same across all items, although the categories need not be equally spaced. RSMs are commonly used for questionnaires with Likert-type scales.64

CHOOSING A MODEL

As described above, there are many IRT models, and the investigator is faced with the decision of which one to use. Future research beyond this thesis will explore other IRT models on the current items, as suggested by the examiner. One determining factor is the type of item. For example, an item with two categories, such as yes/no, would usually be analysed using a dichotomous model. For dichotomous items, the data can be analysed using the 1-PL, 2-PL, or 3-PL IRT model. Researchers decide which model to use based on a prior assumption regarding whether the items have more than one parameter to be estimated. If there is a high likelihood of guessing, then it is appropriate to use the 3-PL model; however, this would seem unnecessary in the context of skin cancer risk, where guessing is unlikely. The 3-PL model is appropriate only in situations where multiple-choice items are used, and most health applications do not require guessing parameters.

For items with more than two answer categories, polytomous models (the rating scale model, partial credit model, or others) are more appropriate (see Table 1.4 for a summary of available IRT models). In educational assessment, the 3-PL model is the most common choice for multiple-choice items, as it is reasonable to assume that a person without the required knowledge will have a non-zero probability of choosing the correct answer through chance alone.150 However, the Rasch model is often chosen because it has some desirable mathematical properties that cannot be obtained with other IRT models (e.g., invariance and sufficiency of the raw score).151 In the Rasch model, the raw (observed) score is a sufficient statistic for ability (θ). This means that all individuals with the same raw score on the same set of items will have the same estimated latent score (θ). In analysing any data, the principle of parsimony152 should be followed: a simple model that can be estimated accurately usually produces better results than a complex model (with more parameters) that is estimated poorly.153 The main advantage of the Rasch model is its parsimony, which allows models to converge successfully with smaller samples than the 2-PL or 3-PL models require.154
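
The sufficiency of the raw score can be illustrated with a short sketch. It assumes a set of already calibrated Rasch item difficulties (hypothetical values here) and estimates theta by maximum likelihood using Newton-Raphson steps, which is one of several possible estimators. Two different response patterns with the same raw score are compared.

```python
import math

def rasch_mle(responses, difficulties, iterations=50):
    """Maximum-likelihood theta for a 0/1 response pattern (Newton-Raphson).

    Assumes a mixed pattern: the estimate is not finite when all responses
    are 0 or all are 1.
    """
    theta = 0.0
    raw_score = sum(responses)
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        gradient = raw_score - sum(probs)            # data enter only via raw_score
        information = sum(p * (1.0 - p) for p in probs)
        theta += gradient / information
    return theta

difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]           # hypothetical calibrated items
pattern_a = [1, 1, 1, 0, 0]                          # raw score 3
pattern_b = [0, 1, 0, 1, 1]                          # raw score 3, different items
theta_a = rasch_mle(pattern_a, difficulties)
theta_b = rasch_mle(pattern_b, difficulties)
print(round(theta_a, 3), round(theta_b, 3))
```

The likelihood equation depends on the responses only through their sum, so both patterns yield the same estimate. This property does not hold in the 2-PL or 3-PL models, where the pattern of endorsed items also matters.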

For this PhD program of research, various unidimensional Rasch models were used,

including the dichotomous model (Study 3), rating scale model (Studies 1 and 3), and

partial credit model (Studies 2, 3, 4, and 5) to investigate how each model can be

applied to different datasets. The selection of the models depended on the available

item characteristics in each study.

LIMITATIONS OF ITEM RESPONSE THEORY

Although there are many advantages of using IRT in analysing items or developing a

new scale, there are some limitations. First, the models rest on a number of restrictive

assumptions144 that are difficult to meet. Second, large sample

sizes (both in terms of items and respondents) are required during scale development.


There is no exact guideline on sample size requirements for IRT analysis, but samples

over 500 are generally recommended.80

Third, the lack of training and user-friendly computer programs to perform IRT

analysis is a barrier,112 although nowadays this is becoming less of an issue, because

some IRT software packages are now starting to provide user-friendly graphical

interfaces with point and click capability.155,156 Fourth, IRT has been developed and

used extensively in large-scale educational assessment, but the

method is less commonly used in the development and assessment of health-related

scales and measures, especially in skin cancer-related research. Many researchers are

therefore not aware of the advantages and are not using the IRT approach as an

alternative to CTT. Considering those factors, the purpose of this thesis is to

introduce IRT methods to researchers in epidemiology and public

health by demonstrating their application in skin cancer research.

PURPOSE OF THIS DOCTORAL WORK

The overarching purpose of this thesis was to:

1. Demonstrate the applications of IRT to skin cancer-related measurement

by examining the psychometric properties of existing skin-cancer-related

questionnaires. To achieve this goal, access to several large datasets was

negotiated with the Centre of Research Excellence in Sun and Health

(CRESH) investigators (from QUT’s Skin Awareness Study, Cancer

Council Queensland’s Melanoma Screening trial, and Melanoma Case-

Control Study, QUT and Australian National University’s AusD study,

and QIMR Berghofer’s QSkin study). Secondary analyses of the

questionnaire data from all of these studies were conducted, and the data were calibrated

using a range of IRT models, including the dichotomous model, rating

scale model, and partial credit model (see Figure 1.1-13). Items with good

psychometric properties were integrated, retained, or modified and used in

the final study (Study 5).

2. Develop a new measure of skin cancer risk utilising IRT.

The final study’s (Study 5) objective was to develop a set of measures with

good psychometric and IRT properties to measure: a) phenotype, b) sun


exposure behaviour, and c) sun protection behaviour. The significance of

this approach is described on page 34.

Figure 1.1-13: Overview of current doctoral work

RESEARCH QUESTIONS

The present study aimed to address the following research questions related to the

measurement of skin cancer risk:

1. Is it possible to calibrate existing attitude and behaviour scales related to

risk of skin cancer using item response theory approaches? (Studies 1, 3

and 4)

2. Can item response theory analysis be used to show the potential impact of

self-reported behaviour change on skin cancer risk due to concerns

regarding vitamin D? (Study 2)

3. Can item response theory reduce participants’ burden when measuring

skin cancer risk by using computer adaptive tests? (Study 3)

4. How much efficiency does a computer adaptive test offer compared to

non-adaptive testing? (Study 3)

5. Can item response theory calibrated skin cancer risk scales predict

future development of non-melanoma skin cancer? (Study 4)


6. Is it possible to develop a skin cancer risk scale (SCRS) using an approach

grounded in item response theory that integrates all indicators of skin

cancer risk? (Study 5)

7. To what extent does the newly developed ‘Skin Cancer Risk Score’ fit the

Rasch model, and how well can it predict self-reported skin cancer

history? (Study 5).

SIGNIFICANCE OF THE THESIS

The research conducted and outcomes of this thesis have the potential to make

several important contributions to scholarship and practice within the field of skin

cancer research, public health, and psychometrics. The significance of the thesis is

summarised in the following points:

1. The IRT based skin cancer risk scale developed through this PhD provides

an interval scaled measure that facilitates interpretation of skin cancer risk

and comparisons among people. People’s skin cancer risk can currently be

ordered based on various indicators (e.g. eye colour). People with blue

eyes have higher skin cancer risk compared to those with brown eyes;

however, their exact position on the skin cancer risk continuum is not yet

known. This will be more complicated if more indicators are added, for

example: Does a person with blue eyes, wearing sunscreen, and working

indoors have a higher risk compared to a person with brown eyes, who

never wears sunscreen, and works outdoors? Calibrating these items on a

common IRT scale allows the combination of these various indicators and

creates one single score for skin cancer risk, allowing further comparison

of the differences in skin cancer risk levels between persons and items.

2. An IRT scale provides a sample distribution-free and item distribution-free

measure.82,157 In other words, there is no need for any specific reference

norm to provide a person’s percentile.79,158 This is in contrast to CTT,

where a score must be interpreted with regard to the sample calibration

(obtained from a normative sample) to become meaningful. In IRT, both

persons and skin cancer risk indicators (items) can be directly located on

the common skin cancer risk scale,49,87,159 making it easy to compare


people’s risk across different skin cancer risk indicators, as long as the

indicator is calibrated on the scale.

3. A thoroughly developed scale that is in line with, and fits, the Rasch model

makes it possible to construct an overall skin cancer risk indicator that

summarises a person’s skin cancer risk in different components (e.g., hair

colour, eye colour, skin colour, etc.). If the overall skin cancer risk

indicator works well, a person’s overall skin cancer risk levels could be

calibrated on the common scale, even if he/she answers only some of the

items from the whole scale (see Study 3 for more detail). This is important

in order to reduce the participants’ burden of completing long skin cancer

related questionnaires in the future.

4. This study was the first to develop and validate a comprehensive skin

cancer risk scale for Australian adults using an IRT approach. Skin cancer

risk assessment of this population is invaluable to understand the

contribution that sun exposure and sun protection behaviours make in

addition to phenotype, in order for preventive strategies to be

implemented. By applying the latest methods in a psychometrics approach,

this new measure is expected to improve the credibility of skin cancer

research findings in future studies.

From a practice standpoint, the results of this research will be directly relevant to

Australian public health by providing a measure that allows more accurate

estimation, with known measurement quality and standard errors, of people’s

phenotype, sun exposure behaviour, and sun protection behaviour. In so doing,

people may be better informed about their personal risk and may take precautionary

action to better protect themselves from the sun.

THESIS OUTLINE

The rest of this dissertation is organised around a collection of five studies (four

already published and one currently under peer-review). The first research study is

“Evaluation of a skin self-examination attitude scale using an IRT model approach.”

The purpose of this study is to illustrate applications of IRT in analysing survey data

in a public health setting, especially in skin cancer research. General assumptions and


characteristics of IRT models, such as unidimensionality and item fit, are also

discussed. This study was published in Health and Quality of Life Outcomes.

The second research study, “Changes in self-reported sun-protection behaviours due

to concern about vitamin D status,” studies behaviour changes due to vitamin D

concern that may increase future skin cancer risk. The study examines a

cross-sectional survey, conducted across the four seasons (2009-10) at latitudes ranging from

19-43°S, that assessed vitamin D attitudes and changes in sun protection behaviours out of

vitamin D concern. IRT was used to illustrate the potential effect of changing sun-

protection behaviour due to concern about vitamin D. This study was published in

Photochemistry and Photobiology.

The third research study, “Advantages of Mobile Computer-Adaptive Testing (CAT)

to Quickly Estimate Skin Cancer Risk,” studied the use of CAT to reduce response

burden in skin cancer risk assessment. The study compared the efficiency of

non-adaptive testing and computer adaptive testing facilitated by a partial credit model

derived calibration. This study was published in the Journal of Medical Internet

Research.

The fourth research study, “Diagnostic Discrimination of Skin Cancer Risk Scale”

examined the psychometric properties of an existing skin cancer risk questionnaire and

assessed the scale’s ability to predict prospective skin cancer. The partial credit IRT

model was used in this study. The results were presented as a full paper at the

International Outcome Measurement Conference (Chicago, April 2015).

While the first four studies applied IRT to existing scales, the last used IRT to select

the items expected to measure skin cancer risk well and administered them

to an original sample of participants. The final (fifth) research study, “Development

and Psychometrics Evaluation of Skin Cancer Risk Scale Utilising IRT”, constructed

a skin cancer risk scale utilising a modern test theory approach. The study combined

the best questions from existing skin cancer questionnaires and calibrated them using

a partial credit IRT model to create an underlying construct of skin cancer risk. A

draft manuscript has been prepared.

Chapter 2: Evaluation of Skin Self-examination Attitude Scale Using an Item Response Theory Model Approach

Evaluation of a skin self-examination attitude scale using an item response

theory model approach

This chapter includes a peer-reviewed journal article published in Health and Quality

of Life Outcomes. This article evaluates the psychometric properties of the Skin

Self-Examination Attitude Scale, a brief measure that allows for the assessment of

attitudes in relation to skin self-examination. A Rating Scale Model was applied to

the data.

Djaja, N., Youl, P., Aitken, J., & Janda, M. (2014). Evaluation of a

skin self-examination attitude scale using an item response theory

model approach. Health and Quality of Life Outcomes, 12(1), 189.

doi:10.1186/s12955-014-0189-x


ABSTRACT

Introduction: The Skin Self-Examination Attitude Scale (SSEAS) is a brief measure

that allows for the assessment of attitudes in relation to skin self-examination. This

study evaluated the psychometric properties of the SSEAS using Item Response

Theory (IRT) methods in a large sample of men ≥ 50 years in Queensland, Australia.

Methods: A sample of 831 men (420 intervention and 411 control) completed a

telephone assessment at the 13-month follow-up of a randomised-controlled trial of a

video-based intervention to improve skin self-examination (SSE) behaviour.

Descriptive statistics (mean, standard deviation, item–total correlations, and

Cronbach’s alpha) were compiled and difficulty parameters were computed with

Winsteps using the polytomous Rasch Rating Scale Model (RRSM). An item person

(Wright) map of the SSEAS was examined for content coverage and item targeting.

Results: The SSEAS has good psychometric properties, including good internal

consistency (Cronbach’s alpha = 0.80) and fit with the model, and no evidence of

differential item functioning (DIF) due to experimental trial grouping was detected.

Conclusions: The present study confirms the SSEA scale as a brief, useful and

reliable tool for assessing attitudes towards skin self-examination in a population of

men 50 years or older in Queensland, Australia. The 8-item scale shows

unidimensionality, allowing levels of SSE attitude, and the item difficulties, to be

ranked on a single continuous scale. In terms of clinical practice, it is very important

to assess skin cancer self-examination attitude to identify people who may need a

more extensive intervention to allow early detection of skin cancer.

Keywords: Skin cancer, Skin self-examination, Attitude scale, Item response theory,

Rating scale, Rasch model



INTRODUCTION

Melanoma is the fourth most common cancer among men and women in Australia.

Men aged 50 years or older are more likely than other groups to be diagnosed with

thick melanomas and have the highest mortality [1]. Skin self-examination (SSE)

has been shown to increase the detection of thin melanoma [2-4]. A case-control

study in the United States found a 60% reduced risk of melanoma mortality (OR

0.37; 95% CI = 0.16-0.84) in people who examined their own skin [4]. While the US

Preventive Services Task Force currently does not recommend population-based

screening for skin cancer due to the absence of randomised trials investigating the

mortality benefit of screening [5], the American Cancer Society does recommend

that adults perform SSE monthly [6] and Australian Cancer Councils suggest SSE at

three-monthly intervals [7]. SSE may be one method of identifying suspicious skin

lesions early, particularly given that patients are more likely to detect their own

melanomas [8]. A large case-control study conducted in Queensland, Australia

found that melanomas detected during deliberate SSE were thinner than those found

incidentally [9]. As about half of all melanomas occur on parts of the

body that are difficult to see (especially the back) [10], it has been suggested that

whole-body SSE is necessary to optimise melanoma detection rate [11].

While melanoma incidence and mortality are highest in men 50 years or older, this

group is less likely to detect their own melanomas and less likely to undergo

whole-body clinical skin examination than other population groups [12,13],

both of which could contribute to their higher melanoma mortality rates. The

increased risk of thick melanoma in this group may be due, at least in part, to low

awareness and uptake of early detection behaviours, including SSE.

Several aspects of SSE are under-researched, and few studies have measured factors

which may contribute to whether or not people conduct SSE. One study by Manne

and Lessin [14], who developed a 17-item SSE benefits and barriers scale, found

only barriers (but no benefits) were associated with SSE performance in melanoma

survivors. The authors suggested that melanoma survivors rely strongly on their

doctors’ recommendation, minimising the impact of their personal attitudes, and that

further assessment among the general population is needed. Swetter et al [15] found

that SSE awareness (defined as having heard about the ABCD rule, reading about



skin cancer detection, and requesting information about skin cancer detection from

doctor) of female spouses of men with melanoma was significantly higher than that

of the men themselves.

We previously used several attitude or outcome expectation items within a large

study of melanoma screening, and found that positive attitudes were strongly

associated with intention to conduct SSE in the future [16]. However, the

psychometric qualities of the measure as a whole have not been assessed.

Measurement of subjective and latent constructs like SSE attitudes requires

rigorously developed and tested instruments in order to obtain data of the highest

possible quality. While in the past questionnaire quality including reliability and

validity was often assessed using classical psychometric approaches, increasingly the

advantages of item response theory (IRT) methods, including allowing more precise

estimates, assessment of unidimensionality, adaptive testing and assessment of

differential item functioning have been recognised. IRT methods are now applied to

measurement tools across a wide variety of health outcomes [17-21]. It was the aim

of this study to evaluate the measurement properties and unidimensionality of the

SSE attitude scale using a Rasch modelling approach.

METHODS

To examine measurement properties of the SSE attitudes scale we used data

collected from the Skin Awareness study [22]. The primary aim of that study was to

examine the impact of a video-delivered intervention with two mailed reminder

postcards compared to a written-materials-only control group on the prevalence of

SSE in men aged 50 years or older. The primary hypothesis was that the prevalence

of SSE in the video intervention group would increase by at least 10% more than in

the control. A 10% increase was determined as the minimal change deemed to be

clinically significant. Approval for this study was obtained from the Queensland

University of Technology ethics committee, and the trial was registered with the

Australian New Zealand Clinical Trials Registry (ANZCTR N12608000384358).

Trial methods and baseline participant characteristics as well as primary and

secondary outcomes have previously been reported in detail [22-24].



Study population

In total, 5000 potential participants (men aged 50 or older) were randomly selected

from the Australian electoral roll (enrolling to vote is compulsory in Australia), of

which 2899 potential participants with a valid telephone number were contacted by

mail. The study pack included a letter of invitation and a colored brochure featuring a

well-known sports and TV personality, with follow-up of non-respondents via one

postal reminder and up to two follow-up phone calls. Men who were too ill, could

not speak English, or had a previous history of melanoma were excluded. The overall

consent rate was 37% (969 of

2610 eligible); however, 39 men withdrew before the study began, leaving a final

sample of 930 men who were randomised to the control or intervention condition.

Men completed telephone interviews at baseline, at 7 and 13 months after receiving

either the video intervention or written brochures only control package.

For the present analysis, we used data from 831 men who completed the 13-month

assessment time point. Similar to factor analysis, where a minimal sample size of 10

is required per item by convention, a minimum sample size of 250 is generally

requested for analyses such as those conducted here [25].

Skin self-examination attitude scale

The skin self-examination attitude scale (SSEAS) was developed, and previously used, in

a large community-based pilot trial of skin cancer screening [16], and was modified

for the Skin Awareness study to include items measuring SSE outcome expectancy

and planning for future SSE. The SSEAS comprises 10 items, each answered on a

five-point Likert scale (strongly disagree, disagree, unsure, agree, and

strongly agree; all items are listed in Table 2.1). The total score of the SSEAS can vary

between 0 and 40, where 0 indicates low and 40 high SSE attitudes. Good reliability

for the scale was found when assessing its internal consistency (Cronbach’s alpha =

0.80).
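The scoring and reliability calculation described above can be sketched in Python as follows. This is an illustration only, not the authors' code. The 0 to 4 coding of each response is inferred from the stated 0 to 40 total for 10 items, and the function names and example data are hypothetical.

```python
from statistics import variance


def sseas_total(item_scores):
    """Total SSEAS score, assuming 10 items each coded 0 (strongly disagree) to 4 (strongly agree)."""
    if len(item_scores) != 10:
        raise ValueError("The SSEAS has 10 items")
    if any(score not in (0, 1, 2, 3, 4) for score in item_scores):
        raise ValueError("Each item must be coded 0-4")
    return sum(item_scores)


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical responses from four people to three items, for illustration only.
    example = [[4, 3, 4], [2, 2, 3], [1, 0, 1], [3, 3, 2]]
    print(cronbach_alpha(example))
    print(sseas_total([4, 3, 4, 4, 3, 2, 4, 3, 3, 4]))
```

Alpha summarises internal consistency for the sample as a whole; unlike the Rasch analysis that follows, it gives no item-level information on difficulty or fit.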

Data analysis

To test the measurement quality of the SSEAS beyond classical test theory, item

response theory (IRT) modelling was applied. In brief, an IRT model describes the

relationship between an individual’s ability and an item’s difficulty, and models this as



a probabilistic function. Specifically, raw data from a rating scale are converted to an

“equal interval scale” in logits (log-odds units), reflecting the item difficulty and

individual’s ability [26,27]. Data were analysed using the Winsteps Rasch

Measurement software [28]. To analyse the SSEAS, with 5 answer options per item, the

polytomous Rasch Rating Scale Model (RRSM) was used.
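As a rough illustration of what the Rating Scale Model computes, the sketch below evaluates the category probabilities for one person and one item using Andrich's formulation, in which all items share a single set of thresholds. It is a minimal sketch rather than a description of how Winsteps estimates the model; the function name and the threshold values are hypothetical.

```python
import math


def rsm_category_probabilities(theta, delta, thresholds):
    """Category probabilities under the Rasch Rating Scale Model.

    theta: person measure (logits); delta: item difficulty (logits);
    thresholds: step parameters tau_1..tau_M shared by all items.
    Returns M + 1 probabilities for categories 0..M.
    """
    # Cumulative sums of (theta - delta - tau_j); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta - tau))
    peak = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


if __name__ == "__main__":
    # Hypothetical thresholds for a five-category item (four steps).
    taus = [-1.5, -0.5, 0.5, 1.5]
    for theta in (-1.0, 0.0, 1.71):
        probs = rsm_category_probabilities(theta, delta=0.0, thresholds=taus)
        print(theta, [round(p, 3) for p in probs])
```

As the person measure rises relative to the item difficulty, probability mass moves towards the higher response categories.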

The following data quality parameters were assessed:

Dimensionality analysis

We assessed whether the data derived from the men’s answers fitted the Rasch model

in order to assess unidimensionality of the underlying trait. To assess the fit of the

data to the Rasch model, item difficulty and fit statistics were calculated for each

item.

Item difficulty

The difficulty of each SSEA item is its location, in logits, on the SSEA scale when SSEA is

expressed as a unidimensional continuum. For polytomous scales including the

SSEAS, this is the point at which each answer category has a 50% probability of

being endorsed. Winsteps ranks the items in a hierarchical order based on their item

difficulty. The item at the top has high item difficulty and thus is difficult for people

to endorse; the item at the bottom of the rank is an easy-to-endorse item. Item

difficulty is calculated in logits and placed on a linear interval continuum. The higher

the logit, the higher the level of SSEA needed to endorse the item.

Item fit statistics

To determine item fit statistics, infit and outfit mean square (MNSQ) statistics were

calculated, which specify how well each item fits the Rasch model. Infit and outfit

MNSQ values should range from 0.6 to 1.4 [29]. These fit statistics represent the

difference between expected responses and observed responses. An item fits the

model perfectly if it has an MNSQ of 1.0. Values less than

1.0 (overfit) show that the model predicts the data too well, causing summary statistics

(e.g., reliability) to be inflated. Values greater than 1.0

(underfit) show unmodelled noise (there is another source of variance in the data),

which degrades measurement.



The infit and outfit MNSQ represent the unstandardised degree of fit of the observed

data to the responses expected under the Rasch model. While the infit MNSQ is

sensitive to unexpected patterns, the outfit MNSQ statistic is more sensitive to

outliers.
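The two mean-square statistics can be written down compactly. The following Python sketch uses the standard definitions (outfit as the unweighted mean of squared standardised residuals, infit as the information-weighted version) and assumes that the model-expected score and model variance of each response are already available; the function names and the example numbers are hypothetical.

```python
def fit_mean_squares(observed, expected, variances):
    """Return (infit, outfit) mean squares for one item.

    observed:  observed responses, one per person
    expected:  model-expected responses for the same persons
    variances: model variances of those responses
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: plain average of squared standardised residuals (outlier-sensitive).
    outfit = sum(s / w for s, w in zip(squared_residuals, variances)) / len(observed)
    # Infit: residuals weighted by their information (the model variance).
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


def within_fit_range(infit, outfit, low=0.6, high=1.4):
    """Apply the 0.6 to 1.4 acceptance range used in this study."""
    return low <= infit <= high and low <= outfit <= high


if __name__ == "__main__":
    # Hypothetical dichotomous example: the last response is highly unexpected.
    obs = [1, 1, 0, 1]
    exp = [0.5, 0.5, 0.5, 0.05]
    var = [p * (1 - p) for p in exp]
    print(fit_mean_squares(obs, exp, var))
    print(within_fit_range(0.75, 0.67))  # values reported for SSE_10 in Table 2.1
```

In the example, the single unexpected response inflates the outfit far more than the infit, which is the behaviour described above.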

Differential item functioning (DIF)

DIF was assessed to examine if the intervention condition had an effect on the

hierarchy of item difficulties. The Rasch model assumes the hierarchy of the items to be the

same across groups: it should work uniformly, irrespective of groups, in our case, for

men in the intervention and control groups. For example, if the items are invariant

across groups, the item with the lowest difficulty on the SSEA continuum for the

intervention group also has the lowest difficulty for the control group. Instead of

calculating the item difficulties for the whole group, in DIF analysis they are now

calculated separately (per group). The current study used a multi-step method of

initially flagging items for potential DIF using the Mantel chi-square statistic,

followed by confirmation of DIF with two other tests (Standardised Liu-Agresti

Cumulative Common Log-Odds Ratio (LOR Z) and Standardised Cox’s

Noncentrality Parameter (COX Z)). All MH-based statistics were computed using

DIFAS 5.0 [30].

RESULTS

SSEAS data were available from 831 participants: 411 (49.5%) control group

participants with a mean SSEAS score of 4.1 (SD 0.49) and 420 (50.5%) intervention

group participants with a mean SSEAS score of 4.1 (SD 0.50).

Unidimensionality

The Rasch analysis showed good reliability. Item reliability (replicability of item

placements along the scale) was 0.98 and person reliability was 0.68. Individual item

difficulty levels ranged from -0.58 to 0.54 logits, with a mean ± standard deviation

(SD) of 0 ± 0.41, whereas person measures had a mean ± SD of 1.71 ± 1.40,

indicating that the items did not adequately target the SSEA levels of this sample.

Results of the unidimensionality analysis are shown in Table 2.1.



Item difficulty

Item difficulty estimates found that the easiest item to endorse for the participants

was item SSE_1 (-0.58): “It is important to check my skin for skin cancer even if I

have no symptoms” while the most difficult item to endorse was item SSE_3 (0.54):

“Checking my skin regularly is a priority for me”.

Items SSE_3 and SSE_9 had almost the same item difficulty (0.54 and 0.53 logits,

respectively), suggesting overlap between the items and thus redundancy.

In addition, items SSE_4 (0.23) and SSE_8 (0.18) measure a similar level of

SSEA, as evidenced by a separation distance of only 0.05 logits.

We also assessed the spread of item difficulty using the item-person map (Wright

map) displayed in Figure 2.1. This map indicates both the distribution of

participants’ SSEA propensity scores, and item difficulty levels. Both the items and

responses are displayed on a logit scale; respondents with the same SSEA propensity

scores as the item difficulty have a 50% chance of endorsing the item. The left-hand

side of Figure 2.1 shows the distribution of respondents’ levels of SSEA: people with

higher SSEA are placed in higher positions and people with lower SSEA are

placed in lower positions. The right-hand side shows the distribution of item

calibrations: items reflecting a higher SSEA level are placed in higher positions and

items reflecting a lower SSEA level are placed in lower positions.

M is the mean value (by default, the mean of the item difficulties is set to 0), while S

labels one standard deviation and T labels two standard deviations of the item and

person distributions. The map shows that the participants’ mean SSEA was

1.71 logits above the items’ mean, implying that participants have a high level of

SSEA.



Figure 2.1: Wright map/item person map of Skin Self-Examination Attitude Scale

with the mean theta of person on the left and mean theta of items on the right.

Content coverage and item targeting

A ceiling effect was evident in the results displayed in Figure 2.1, with many

participants located in the upper part of the map, and few items located in the

corresponding level. The SSEA of this sample was higher than that reflected in the

items. The mean of item measures was more than 1 standard deviation lower than the

mean of person measures, which indicates that all items were easily endorsed by this

sample, and additional items with greater difficulty are needed to complement the

scale.
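The size of this mis-targeting can be checked with a short calculation on the figures reported above. The Python sketch below is illustrative only. It expresses the gap in person standard deviations, which is an assumption because the text does not say which standard deviation is meant, and it treats each item as dichotomous when computing an endorsement probability, which simplifies the five-category items actually used.

```python
import math

# Values reported in the Results section (logits).
PERSON_MEAN, PERSON_SD = 1.71, 1.40
ITEM_MEAN = 0.0
EASIEST_ITEM, HARDEST_ITEM = -0.58, 0.54  # SSE_1 and SSE_3


def targeting_gap(person_mean, item_mean, person_sd):
    """Distance between person and item means, in person standard deviations."""
    return (person_mean - item_mean) / person_sd


def endorsement_probability(theta, difficulty):
    """Dichotomous Rasch probability that a person at theta endorses an item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


if __name__ == "__main__":
    print(targeting_gap(PERSON_MEAN, ITEM_MEAN, PERSON_SD))
    print(endorsement_probability(PERSON_MEAN, HARDEST_ITEM))
    print(endorsement_probability(PERSON_MEAN, EASIEST_ITEM))
```

Under these simplifications an average respondent is more likely than not to endorse even the hardest item, which is consistent with the ceiling effect seen in Figure 2.1.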



Table 2.1: Item total correlation, fit statistics and item difficulty for the 10-item Skin Self-Examination Attitude Scale

Item | Statement | Item total correlation | Infit mean square | Outfit mean square | Item difficulty (SE)
SSE_1 | It is important to check my skin for skin cancer even if I have no symptoms | 0.457 | 0.92 | 1.03 | -0.58 (0.07)
SSE_2* | I think checking my skin would make me anxious | 0.081 | - | - | -
SSE_3 | Checking my skin regularly is a priority for me | 0.526 | 1.05 | 1.25 | 0.54 (0.05)
SSE_4 | I think I could find something suspicious on my skin if it was there | 0.495 | 0.99 | 1.06 | 0.23 (0.06)
SSE_5 | If I saw something suspicious on my skin, I'd go to the doctor straight away | 0.446 | 1.03 | 1.06 | -0.36 (0.06)
SSE_6 | I am confident in a doctor's ability to diagnose skin cancer | 0.373 | 1.20 | 1.32 | -0.07 (0.06)
SSE_7** | I have made plans on when to examine my own skin | 0.461 | - | - | -
SSE_8 | I am confident that I can take up examining my own skin again even if I have not looked at my skin in the past few months | 0.579 | 0.82 | 0.81 | 0.18 (0.06)
SSE_9 | I am able to keep examining my own skin regularly, even if I have no one to help me | 0.474 | 1.03 | 1.34 | 0.53 (0.06)
SSE_10 | If I regularly examine my skin, then I am helping to look after my own health | 0.582 | 0.75 | 0.67 | -0.46 (0.07)

*Item was removed due to low item total correlation

**Item was removed during calibration due to fit statistics beyond acceptable range

Item fit statistics

After an iterative process of calibration, all items of the SSEAS except SSE_2: “I

think checking my skin would make me anxious” and SSE_7: “I have made plans on

when to examine my own skin” were found to have MNSQ infit and

outfit values within the recommended range of 0.6 to 1.4 [26] (Table 2.1); these two items did

not meet the fit criteria. SSE_2 also did not contribute to the measurement of a

unidimensional construct.



Differential item functioning (DIF) assessment

The eight-item SSEA scale was used to assess DIF by group condition (DVD

intervention and control). The results of the DIF analysis are presented in Table 2.2 and

reveal that none of the eight items showed DIF according to participants’ group

condition.

Table 2.2: DIF statistics for the 8-item skin self-examination attitude scale

Item | Mantel¹ | LOR Z² | COX Z²
SSE_1 | 0.870 | 0.954 | 0.931
SSE_3 | 0.268 | 0.524 | 0.519
SSE_4 | 0.719 | -0.842 | -0.849
SSE_5 | 0.177 | -0.422 | -0.419
SSE_6 | 0.238 | -0.485 | -0.491
SSE_8 | 0.063 | 0.253 | 0.250
SSE_9 | 0.814 | -0.904 | -0.903
SSE_10 | 1.540 | 1.238 | 1.240

¹ The critical value of this statistic is 3.84 for a Type I error rate of 0.05.

² A value greater than 2.0 or less than -2.0 may be considered evidence of the presence of DIF.
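The decision rule stated in the footnotes to Table 2.2 can be applied mechanically, as in the Python sketch below. The statistics are copied from Table 2.2; the function name is hypothetical, and the way the two confirmatory tests are combined (an item flagged by the Mantel statistic is confirmed if either Z statistic exceeds its cut-off) is an assumption, since the text does not specify this.

```python
# (Mantel chi-square, LOR Z, COX Z) for each retained item, from Table 2.2.
DIF_STATISTICS = {
    "SSE_1": (0.870, 0.954, 0.931),
    "SSE_3": (0.268, 0.524, 0.519),
    "SSE_4": (0.719, -0.842, -0.849),
    "SSE_5": (0.177, -0.422, -0.419),
    "SSE_6": (0.238, -0.485, -0.491),
    "SSE_8": (0.063, 0.253, 0.250),
    "SSE_9": (0.814, -0.904, -0.903),
    "SSE_10": (1.540, 1.238, 1.240),
}


def shows_dif(mantel, lor_z, cox_z, chi_square_critical=3.84, z_critical=2.0):
    """Two-step rule: flag on the Mantel statistic, then confirm with LOR Z or COX Z.

    Combining the confirmatory tests with 'or' is an assumption of this sketch.
    """
    flagged = mantel > chi_square_critical
    confirmed = abs(lor_z) > z_critical or abs(cox_z) > z_critical
    return flagged and confirmed


if __name__ == "__main__":
    for item, statistics in DIF_STATISTICS.items():
        print(item, shows_dif(*statistics))
```

Applied to the tabulated values, the rule flags no item, in agreement with the result reported above.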

DISCUSSION

Regular monthly or 3-monthly SSE is currently recommended by a number of cancer

control agencies, particularly for those at high risk such as older men who carry the

greatest burden of skin cancer. SSE could improve skin awareness and

prompt rapid clinical skin examination. In combination, this has the potential to reduce the

physical burden, including mortality, caused by late diagnosis of melanoma [31,32].

Studies have shown that melanomas detected during a deliberate SSE rather than

found accidentally are thinner [2,33]. Attitudes towards SSE form an important

component in explaining the likelihood of conducting an SSE [34].

IRT has been used widely in evaluating education and health measures [19,35,36].

The current study used IRT analysis to further assess the psychometric properties of

the SSEAS. Data were analysed using the Rasch Rating Scale Model [37], which has

ideal metric properties for ranking an individual’s ability (the level of the attribute

measured) along with the item difficulty on a common scale. This Rasch model

allows for the comparison of individuals regardless of items used in the measurement



[38]. It also enables the generation of a joint measurement (common scale) of items

and people, provided that the data fit the model’s requirements.

In this study, the overall fit statistics and reliabilities of the SSEAS were satisfactory.

However, the spread of item difficulty was not satisfactory, with most items located

on the lower end of the scale. This means the SSEAS will give more accurate

information for individuals who have a low skin self-examination attitude. Two items

(item SSE_2 and item SSE_7) did not perform as expected and were removed to

achieve better fit to the Rasch model expectations. This suggested that those two

items may be measuring a different domain of SSE. Item SSE_2: "I think checking

my skin would make me anxious" was suspected to measure anxiety rather than

attitude. Item SSE_7: "I have made plans on when to examine my own skin"

probably measures the planning aspect of SSE.

This item could form a separate scale once additional planning items that address the

specific aspects of optimal SSE performance (such as having a partner to help, or

having available a full-size and hand-held mirror, or good lighting) have been added.

The distribution of the SSEAS items reflects a wide range of individual differences,

with the average level of this trait in the current sample being higher than the average

difficulty level of the items. The difficulty level of the 8 items reflected a narrow

range of levels of skin self-examination attitude among men ≥ 50 years, thus not

allowing for the optimal discrimination of more positive attitude in this sample.

The DIF analysis according to study group showed that the 8 items

on the SSEAS functioned consistently, and were equally difficult, for both

intervention and control groups. The items were sufficiently robust to allow for the

assessment of SSE attitudes regardless of the participant’s group. Thus, the answers

only quantified the individual’s level of SSE attitude, which was measured according

to the difficulty of the items and not because of other constructs explained by the

participant’s subgroup.

The present study has some limitations. People with a high risk of skin cancer may

feel social pressure to report higher SSEA than others, and this may have resulted in

a positive reporting bias in our SSEA scores. Although there is no objective measure

of SSE, adding a social desirability scale such as the Marlowe-Crowne social

desirability scale [39] in future studies could allow assessment of the SSEA scale



against this criterion. The addition of more high SSEA items to extend the difficulty

range of the measure may also help to improve the scale. Finally, our sample

consisted entirely of men aged 50 years or older. Future research should examine

whether these results also hold for the broader population, including samples from

other states in Australia, women and younger age groups.

CONCLUSION

Overall, the present study confirms the SSEA scale as a brief, useful and reliable tool

for assessing attitudes towards skin self-examination in a population of men 50 years

or older in Queensland, Australia. The 8-item scale shows unidimensionality,

allowing levels of SSE attitude, and the item difficulties, to be ranked on a single

continuous scale. In terms of clinical utility, the skin awareness scale can identify

people who may need a more extensive intervention. Clinicians can encourage these

people to perform skin self-examination regularly, looking for any abnormal growths or

unusual changes, so that they have a better chance of a cure.

Competing interests

The authors declare they have no competing interests.

Authors’ contributions

ND drafted the original manuscript and performed the data analysis and interpretation. PY, JA and MJ

were involved in the conception and design of the study and the acquisition of data. All

authors were involved in the review of draft manuscripts and read and approved a

final version prior to submission.



REFERENCES

1. Geller AC, Swetter SM, Brooks K, Demierre MF, Yaroch AL. Screening,

early detection, and trends for melanoma: current status (2000-2006) and

future directions. J Am Acad Dermatol. 2007; 57:555–572.

2. Berwick M, Begg CB, Fine JA, Roush GC, Barnhill RL. Screening for

cutaneous melanoma by skin self-examination. J Natl Cancer Inst. 1996;

88:17–23.

3. Carli P, De Giorgi V, Palli D, et al. Dermatologist detection and skin self-

examination are associated with thinner melanomas: results from a survey of

the Italian Multidisciplinary Group on Melanoma. Arch Dermatol. 2003;

139(5):607–612.

4. Berwick M, Armstrong BK, Ben-Porat L, et al. Sun exposure and mortality

from melanoma. J Natl Cancer Inst. 2005; 97:195–199.

5. United States Preventive Services Task Force. Screening for skin cancer:

Recommendations and rationale. Am J Prev Med. 2001; 20(3 Suppl):44–46.

6. Skin Cancer Prevention and Early Detection.

[http://www.cancer.org/cancer/cancercauses/sunanduvexposure/skincancerpreventionandearlydetection/skin-cancer-prevention-and-early-detection-toc]

7. National Cancer Prevention Policy. Ultraviolet radiation.

[http://wiki.cancer.org.au/policy/UV/Effective_interventions/Melanoma_screening]

8. Baade PD, Balanda KP, Lowe JB. Changes in skin protection behaviors,

attitudes, and sunburn: in a population with the highest incidence of skin

cancer in the world. Cancer Detect Prev. 1995; 20:566–575.

9. Baade PD, Youl PH, English DR, Mark Elwood J, Aitken JF. Clinical

pathways to diagnose melanoma: a population-based study. Melanoma Res.

2007; 17:243–249.

10. Youl PH, Janda M, Aitken JF, Del Mar CB, Whiteman DC, Baade PD.

Body-site distribution of skin cancer, pre-malignant and common benign pigmented

lesions excised in general practice. Br J Dermatol. 2011; 165:35–43.

11. Weinstock MA, Martin RA, Risica PM, et al. Thorough skin examination for

the early detection of melanoma. Am J Prev Med. 1999; 17:169–175.

12. Janda M, Youl PH, Lowe JB, et al. What motivates men age > or =50 years to

participate in a screening program for melanoma? Cancer. 2006; 107:815–

823.

13. Kasparian NA, McLoone JK, Meiser B. Skin cancer-related prevention and

screening behaviors: a review of the literature. J Behav Med. 2009; 32:406–

428.

14. Manne S, Lessin S. Prevalence and correlates of sun protection and skin

self-examination practices among cutaneous malignant melanoma survivors. J

Behav Med. 2006; 29:419–434.



15. Swetter SM, Layton CJ, Johnson TM, Brooks KR, Miller DR, Geller AC.

Gender differences in melanoma awareness and detection practices between

middle-aged and older men with melanoma and their female spouses. Arch

Dermatol. 2009; 145:488–490.

16. Janda M, Youl PH, Lowe JB, Elwood M, Ring IT, Aitken JF. Attitudes and

intentions in relation to skin checks for early signs of skin cancer. Prev Med.

2004; 39:11–18.

17. Velozo CA, Lai JS, Mallinson T, Hauselman E. Maintaining instrument

quality while reducing items: application of Rasch analysis to a self-report of

visual function. J Outcome Meas. 2000; 4:667–680.

18. Hawthorne G, Densley K, Pallant JF, Mortimer D, Segal L. Deriving utility

scores from the SF-36 health instrument using Rasch analysis. Qual Life Res.

2008; 17:1183–1193.

19. Belvedere SL, de Morton NA. Application of Rasch analysis in health care is

increasing and is applied for variable reasons in mobility instruments. J Clin

Epidemiol. 2010; 63:1287–1297.

20. Franchignoni F, Salaffi F, Giordano A, Carotti M, Ciapetti A, Ottonello M.

Rasch analysis of the 22 knee injury and osteoarthritis outcome

score–physical function items in Italian patients with knee osteoarthritis. Arch Phys

Med Rehabil. 2013; 94:480–487.

21. Cook CE, Richardson JK, Pietrobon R, Braga L, Silva HM, Turner D.

Validation of the NHANES ADL scale in a sample of patients with report of

cervical pain: factor analysis, item response theory analysis, and line item

validity. Disabil Rehabil. 2006; 28(15):929–935.

22. Janda M, Baade PD, Youl PH, et al. The skin awareness study: promoting

thorough skin self-examination for skin cancer among men 50 years or older.

Contemp Clin Trials. 2010; 31:119–130.

23. Auster J, Neale R, Youl P, et al. Characteristics of men aged 50 years or older

who do not take up skin self-examination following an educational

intervention. Journal of the American Academy of Dermatology. 2012;

67:e57–e58.

24. Janda M, Neale RE, Youl P, Whiteman DC, Gordon L, Baade PD. Impact of

a video-based intervention to improve the prevalence of skin self-

examination in men 50 years or older: the randomized skin awareness trial.

Arch Dermatol. 2011; 147:799–806.

25. Linacre JM. Sample size and item calibration stability. Rasch Meas Trans.

1994; 7:328.

26. Bond TG, Fox CM. Applying the Rasch Model: Fundamental Measurement

in the Human Sciences. New York: Routledge; 2012.

27. Fox CM, Jones JA. Uses of Rasch modelling in counselling psychology

research. J Couns Psychol. 1998; 45:30.



28. Linacre J. Winsteps Rasch Model Computer Program. Version 3.69.1.16.

2010.

29. Wright BD, Linacre JM, Gustafson J, Martin-Lof P. Reasonable mean-square

fit values. Rasch Meas Trans. 1994; 8:370.

30. Penfield RD. DIFAS 5.0 - Differential Item Functioning Analysis System.

2012.

31. Kelly JW. Melanoma in the elderly–a neglected public health challenge. Med

J Aust. 1998; 169:403.

32. Pollitt RA, Geller AC, Brooks DR, Johnson TM, Park ER, Swetter SM.

Efficacy of skin self-examination practices for early melanoma detection.

Cancer Epidemiol Biomarkers Prev. 2009; 18:3018–3023.

33. McPherson M, Elwood M, English DR, Baade PD, Youl PH, Aitken JF.

Presentation and detection of invasive melanoma in a high-risk population.

Journal of the American Academy of Dermatology. 2006; 54:783–792.

34. Auster J, Hurst C, Neale RE, et al. Determinants of uptake of whole-body

skin self-examination in older men. Behav Med. 2013; 39:36–43.

35. Lim SM, Rodger S, Brown T. Using Rasch analysis to establish the construct

validity of rehabilitation assessment tools. Int J Ther Rehabil. 2009; 16:251–260.

36. Chen HF, Lin KC, Wu CY, Chen CL. Rasch validation and predictive

validity of the action research arm test in patients receiving stroke

rehabilitation. Arch Phys Med Rehabil. 2012; 93:1039–1045.

37. Andrich D. A rating formulation for ordered response categories.

Psychometrika. 1978; 43:561–573.

38. Andrich D. Rasch Models for Measurement. Thousand Oaks: Sage; 1988.

39. Fischer DG, Fick C. Measuring social desirability: short forms of the

Marlowe-Crowne social desirability scale. Educ Psychol Meas. 1993;

53:417–424.

Chapter 3: Self-reported Changes in Sun-Protection Behaviours at Different Latitudes in Australia

Self-Reported Changes in Sun-Protection Behaviours at different latitudes in

Australia

This chapter includes a peer-reviewed journal article published in Photochemistry

and Photobiology. This article investigates attitudes toward vitamin D and changes

in sun-protection behaviours due to concern about adequate vitamin D among people

living at four different latitudes with very different UV radiation exposure levels in

Australia.

Djaja, N., Janda, M., Lucas, R. M., Harrison, S. L., van der Mei, I.,

Ebeling, P. R., Neale, R. E., Whiteman, D. C., Nowak, M., and

Kimlin, M. G. (2016). Self-Reported Changes in Sun-Protection

Behaviours at different latitudes in Australia. Photochemistry and

Photobiology. doi:10.1111/php.12582


ABSTRACT

Sun exposure is the most important source of vitamin D, but is also a risk factor for

skin cancer. This study investigated attitudes toward vitamin D, and changes in sun

exposure behaviour due to concern about adequate vitamin D. Participants (n=1,002)

were recruited from four regions of Australia and completed self- and interviewer-

administered surveys. Chi-square tests were used to assess associations between

participants’ latitude of residence, vitamin D-related attitudes and changes in sun

exposure behaviours during the last summer. Multivariate logistic regression

analyses were used to model the association between attitudes and behaviours.

Overall, people who worried about their vitamin D status were more likely to have

altered sun protection and spent more time in the sun than people not concerned about

vitamin D. Concern about vitamin D was also more common with increasing

latitude. Use of novel Item Response Theory analysis highlighted the potential

impact of self-reported behaviour change on skin cancer predisposition due to concern

about vitamin D. This cross-sectional study shows that the strongest determinants of

self-reported sun-protection behaviour changes due to concerns about vitamin D were

attitudes and location, with people at higher latitudes worrying more.

Keywords: self-report; sun exposure; sun-protection behaviours; vitamin D; item

response theory; rasch; population-based study


INTRODUCTION

Exposure to ultraviolet (UV) radiation from the sun causes about 90% of the global

skin cancer burden (1-3). The International Agency for Research on Cancer

summarised the most recent evidence for the carcinogenicity of solar radiation.

While there are some differences in the patterns and timing of exposure that give rise

to different types of skin cancer, overall, greater sun exposure significantly increases

skin cancer risk (4). Therefore, minimising sun exposure or protecting the skin when

outdoors by using clothing, shade and sunscreen is recommended when the UV

Index is ≥ 3 (5).

Vitamin D is synthesised when the skin is exposed to sunlight, or is consumed in

vitamin D-containing foods (naturally or fortified) or supplements (6). Research

indicates that vitamin D deficiency may increase the risks not only of diseases of

bone, but may also contribute to a wide range of other adverse outcomes such as

cancer and immune-modulated diseases (7-10). This has led to interest in defining

the optimal level of vitamin D and determining how to best achieve such a level (11,

12). To overcome concerns that sun-protection practices may lead to vitamin D

deficiency, safe durations of unprotected sun exposure at different latitudes of

Australia have been proposed, and sun protection is not recommended when

the UV Index drops below 3 (13). Exactly how much sun exposure is required to

achieve sufficient levels of vitamin D is contentious, as there is little consensus on

the level considered ‘sufficient’ and vitamin D synthesis varies according to location,

time of year, time of day, weather, and personal factors such as skin type and body

mass index (14). In Australia, sun protection is currently not recommended during late

autumn and winter in those parts of the country where the UV Index is below 3

(5, 15). During these times, to support vitamin D production it is

recommended that people are outdoors in the middle of the day with some skin

uncovered on most days of the week. Being physically active while outdoors will

further assist with vitamin D levels.

Consequently, health promotion messages for sun protection have become

complicated, with different messages conveyed for different latitudes, seasons, times

of day and skin types. People are confused about when they need to protect

themselves, how much time they can spend outside, and how to balance the risk of

skin cancer versus that of vitamin D deficiency (16-19). Reflecting these concerns


there has been a huge increase in vitamin D testing in Australia, with costs to the

health care system rising from AUS$3.2 million in 2003 to AUS$143 million in 2013

(20-23). It is unknown, however, whether changes in sun-exposure behaviour are

more common in people who are concerned about achieving optimal vitamin D, and

whether this depends on where they live. The challenge now is finding the best way

to balance the risks and benefits of sun exposure and how to communicate this to the

general public (24).

Previous studies investigating knowledge, attitudes and behaviours related to

vitamin D and sun-protection have been limited by small sample sizes (16, 25-27) or

a focus on specific populations (28, 29). In this paper, we used Item Response

Theory (IRT) to assess the potential impact that behaviour change due to concern about

vitamin D may have on skin cancer predisposition. IRT (modern test theory) offers

many advantages compared to classical test theory. It offers mathematical modelling

that specifies the probability of selecting each questionnaire item’s response option

as a function of the target latent trait (in our case skin cancer predisposition) being

measured. It therefore allows economical and precise assessment of the

characteristics under study and highlights specific targets for personalised

intervention. IRT is increasingly used in health research; examples include assessing

activity for post-acute care (30), and measures of physical functioning, health status,

and adolescent health risk behaviour (31-33). IRT allows computation of health

measures on an interval measurement scale (rather than ordinal scores provided by

most classical test theory-constructed health scales) and exploration of the

performance of each individual item rather than the scale as a whole. IRT

encompasses any mathematical model which attempts to predict observations from

locations on a latent variable. It uses logistic models including Rasch models,

Generalised Rating Scale models, or Samejima's Graded-Response models (34).

These models are widely used in education and patient-reported outcome

assessments (35-38). IRT-tested scales plot both respondents’ and items’

measurements calibrated onto a common latent trait such as skin cancer

predisposition. IRT enables researchers to better visualise how changes in sun-

protective behaviours may influence underlying skin cancer predisposition.

The present study used data from a large population-based cross-sectional study (the

AusD Study), designed to assess vitamin D status and determinants across a range of


latitudes and seasons (39, 40). We aimed to a) assess the variation in attitudes and

behaviours according to residential location; b) identify the association between

participants’ attitudes about vitamin D and their self-reported changes to sun-

protection or exposure behaviours; and c) use IRT models to model the potential

effect on skin cancer predisposition that may occur if sun-protection behaviours

change due to concerns about vitamin D.

MATERIAL AND METHODS

The design, recruitment and main outcome measures of the multi-centre AusD Study

have been described previously in detail (39). Approval was obtained from four

institutional ethics committees. Potentially eligible participants were residents of 4

Australian cities (Hobart, Canberra, Brisbane, and Townsville) registered on the

Australian Electoral Roll (a compulsory register of Australian adults aged 18+ years)

and aged between 18 and 75 years. Exclusion criteria were: insufficient command of

English; an impairment or illness that prevented attendance at the interview; a

bleeding disorder; or positivity for hepatitis B virus, hepatitis C virus, or human

immunodeficiency virus. Participants completed a mailed health questionnaire

followed by two personal interviews at their local study site. At the end of the second

interview, a 20-mL venous blood sample was collected from each participant to

measure concentrations of serum 25-hydroxyvitamin D (25OHD). The serum and

buffy coat were processed using standard procedures, before storage locally in a

–80°C freezer. The final study sample was representative of the underlying population

based on a set of parameters (gender, age group, country of birth, perceived health

status, body mass index and smoking status) available from the population-based

2007–2008 National Health Survey; most participants (80.4%) had been born in

Australia, were full-time workers who worked primarily indoors, and considered

themselves to have fair-to-medium skin colour (participants were asked to self-report

their skin colour by a reference to a Fitzpatrick Skin Type chart (41)), brown hair,

and blue or grey eyes as previously reported (39). This analysis uses data from the

self-administered questionnaire (demographic characteristics and questions about the

way participants protect themselves from the sun), interviewer-administered

questions (phenotypic characteristics, skin cancer- and vitamin D-related attitudes,

use of sun protection, and changes in sun exposure behaviours due to concern about

vitamin D), and the blood sample used to measure concentrations of 25OHD.


Attitudes towards vitamin D:

Three questions assessed attitudes towards vitamin D (‘I worry about getting enough

vitamin D’; ‘I need to spend more time in the sun during summer for a healthy

vitamin D level’; ‘It is more important to stay out of the sun than to get enough

vitamin D’), with answer categories using 5-point Likert scales ranging from

strongly agree to strongly disagree. An option for ‘can’t say’ was included.

Participants were also asked whether they had noticed any news stories about

vitamin D (yes, no, unsure).

Change in sun-exposure or sun-protection behaviours due to concern about

vitamin D

Participants were asked if they had made changes to their personal sun-protection or

sun-exposure behaviours during the previous summer in order to get enough vitamin

D (“Did you try to wear shorts more often? Did you try to wear a hat less often? Did

you try to wear sunscreen less often? Did you spend more time in the sun? Any other

changes?”). Answer categories were yes, no, or can’t say. Fewer than 2.5% of

participants answered ‘can’t say’; these responses were combined with the ‘no’

category. Excluding participants who answered ‘can’t say’ did not significantly

change the results.

Statistical analysis: Prior to analysis we grouped the response categories strongly

disagree/disagree/neutral and strongly agree/agree. For the Rasch analysis, we

recoded the sun protection behaviour items (items 1-6 in Table 3.4) so that a higher

score indicated higher skin cancer predisposition, as follows: strongly disagree = 4,

disagree = 3, neutral = 2, agree = 1, and strongly agree = 0. The items on reduced sun

protection behaviour (items 7-11, e.g. ‘try to wear a hat less often’) were scored as

follows: strongly disagree = 0, disagree = 1, neutral = 2, agree = 3, and strongly agree = 4. We used chi-square

tests to compare attitudes, and changes in sun-protection behaviours, stratified by

participants’ locations. We also used chi-square tests to determine whether reported

changes in sun-protection behaviours to get more vitamin D varied according to

attitudes towards vitamin D or having heard news reports about vitamin D. Bivariate

logistic regression analyses were used to determine sociodemographic and skin

cancer risk factors associated with changes in sun protection behaviours. Factors that


were statistically significant (p<0.2) in the bivariate analyses and did not show

evidence for multi-collinearity were then included as adjustment factors in the

multivariate logistic regression analyses. Multivariable logistic regression models

were used to assess whether changes in sun-exposure or -protection behaviours (yes

or no) were influenced by vitamin D related attitudes, adjusted for age, sex, location,

education, indoor or outdoor work, ability to tan and participants’ measured

concentrations of serum 25OHD. We repeated the models adjusting for season (data

not shown), but results remained unchanged and the former more parsimonious

models are reported.
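The recoding described above can be sketched as follows. This is an illustrative reconstruction only: the label strings, dictionary names and function names are assumptions, and the item numbering simply follows the description of Table 3.4 given in the text.

```python
# Response labels in ascending order of agreement (assumed label strings)
LABELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Items 1-6 (sun protection): agreement implies LOWER predisposition, so reverse-score
PROTECTION_SCORES = {label: 4 - i for i, label in enumerate(LABELS)}
# Items 7-11 (reduced sun protection): agreement implies HIGHER predisposition
REDUCED_PROTECTION_SCORES = {label: i for i, label in enumerate(LABELS)}

def recode(item_number, response):
    """Return the 0-4 score for one response; item numbers follow Table 3.4."""
    if 1 <= item_number <= 6:
        return PROTECTION_SCORES[response]
    if 7 <= item_number <= 11:
        return REDUCED_PROTECTION_SCORES[response]
    raise ValueError(f"unexpected item number: {item_number}")

def dichotomise(response):
    """Collapse to agree (1) versus neutral/disagree (0), the grouping made prior to analysis."""
    return 1 if response in ("agree", "strongly agree") else 0
```

Scoring both item sets in the same direction means that a higher total always points towards higher skin cancer predisposition, which is what allows the eleven items to be placed on one scale.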

Item response theory: The matrix of responses of 1,002 participants to the attitude

items was subjected to Rasch analysis using the Andrich rating scale model for

polytomous data (42). Rasch models are a variant of IRT that model a relationship

between the levels of the latent trait (for this study skin cancer predisposition) and

the items used for measurement. In clinical assessment, the concept behind IRT is

that participants respond to items in a questionnaire based on the extent of the latent

trait (equivalent to person ability in Rasch analysis of a physical disability

instrument). Therefore, a person with an average level of severity of skin cancer

predisposition will likely report fewer sun exposure behaviours than

people with greater skin cancer predisposition. Severity of skin cancer predisposition

is expressed in terms of log odds or “logits,” and persons and items are mapped

along the same scale. Logit-transformed measures represent linear measures of skin

cancer predisposition. For an item, a logit represents the log odds of the extent of an

item relative to the position of that item within the total set of items analysed. Logits

of higher positive magnitude represent a participant who has higher skin cancer

predisposition. We applied IRT models to assess the item information functions of 29

self- and interviewer-administered questions on an underlying latent trait of skin

cancer predisposition: eighteen items measured phenotype and typical sun exposure

behaviours, six items measured sun-protection behaviours and five items measured

changing sun-protection behaviours due to concern about getting enough vitamin D.

Item information is the contribution that an individual item makes to the total

information of a measured latent construct and shows where on the underlying latent

construct each item measures optimally (43). In general, item information functions

tend to look bell-shaped. Highly discriminating items have tall, narrow information


functions; they contribute greatly but over a narrow range. Less discriminating items

provide less information but over a wider range. Plots of item information can be

used to see how much information an item contributes and to what portion of the

scale score range. Calibration into the Rasch Partial Credit Model (RPCM) (44) was

completed using ACER ConQuest software (45). Calibration is the procedure of

estimating a person’s ability (in this case the person’s skin cancer predisposition) and

item difficulty (propensity to endorse an item) by converting (scaling) raw scores to

logits on an underlying uni-dimensional measurement scale.
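To make the notions of category probabilities and item information concrete, the following sketch computes both for a single polytomous Rasch item in the partial credit form (the rating scale model is the special case in which all items share the same set of thresholds). The step difficulties are hypothetical, the function names are illustrative, and this is not the ConQuest estimation routine; it only evaluates the model for given parameter values.

```python
import math

def category_probabilities(theta, deltas):
    """Category probabilities for one polytomous Rasch item.

    theta:  person location in logits
    deltas: step difficulties delta_1..delta_m in logits (hypothetical here);
            the item has m + 1 ordered categories scored 0..m
    """
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum of 0
    numerators = [0.0]
    for delta in deltas:
        numerators.append(numerators[-1] + (theta - delta))
    weights = [math.exp(v) for v in numerators]
    total = sum(weights)
    return [w / total for w in weights]

def item_information(theta, deltas):
    """Item information at theta, which for Rasch-family items is the score variance."""
    probs = category_probabilities(theta, deltas)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))

# Hypothetical step difficulties for a 5-category item
deltas = [-1.5, -0.5, 0.5, 1.5]
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(item_information(theta, deltas), 3))
```

With these symmetric step difficulties the information peaks near the centre of the item's thresholds and falls away towards both extremes, which is the bell shape described above.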

Unweighted and weighted fit statistics were used to check the quality of the scale

from the Rasch model perspective. The mean square error (MNSQ) fit statistic is a

measure of the extent to which the data match the specifications of the model. As is

common practice in Rasch analysis, items that do not fit the model are removed.

Values of unweighted and weighted MNSQ can range from 0 to positive infinity with

an ideal value of 1.0 indicating that the data perfectly fit the model. Values below 1.0

suggest that variation in the observed data is over-predicted by the Rasch model

while values above 1.0 show that variation in the observed data is greater than that

predicted by the model. Currently there is no standard cut-off value for MNSQ;

different acceptable ranges are used to indicate good-fit of the model. We used a

relatively strict standard (unweighted MNSQ values between 0.75 and 1.33) as the

criterion for good fit (46). Once the skin cancer predisposition scale was

calibrated, we plotted each item and its response categories along this underlying

latent trait logit scale which is expressed as theta, with 0 representing the mean skin

cancer predisposition. To illustrate the potential effect of changing sun-protection

behaviour due to concern about vitamin D, we plotted a hypothetical example for a

person endorsing items that confer a high or low skin cancer predisposition to show

the impact on the underlying construct of skin cancer predisposition.
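The unweighted and weighted mean squares can be illustrated with a small sketch. For brevity it assumes a dichotomous Rasch item and already-estimated person and item locations; the polytomous case replaces the expected score and variance with their polytomous counterparts, and the exact computation inside ConQuest may differ. All function names are illustrative.

```python
import math

def rasch_probability(theta, b):
    """Expected score for a dichotomous Rasch item."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Unweighted and weighted mean-square fit for one dichotomous item.

    responses: observed 0/1 responses, one per person
    thetas:    person locations in logits
    b:         item location in logits
    """
    squared_residuals = []
    variances = []
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, b)        # model-expected score
        variances.append(p * (1.0 - p))        # model variance of the response
        squared_residuals.append((x - p) ** 2)
    # Unweighted: mean of the squared standardised residuals
    unweighted = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Weighted: summed squared residuals divided by summed model variance
    weighted = sum(squared_residuals) / sum(variances)
    return unweighted, weighted

def within_bounds(mnsq, lower=0.75, upper=1.33):
    """Check a mean square against the 0.75-1.33 bounds used in this chapter."""
    return lower <= mnsq <= upper
```

The unweighted statistic gives every response equal weight and is therefore sensitive to unexpected responses from persons located far from the item, whereas the weighted statistic emphasises responses from persons located near it.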

RESULTS

Of 11,713 people approached, 1,269 agreed to participate and 1,002 provided data

(overall study participation rate 9.1%). Demographic and phenotypic characteristics

of the sample have been previously reported (39). The distribution of participants

was approximately equally spread between the four study locations. The average age

of participants was 48 years (SD 16) and 46% were male. Over 80% of participants

were born in Australia and most had fair or medium skin colour (90%) and green,


hazel, grey or blue eyes (80%) placing them environmentally and constitutionally

(having a phenotype that confers overall higher than average risks of developing skin

cancer based on accumulated epidemiologic evidence e.g. skin type 1, red hair, lack

of tanning ability and propensity to freckle and burn) at risk of skin cancer. Fifty-six

participants had serum 25(OH)D levels below 25 nmol/L; a significantly greater

proportion of these participants (32.1%) were worried about not getting enough

vitamin D compared to participants with levels above 25 nmol/L (24.0%, p<0.03).

Participants from Canberra were more likely than those from other locations to: work

indoors (81% vs 68%) (p<0.001); have a bachelor degree (30% vs 22%; p<0.001);

and be born outside Australia (29% vs 16%, p<0.001). Participants from Canberra

were less likely to report fair skin than other participants (50% vs 68%; p<0.001),

while participants from Hobart were more likely to report blue, grey or green eye

colour compared to participants from elsewhere (64% vs 52%; p<0.001). A larger

proportion of participants from Hobart entered the study in spring while a larger

proportion of participants from Canberra participated during winter.

Vitamin D-related attitudes and change in sun-protection/exposure behaviours

due to concern about vitamin D, stratified by location

Concerns about vitamin D, and reported change in sun-protection or sun-exposure

behaviour due to those concerns, increased with increasing latitude (Table 3.1). For

example, 18% of participants from Townsville, 21% of participants from Brisbane,

31% from Canberra and 40% from Hobart agreed with the statement ‘I need to spend

more time in the sun during summer for a healthy vitamin D level’ (p<0.001).

Overall, between 4 and 15% of participants reported that they had changed their sun-

exposure or -protection behaviours during the previous summer to get sufficient

vitamin D. People from Hobart were significantly (p<0.001) more likely to report

wearing shorts (24%) and spending more time in the sun due to concern about

vitamin D (28%) than those from Brisbane or Townsville (8-10%). There were no

significant differences in hat and sunscreen use or other sun-protective behaviours

according to participants’ locations (Table 3.1), although these behaviours also

followed a latitudinal gradient.


Associations between vitamin D-related attitudes and sun protection behaviours

A larger proportion of people who worried about vitamin D or who felt they needed

to spend more time in the sun for vitamin D production reported that they had altered

their sun-exposure behaviours during the last summer (Table 3.2). In adjusted

multivariable logistic regression analyses, those who worried about getting enough

vitamin D wore sunscreen less often (adjusted OR=3.2; 95% CI 1.6-6.2; p=0.001) and

shorts more often (adjusted OR=1.6; 95% CI 1.0-2.6; p=0.04) and tended to spend more

time in the sun (adjusted OR=2.4; 95% CI 1.5-3.7; p<0.001). Those who agreed that

they needed to spend more time in the sun in summer for a healthy vitamin D level

were less likely to wear a hat (adjusted OR=2.6; 95% CI 1.2-5.6; p=0.04) or sunscreen

(adjusted OR=2.6; 95% CI 1.3-5.0; p=0.004), and more likely to wear shorts (adjusted

OR=3.0; 95% CI 1.9-4.7; p<0.001) and increase the amount of time spent in the sun

(adjusted OR=4.2; 95% CI 2.8-6.4; p<0.001) (Table 3.3).
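For readers less familiar with how such figures arise, the sketch below shows the usual relationship between a logistic regression coefficient, its standard error, and the odds ratio with a Wald-type 95% confidence interval. It assumes the intervals were computed on the log-odds scale, which is conventional but not stated in the text; the function names are illustrative.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a logistic regression coefficient and its standard error
    into an odds ratio with an approximate 95% confidence interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def implied_standard_error(lower, upper, z=1.96):
    """Back out the standard error of the log odds ratio from a reported interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

# Reported above: adjusted OR = 3.2 with 95% CI 1.6-6.2 (sunscreen, worry item)
se = implied_standard_error(1.6, 6.2)
print(round(se, 2))
```

Because the interval is symmetric on the log scale, the reported odds ratio lies close to the geometric mean of its confidence limits rather than their arithmetic midpoint.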

There were no significant differences in participants’ self-reported sun-protection

behaviours according to whether or not they had heard any ‘news about vitamin D’

or agreed or disagreed with the statement ‘it’s more important to stay out of the sun

than to get enough vitamin D’.

Potential effect of changes in sun-protection behaviour and underlying skin

cancer predisposition

For ease of interpretation, we transformed the person ability score (the skin cancer

predisposition score) from a logit score into a T-Score (see supplement 2) which

follows a T-score distribution with a mean of 50 and standard deviation of 10.

Overall the current participants were found to have skin cancer predisposition below

the mean (Mean=44.10). Table 3.4 shows the item locations and the scale and fit

statistics (MNSQ statistic) of selected sun exposure behaviour items within the

calibrated skin cancer predisposition latent trait continuum, expressed on a logit

scale. Estimates below 0 (negative) represent a low skin cancer predisposition, while

those above 0 (positive) represent an increasingly high skin cancer predisposition,

based on the self- and interviewer administered questions. The overall item

parameter estimates show that all 11 items fitted the skin cancer predisposition scale

well, as all were located within the recommended MNSQ bounds of 0.75 – 1.33.

Figure 3.1 visualises two items assessing hat-wearing behaviours on the calibrated skin


cancer predisposition scale. Compared to a hypothetical person who agrees with the

item “wear a hat” (i.e., skin cancer predisposition <0), a person who endorses the

item ‘try to wear a hat less often’ will be assigned a score well above 0. A Wilcoxon

Signed-Ranks Test indicated that the item locations for concern about vitamin D were

statistically significantly higher than the item locations for sun protection behaviour

(Z=-2.023, p=.043). This shows the potential effect of changing sun protection

behaviours.
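The logit-to-T-score rescaling described above can be sketched as a linear transformation. This is a minimal illustration only: the reference mean and standard deviation that were actually used are not stated in this section, so `ref_mean` and `ref_sd` below are hypothetical placeholders.

```python
def logit_to_t(theta, ref_mean=0.0, ref_sd=1.0):
    """Rescale a logit measure to a T-score (mean 50, SD 10).

    ref_mean and ref_sd describe the reference distribution of logit
    scores; the defaults are assumptions, not values from the study.
    """
    if ref_sd <= 0:
        raise ValueError("ref_sd must be positive")
    return 50.0 + 10.0 * (theta - ref_mean) / ref_sd

# A person at the reference mean scores 50; one reference SD above scores 60.
print(logit_to_t(0.0), logit_to_t(1.0))
```

Under this transformation, a sample mean T-score of 44.10 corresponds to a mean logit score about 0.59 reference standard deviations below the reference mean.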

DISCUSSION

Approximately one quarter of the participants were concerned about their vitamin D

status and believed they needed to spend more time in the sun. Although only 4%

reported changing their hat-wearing behaviours, 15% reported that they tried to

spend more time in the sun in the previous summer to synthesise enough vitamin D.

Attitudes about vitamin D and changes in sun-protection behaviours were

significantly related to each other and differed according to the latitude at which the

participant lived.

The United States Preventive Services Task Force recently reviewed the evidence on

the effect of vitamin D on fractures, cancers and other chronic disease prevention,

and concluded that while there is some positive evidence for fracture prevention, the

evidence for other chronic diseases is still inconclusive (47, 48). Given the

uncertainties surrounding the role of vitamin D in health, the known skin cancer-

inducing effects of sun exposure, and our findings suggesting a close association

between attitudes towards vitamin D and sun exposure behaviour, it is important to

ensure that public concern about vitamin D does not jeopardise skin cancer

prevention messages (29, 28).

IRT models graphically highlight the potential impact of self-reported behaviour

change on skin cancer predisposition. Cancer Council Australia’s Skin Cancer

Committee has updated their skin cancer prevention messages to accommodate the

balance between the risks and benefits of sun exposure; for example, they have

contributed to a position statement which recommends sun protection if the UV

Index is ≥ 3 but also “exposing the face, arms and hands or the equivalent area of

skin to a few minutes of sunlight on either side of the peak UV periods on most days

of the week” (49). A previous study (50) found that exposing the arms and

legs to the sun as little as twice per week for 5 minutes may be sufficient to

maintain adequate vitamin D (>30 nmol/L; depending on time of day, season, etc.). One

of the concerns with changing the sun-protection messages provided by preventive

health authorities is that people may be confused. For example, should they discard

hats and sunscreen in order to optimise vitamin D regardless of where they live? Our

finding that vitamin D-related attitudes and self-reported changes in sun-protective

behaviours increased with increasing latitude is reassuring and is consistent with the

messages and position statements issued by health authorities, which recommend, for

example, going without a hat only in the southern states of Australia in winter (51).

Once adjusted for relevant confounders, latitude and 25OHD level, only people who

worried about vitamin D, and those who specifically thought that they needed to

spend more time in the sun for vitamin D production, had higher odds of having

changed their sun-exposure behaviours. These findings suggest that people make

choices about their sun exposure based on their attitudes and environment (latitude),

and more research on these interactions is needed to determine what influences these

attitudes. We previously found that people obtained information through the media

(19, 28), but in this study we did not observe a strong association between having

heard about vitamin D on the news and change in either attitudes or behaviours.

Future work needs to explore this in more detail and should address important issues

such as adding some questions about participants’ knowledge of sun protection and

vitamin D.

Study limitations: The main limitation of this study was its cross-sectional design.

Further research incorporating longitudinal assessment of 25OHD is needed to

determine whether people who are worried about vitamin D status actually have

lower 25OHD levels, and if so, whether additional sun exposure helps to increase

these levels.

While the AusD Study recruited participants from a population-based register of all

Australian voting adults, the participation rate was low, and only through its

sampling requirements achieved a similar proportion of men and women. Results

from this study may not be generalisable to the general adult Australian population due

to the low response rate (9.1%). Participants were more likely than nonparticipants to be

female (54.2% vs. 47.2%; P < 0.001) and older than age 39 years (P < 0.001) (39).


Overall this study attracted a higher proportion of women and older, indoor-working,

well-educated participants compared with the underlying population. It is possible

that these participants may have been more motivated to participate because they

were more concerned about vitamin D than non-participants.

CONCLUSION

We found that the strongest and most consistent determinants of self-reported sun-

protection behaviour changes due to concerns about vitamin D were attitudes and

location, with those at higher latitudes worrying more. Further research is needed to

understand what drives people’s vitamin D-related attitudes. This information may

be useful to inform public health strategies or to help people to make behavioural

choices that are consistent with their values.

Acknowledgements

The authors thank the AusD investigators who provided the data extract used in this

study. Ngadiman Djaja is supported by the National Health and Medical Research

Council of Australia (NHMRC) CRESH PhD scholarship. Rachel E. Neale is funded

by an NHMRC Senior Research Fellowship. Robyn M. Lucas is supported by an

NHMRC Career Development Fellowship.


REFERENCES

1. Armstrong BK, Kricker A, English DR. Sun exposure and skin cancer.

Australasian Journal of Dermatology. 1997; Feb 1; 38(S1):S1-6.

2. International Agency for Research on Cancer. Solar and Ultraviolet

Radiation. Vol. 55. (Edited by International Agency for Research on Cancer), Lyon, France. 1992.

3. Armstrong BK, Kricker A. How much melanoma is caused by sun exposure?.

Melanoma research. 1993; Nov 1; 3(6):395-402.

4. The International Agency for Research on Cancer (2009) Radiation : A

review of human carcinogens. In IARC monographs on the evaluation of

carcinogenic risks to humans Vol. 100 D. Lyon, France.

5. World Health Organization, World Meteorological Organization, United

Nations Environment Programme and International Commission on Non-

Ionizing Radiation Protection. Global Solar UV Index: A Practical Guide.

(Edited by WHO). 2002.

6. Ross AC, Manson JE, Abrams SA, et al. The 2011 report on dietary reference

intakes for calcium and vitamin D from the Institute of Medicine: what

clinicians need to know. The Journal of Clinical Endocrinology &

Metabolism. 2011; Jan; 96(1):53-8.

7. Barnard K, Colón-Emeric C. Extraskeletal effects of vitamin D in older

adults: cardiovascular disease, mortality, mood, and cognition. The American

journal of geriatric pharmacotherapy. 2010; Feb 28;8(1):4-33.

8. Ginde AA, Scragg R, Schwartz RS, Camargo CA. Prospective Study of

Serum 25‐Hydroxyvitamin D Level, Cardiovascular Disease Mortality, and

All‐Cause Mortality in Older US Adults. Journal of the American Geriatrics

Society. 2009; Sep 1; 57(9):1595-603.

9. Tomson J, Emberson J, Hill M, et al. Vitamin D and risk of death from

vascular and non-vascular causes in the Whitehall study and meta-analyses of

12 000 deaths. European heart journal. 2013; May 7; 34(18):1365-74.

10. Bischoff-Ferrari HA, Giovannucci E, Willett WC, Dietrich T, Dawson-

Hughes B. Estimation of optimal serum concentrations of 25-hydroxyvitamin

D for multiple health outcomes. The American journal of clinical nutrition.

2006; Jul 1; 84(1):18-28.

11. Ben-Shoshan M. Vitamin D deficiency/insufficiency and challenges in

developing global vitamin D fortification and supplementation policy in

adults. Int J Vitam Nutr Res 2012; 82, 237-259.

12. Holick MF. Vitamin D, sunlight and cancer connection. Anti-Cancer Agents

in Medicinal Chemistry (Formerly Current Medicinal Chemistry-Anti-Cancer

Agents). 2013; Jan 1; 13(1):70-82.

13. The Australia and New Zealand Bone and Mineral Society, Osteoporosis

Australia, Australasian College of Dermatologists and The Cancer Council

Australia. Risks and benefits of sun exposure: Position statement. 2007.


Available at: http://www.cancer.org.au/policy-and-advocacy/position-statements/sun-smart/.

14. Samanek AJ, Croager EJ, Gies P, Milne E, Prince R, McMichael AJ, Lucas

RM, Slevin T. Estimates of beneficial and harmful sun exposure times during

the year for major Australian population centres. Medical journal of

Australia. 2006; Apr 3; 184(7):338.

15. Hartley M, Hoare S, Lithander FE, et al. Comparing the effects of sun

exposure and vitamin D supplementation on vitamin D insufficiency, and

immune and cardio-metabolic function: the Sun Exposure and Vitamin D

Supplementation (SEDS) Study. BMC public health. 2015; Feb 10; 15(1):1.

16. Janda M, Kimlin M, Whiteman D, Aitken J, Neale R. Sun protection and low

levels of vitamin D: are people concerned? Cancer Causes & Control. 2007;

Nov 1;18(9):1015-9.

17. Scully M, Wakefield M, Dixon H. Trends in news coverage about skin cancer

prevention, 1993‐2006: increasingly mixed messages for the public.

Australian and New Zealand journal of public health. 2008; Oct 1; 32(5):461-

6.

18. Dixon H, Warne C, Scully M, Dobbinson S, Wakefield M. Agenda-setting

effects of sun-related news coverage on public attitudes and beliefs about

tanning and skin cancer. Health communication. 2014; Feb 7; 29(2):173-81.

19. Langbecker D, Youl P, Kimlin M, Remm K, Janda M. Factors associated

with recall of media reports about vitamin D and sun protection. Australian

and New Zealand journal of public health. 2011; Apr 1; 35(2):159-62.

20. Bilinski K, Boyages S. The rise and rise of vitamin D testing. BMJ. 2012.

21. Bilinski K, Boyages S. Evidence of overtesting for vitamin D in Australia: an

analysis of 4.5 years of Medicare Benefits Schedule (MBS) data. BMJ open.

2013; Jan 1; 3(6):e002955.

22. The Department of Human Services. Medicare Benefits Schedule (MBS).

2014. Available at:

http://www.medicareaustralia.gov.au/provider/medicare/mbs.jsp.

23. Bilinski KL, Boyages SC. The rising cost of vitamin D testing in Australia:

time to establish guidelines for testing. The Medical journal of Australia.

2012; Jul 16; 197(2):90.

24. Glanz K, Rimer BK, Viswanath K, editors. Health behavior and health

education: theory, research, and practice. John Wiley & Sons; 2008; Aug 28.

25. Vu LH, van der Pols JC, Whiteman DC, Kimlin MG, Neale RE. Knowledge

and attitudes about vitamin D and impact on sun protection practices among

urban office workers in Brisbane, Australia. Cancer Epidemiology

Biomarkers & Prevention. 2010; Jun 22:1055-9965.

26. Youl PH, Janda M, Kimlin M. Vitamin D and sun protection: the impact of

mixed public health messages in Australia. International Journal of Cancer.

2009; Apr 15; 124(8):1963-70.


27. Janda M, Youl P, Bolz K, Niland C, Kimlin M. Knowledge about health

benefits of vitamin D in Queensland Australia. Preventive medicine. 2010;

Apr 30; 50(4):215-6.

28. Nowak M, Harrison SL, Buettner PG, et al. Vitamin D status of adults from

tropical Australia determined using two different laboratory assays:

implications for public health messages. Photochemistry and photobiology.

2011; Jul 1; 87(4):935-43.

29. Harrison S, Büttner P, Nowak M. Maternal beliefs about the reputed

therapeutic uses of sun exposure in infancy and the postpartum period.

Australian Midwifery. 2005; Aug 31; 18(2):22-8.

30. Reid CA, Kolakowsky-Hayner SA, Lewis AN, Armstrong AJ. Modern

psychometric methodology applications of item response theory.

Rehabilitation Counseling Bulletin. 2007; Apr 1; 50(3):177-88.

31. Cella D, Chang CH. Response to Hays et al and McHorney and Cohen: A

discussion of item response theory and its applications in health status

assessment. Medical Care. 2000; Sep 1; 38(9):II-66.

32. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes

measurement in the 21st century. Medical care. 2000; Sep; 38(9 Suppl):II28.

33. Warne RT, McKyer EJ, Smith ML. An introduction to item response theory

for health behavior researchers. American journal of health behavior. 2012;

Jan 1; 36(1):31-43.

34. Linacre JM. What is item response theory, IRT? A tentative taxonomy. Rasch

Measurement Transactions. 2003; 17(2):926-7.

35. da Rocha NS, Chachamovich E, de Almeida Fleck MP, Tennant A. An

introduction to Rasch analysis for psychiatric practice and research. Journal

of psychiatric research. 2013; Feb 28; 47(2):141-8.

36. Leung YY, Png ME, Conaghan P, Tennant A. A systematic literature review

on the application of Rasch analysis in musculoskeletal disease—A special

interest group report of OMERACT 11. The Journal of rheumatology. 2014;

Jan 1; 41(1):159-64.

37. Lundgren-Nilsson Å, Jonsdottir IH, Ahlborg G, Tennant A. Construct validity

of the psychological general well being index (PGWBI) in a sample of

patients undergoing treatment for stress-related exhaustion: a rasch analysis.

Health and quality of life outcomes. 2013; Jan 7; 11(1):1.

38. Waller J, Ostini R, Marlow LA, McCaffery K, Zimet G. Validation of a

measure of knowledge about human papillomavirus (HPV) using item

response theory and classical test theory. Preventive medicine. 2013; Jan 31;

56(1):35-40.

39. Brodie AM, Lucas RM, Harrison SL, et al. The AusD Study: a population-

based study of the determinants of serum 25-hydroxyvitamin D concentration

across a broad latitude range. American journal of epidemiology. 2013; Mar

22; kws322.


40. Kimlin MG, Lucas RM, Harrison SL, et al. The contributions of solar

ultraviolet radiation exposure and other determinants to serum 25-

hydroxyvitamin D concentrations in Australian adults: the AusD Study.

American journal of epidemiology. 2014; Apr 1; 179(7):864-74.

41. Fitzpatrick TB. Soleil et peau. J Med Esthet. 1975; 2(7):33-4.


42. Andrich D. A rating formulation for ordered response categories.

Psychometrika. 1978; Dec 1; 43(4):561-73.

43. De Ayala RJ. The theory and practice of item response theory. Guilford

Publications; 2013; Oct 15.

44. Masters GN. A Rasch model for partial credit scoring. Psychometrika. 1982;

Jun 1; 47(2):149-74.

45. Adams R, Wu M, Wilson M. ACER ConQuest 3.0.1. ACER, Melbourne,

Australia. 2013.

46. Wilson M. Constructing measures: An item response modeling approach.

Routledge; 2004; Dec 13.

47. Chung M, Lee J, Terasawa T, Lau J, Trikalinos TA. Vitamin D with or

without calcium supplementation for prevention of cancer and fractures: an

updated meta-analysis for the US Preventive Services Task Force. Annals of

internal medicine. 2011; Dec 20; 155(12):827-38.

48. Lips P, Gielen E, van Schoor NM. Vitamin D supplements with or without

calcium to prevent fractures. BoneKEy reports. 2014; Mar 5; 3.

49. Cancer Council Australia. Position statement: Screening and early detection

of skin cancer. 2007.

50. Holick MF. Vitamin D deficiency. New England Journal of Medicine. 2007;

Jul 19; 357(3):266-81.

51. Nowson CA, McGrath JJ, Ebeling PR, et al. Vitamin D and health in adults in

Australia and New Zealand: a position statement. Med J Aust. 2012; Jun 18;

196(11):686-7.


Table 3.1: Differences in vitamin D attitudes and sun protection behaviours by location1

Item | All participants, N (%) | Townsville 19.3°S, N=259 (%) | Brisbane 27.5°S, N=254 (%) | Canberra 35.3°S, N=252 (%) | Hobart 42.8°S, N=237 (%) | p-value3
I worry about getting enough vitamin D: Agree2 | 237 (24.0) | 30 (11.7) | 53 (21.0) | 63 (25.5) | 91 (39.1) | 0.001
I need to spend more time in the sun during summer for a healthy vitamin D: Agree2 | 270 (27.2) | 47 (18.3) | 53 (20.9) | 78 (31.1) | 92 (39.5) | <0.001
It is more important to stay out of the sun than to get enough vitamin D: Agree2 | 160 (16.2) | 60 (23.3) | 35 (13.9) | 40 (16.1) | 25 (10.8) | <0.001
Have you ever heard news reports about getting vitamin D from sunlight: Yes | 633 (64.7) | 139 (55.2) | 165 (66.0) | 167 (67.6) | 162 (70.7) | 0.002
Last summer did you make any changes to the way you protected yourself from the sun so you could get enough vitamin D?
Wear hat less often | 41 (4.1) | 8 (3.1) | 7 (2.8) | 11 (4.4) | 15 (6.5) | 0.46
Wear sunscreen less often | 57 (5.8) | 8 (3.1) | 13 (5.1) | 18 (7.2) | 18 (7.8) | 0.18
Wear shorts more often | 136 (13.8) | 21 (8.2) | 26 (10.3) | 33 (13.5) | 56 (24.2) | <0.001
Spend more time in the sun | 153 (15.5) | 20 (7.8) | 23 (9.2) | 44 (17.7) | 66 (28.4) | <0.001
Any other changes | 81 (8.2) | 20 (7.8) | 19 (7.5) | 21 (8.5) | 21 (9.1) | 0.29

1 n may vary slightly due to some missing values

2 Agree = combined categories of strongly agree/agree

3 p-value from Chi Square test


Table 3.2: Vitamin D-related attitudes and self-reported changes in sun protection behaviours

Last summer did you make any changes to the way you protect yourself from the sun so you could get enough vitamin D?

All cells are N (%). Each p-value row lists one p-value per behaviour, in column order.

Attitude response | Wear hat less often: Yes | Wear hat less often: No | Wear sunscreen less often: Yes | Wear sunscreen less often: No | Wear shorts more often: Yes | Wear shorts more often: No | Spend more time in the sun: Yes | Spend more time in the sun: No | Any other changes: Yes | Any other changes: No

I worry about getting enough vitamin D
Strongly disagree/disagree/neutral, N 752 (76.0%) | 22 (55.0) | 730 (76.9) | 26 (47.3) | 726 (77.7) | 80 (59.7) | 672 (78.6) | 78 (51.3) | 674 (80.5) | 62 (77.5) | 690 (75.9)
Strongly agree/agree, N 237 (24.0%) | 18 (45.0) | 219 (23.1) | 29 (52.7) | 208 (22.3) | 54 (40.3) | 183 (21.4) | 74 (48.7) | 163 (19.5) | 18 (22.4) | 219 (24.1)
p-value | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p = 0.75

I need to spend more time in the sun during summer for a healthy vitamin D level
Strongly disagree/disagree/neutral, N 724 (72.8%) | 19 (46.3) | 705 (74.0) | 27 (47.4) | 697 (74.4) | 65 (47.8) | 659 (76.8) | 61 (39.9) | 663 (78.8) | 52 (64.2) | 672 (73.6)
Strongly agree/agree, N 270 (27.2%) | 22 (53.7) | 248 (26.0) | 30 (52.6) | 240 (25.6) | 71 (52.2) | 199 (23.2) | 92 (60.1) | 178 (21.2) | 29 (35.8) | 241 (26.4)
p-value | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 | p = 0.07

It is more important to stay out of the sun than to get enough vitamin D
Strongly disagree/disagree/neutral, N 829 (83.8%) | 35 (87.5) | 794 (83.7) | 49 (87.5) | 780 (83.6) | 122 (91.0) | 707 (82.7) | 135 (89.4) | 694 (82.8) | 70 (86.4) | 759 (83.6)
Strongly agree/agree, N 160 (16.2%) | 5 (12.5) | 155 (16.3) | 7 (12.5) | 153 (16.4) | 12 (9.0) | 148 (17.3) | 16 (10.6) | 144 (17.2) | 11 (13.6) | 149 (16.4)
p-value | p = 0.52 | p = 0.44 | p = 0.01 | p = 0.04 | p = 0.51

Have you ever heard news reports about getting vitamin D from sunlight
No, N 345 (35.3%) | 11 (27.5) | 334 (35.6) | 20 (35.7) | 325 (35.2) | 47 (35.6) | 298 (35.2) | 59 (39.6) | 286 (34.5) | 26 (32.1) | 319 (35.6)
Yes, N 633 (64.7%) | 29 (72.5) | 604 (64.4) | 36 (64.3) | 597 (64.8) | 85 (64.4) | 548 (64.8) | 90 (60.4) | 543 (65.5) | 55 (67.9) | 578 (64.4)
p-value | p = 0.29 | p = 0.94 | p = 0.93 | p = 0.23 | p = 0.53

1 N may vary slightly due to some missing values


Table 3.3: Multivariable logistic regression models of associations between vitamin D-related attitudes and changes made during the last summer

to the way people protected themselves from the sun so they can get enough vitamin D*

All cells are OR (95% CI); p value.

Attitude response | Try to wear a hat less often? | Try to use sunscreen less often? | Try to wear shorts or short sleeved clothing more often? | Try to spend more time out in the sun?

I worry about getting enough vitamin D
Strongly disagree/disagree/neutral | 1.0 | 1.0 | 1.0 | 1.0
Strongly agree/agree | 1.5 (0.7-3.4); 0.31 | 3.2 (1.6-6.2); 0.001 | 1.6 (1.0-2.6); 0.04 | 2.4 (1.5-3.7); 0.001

I need to spend more time in the sun during summer for healthy vitamin D level
Strongly disagree/disagree/neutral | 1.0 | 1.0 | 1.0 | 1.0
Strongly agree/agree | 2.6 (1.2-5.6); 0.04 | 2.6 (1.3-5.0); 0.004 | 3.0 (1.9-4.7); <0.001 | 4.2 (2.8-6.4); 0.001

It is more important to stay out of the sun than to get enough vitamin D
Strongly disagree/disagree/neutral | 1.0 | 1.0 | 1.0 | 1.0
Strongly agree/agree | 0.9 (0.3-2.7); 0.82 | 1.5 (0.6-3.8); 0.34 | 0.6 (0.3-1.1); 0.11 | 0.7 (0.4-1.3); 0.28

Have you ever heard news reports about getting vitamin D from sunlight
No | 1.0 | 1.0 | 1.0 | 1.0
Yes | 1.1 (0.5-2.3); 0.93 | 0.7 (0.4-1.3); 0.25 | 0.9 (0.6-1.4); 0.61 | 0.6 (0.4-1.0); 0.05

*For ease of reporting, all models are adjusted for age, sex, latitude, season, education, occupational exposure, ability to tan, measured 25OHD (continuous)


Table 3.4: Item location and fit statistics of sun protection behaviour items calibrated

within a skin cancer predisposition model.

Item No | Item | Estimated delta (standard error)* | Unweighted Fit MNSQ**
1 | Wear a broad-brimmed hat | -0.997 (0.032) | 1.01
2 | Wear a cap | -1.246 (0.039) | 1.02
3 | Wear any other head covering | -1.834 (0.070) | 1.00
4 | Wear a shirt with long sleeves | -0.634 (0.030) | 0.99
5 | Wear long trousers or clothing that covers all or most of your legs | -0.399 (0.031) | 1.01
6 | Wear sun glasses | -0.351 (0.031) | 1.00
7 | Try to wear a hat less often | 2.542 (0.155) | 1.00
8 | Try to use sunscreen less often | 2.188 (0.133) | 1.00
9 | Try to wear shorts or short sleeved clothing more often | 1.218 (0.091) | 0.99
10 | Try to spend more time out in the sun | 1.092 (0.087) | 0.99
11 | Make any other changes to the way you protect yourself from the sun | 1.802 (0.113) |

Abbreviations: MNSQ = Mean Square

*The estimated delta is the item location within a skin cancer predisposition continuum on a

logit scale. The score can range from negative infinity to positive infinity. Scores below 0

(negative) represent a low skin cancer predisposition score and scores above 0 (positive)

represent an increasingly high skin cancer predisposition score.

** The fit of the items is evaluated using unweighted Mean Square (MNSQ). A MNSQ near

1 indicates a good fit. MNSQ <1 indicates an overfit, that is, the item discriminates more

than assumed in the model. MNSQ scores >1 usually occur if the discrimination of the item

is low; this is considered to be more serious violation to model fit than MNSQ <1.
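To make the item locations and fit bounds in Table 3.4 concrete, the sketch below evaluates the endorsement probability for two of the items and checks the tabulated MNSQ values against the 0.75 to 1.33 bounds used in the text. It treats each item as dichotomous, which is a simplifying assumption made for illustration and may not match the model actually fitted; the function names are hypothetical.

```python
import math

def rasch_prob(theta, delta):
    """Probability of endorsing a dichotomous item located at delta (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def within_fit_bounds(mnsq, lower=0.75, upper=1.33):
    """True if an unweighted mean-square fit statistic lies inside the bounds."""
    return lower <= mnsq <= upper

# Item locations copied from Table 3.4 (logits)
delta_broad_brimmed_hat = -0.997
delta_hat_less_often = 2.542

# Endorsement probabilities for a person at the scale origin (theta = 0)
p_hat = rasch_prob(0.0, delta_broad_brimmed_hat)
p_less = rasch_prob(0.0, delta_hat_less_often)
print(f"wear a broad-brimmed hat: {p_hat:.2f}; try to wear a hat less often: {p_less:.2f}")

# MNSQ values for items 1-10 from Table 3.4 (no value is listed for item 11)
mnsq_values = [1.01, 1.02, 1.00, 0.99, 1.01, 1.00, 1.00, 1.00, 0.99, 0.99]
all_fit = all(within_fit_bounds(m) for m in mnsq_values)
```

A person located exactly at an item's delta has a 50% chance of endorsing it, so an item with a high positive delta is endorsed mainly by people with a high predisposition score.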


Figure 3.1: Item Information Functions from two items plotted along the latent trait

logits of skin cancer predisposition

*Average skin cancer predisposition is located at zero.

**Graph indicating that if a person were to endorse the item “wear a broad brimmed hat” their skin cancer risk is

below average, whereas if they endorse the item “Last summer …, tried to wear a hat less often” risk is above 0

(average).

[Figure 3.1 curve labels: ‘Wear a broad-brimmed hat’ and ‘Try to wear a hat less often’; x-axis: skin cancer predisposition]


Table 3.5: Supplement 1. Demographic and Phenotypic characteristics of the

participants (n=1,002)

Characteristics Participants

No. %

Sex

Male 459 45.8

Female 543 54.2

Age group, years

18-24 72 7.2

25-44 358 35.7

45-64 377 37.6

65-75 195 19.5

Country of birth

Australia 806 80.4

Other countries 196 19.6

Skin colour

Dark/Black 11 1.1

Olive 94 9.4

Medium 258 25.7

Fair 628 62.7

Missing 11 1.1

Natural hair colour at 18

Black 102 10.2

Brown 655 65.4

Blonde 200 20.0

Red 37 3.7

Missing 8 0.8

Eye colour

Brown 201 20.1

Hazel 247 24.7

Blue or Grey 482 48.1

Green 62 6.2

Missing 10 1.0


*The x-axis shows the participants’ skin cancer predisposition (converted to T score with Mean = 50,

SD=10)

Figure 3.2: Supplement 2. The distribution of the skin cancer predisposition (scores

converted to T score)

84 Chapter 4: Estimating Skin Cancer Risk: Evaluating Mobile Computer Adaptive Testing

Estimating Skin Cancer Risk: Evaluating Mobile Computer-Adaptive Testing

This chapter includes a peer-reviewed journal article published in Journal of Medical

Internet Research. This article evaluates the efficiency of non-adaptive testing and

computer adaptive testing to estimate skin cancer risk. A Dichotomous Model,

Rating Scale Model and Partial Credit Model were applied to the simulation data.

Djaja, N., Janda, M., Olsen, C. M., Whiteman, D. C., & Chien, T.-

W. (2016). Estimating Skin Cancer Risk: Evaluating Mobile

Computer-Adaptive Testing. Journal of Medical Internet

Research, 18(e22). doi:10.2196/jmir.4736


ABSTRACT

Background: Response burden is a major detriment to questionnaire completion

rates. Computer adaptive testing may offer advantages over non-adaptive testing,

including reduction of numbers of items required for precise measurement.

Objective: To compare the efficiency of non-adaptive (NAT) and computer adaptive

testing (CAT) facilitated by Partial Credit Model (PCM) derived calibration to

estimate skin cancer risk.

Method: We used a random sample (two thirds) drawn from a population-based

Australian cohort study of skin cancer risk (n=43,794). All 30 items of the skin

cancer risk scale (SCRS) were calibrated with the Rasch PCM. A total of 1,000 cases

generated following a normal distribution (mean=0, SD=1) were simulated using

three Rasch models with three fixed-item (dichotomous, rating scale and partial

credit) scenarios, respectively. We calculated the comparative efficiency and

precision of CAT and NAT (shortening of questionnaire length and the count

difference number ratio less than 5% using independent t tests).

Results: We found that use of CAT led to smaller person standard error (SE) of the

estimated measure than NAT with substantially higher efficiency but no loss of

precision, reducing response burden by 48%, 66%, and 66% for dichotomous, Rating

Scale Model, and PCM models, respectively.

Conclusions: CAT-based administrations of the SCRS could substantially reduce

participant burden without compromising measurement precision. A mobile on-line

computer adaptive test was developed to help people efficiently assess their skin

cancer risk.

Keywords: computer adaptive testing, skin cancer risk scale, Non Adaptive Test,

Rasch analysis, partial credit model


INTRODUCTION

In Australia, skin cancers account for approximately 80% of all newly diagnosed

cancers [1]. There are three main types of skin cancer: (1) melanoma (the most

dangerous form of skin cancer), (2) basal cell carcinoma (BCC), and (3) squamous

cell carcinoma (SCC). BCC and SCC are often grouped together as non-melanoma or

keratinocyte skin cancers. Australia’s incidence of skin cancer is one of the highest

in the world: two to three times the rates observed in Canada, the United States, and

the United Kingdom [2], with age-standardised incidence rates of 65.3 × 10⁻⁵ for cutaneous

melanoma and 1878 × 10⁻⁵ for keratinocyte cancer [1]. From a

population of only 23 million, more than 434,000 people are treated for one or more

non-melanoma skin cancers in Australia each year [1]. Ultraviolet radiation exposure

from sunlight is the major causal factor for skin cancer [2]. Personal behaviours to

reduce excessive sunlight exposure are important modifiable factors for the

prevention of skin cancers. The World Health Organization recommends several

suitable behaviours such as appropriate use of sunscreens, staying in the shade,

covering with sun protective clothing, giving up sunbathing, and abstaining from

using sunbeds [3].

Requirement for Model-Data-Fit Detection

In practice, we do not know the real skin cancer risk for a person. Thus, assuming a

person has characteristic attributes that correlate highly with the underlying construct

of skin cancer, risk can be assessed through questions (i.e., questionnaire items); for

example, phenotypic measures such as freckles, hair color, eye color, tendency to

burn, or behavioural factors such as attitudes to tanning and use of sunbeds. Using

the responses to these items, it should be possible to create a unidimensional (i.e.,

addable) scale to measure these attributes and calculate an overall skin cancer risk

score. Ideally, such a score would be precise and characterised by a small standard

error (SE).

Statistical validity is the correlation between each person’s measures (or scores) on a

questionnaire and those persons’ unobservable true status [4]. Such unobservable

variables (e.g., true score or behaviours relating to sun protection and sun exposure)

are considered latent traits (i.e., they exist but cannot be directly observed). The question

is how to obtain optimal correlation (or validity) between the items when the true


score is unknown. Rasch models [5] can be a gateway to assess how well the items

measure the underlying latent trait [6-8]. That is, a unidimensional scale can be

verified by Rasch analysis: when the data fit the Rasch model, all items can be

added.

Questionnaires that are built and tested using the Rasch model have been common

in educational assessment for many years but are now also increasingly appreciated

in health assessment, including measures of patient outcomes (quality of life, pain,

depression) and other diverse latent traits such as perceptions of patient

hospitalisation and nurse bullying [9,10]. We previously applied the Rasch model to

the assessment of the quality of an instrument to measure attitudes to skin self-

examination [11]. Rasch analysis allows researchers to calculate a precise estimate of

the latent trait by assessment of unidimensionality of the items, assessment of

differential item functioning [12] (e.g., probability of giving a certain response on an

item by people from different groups with the same latent trait), and the possibility of

transferring static questionnaires to computer adaptive testing (CAT) [13].

Multimedia Graphical Representations to Improve Patients’ Health Literacy

Patients’ health literacy is increasingly recognised as a critical factor affecting

patient-physician communication and health outcomes [14], as a mediator for cancer

screening behaviour [15], and as a pathway between health literacy and cancer

screening [16]. Adults with below basic or basic health literacy are less likely than

adults with higher health literacy to get information about health issues from written

sources (e.g., newspapers, magazines, books, brochures, or the Internet) and more

likely than adults with higher health literacy to get a lot of information about health

issues from radio and television [17]. A mobile CAT with multimedia graphical

representations (i.e., similar to radio and television) could increase awareness of the

risk of developing skin cancer (i.e., health literacy) and motivate patient-physician

communication and subsequently behavioural change. However, no mobile CAT app

with graphical representations has been available until now.

Study Aims

Using data from a large cohort study of skin cancer from Queensland, Australia [18],

we conducted a simulation study with a methodological focus to apply Rasch models

to an existing skin cancer risk questionnaire. Further, we sought to compare static


(non-adaptive) presentation as commonly used in paper and pencil questionnaires

versus computer adaptive testing (CAT) for its precision in measurement. We

hypothesised that compared to non-adaptive testing (NAT), CAT would result in

greater precision (lower SE) for a similar item number or a shorter questionnaire of

similar SE.
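The contrast between non-adaptive and adaptive administration can be sketched with a deliberately simplified CAT loop for dichotomous Rasch items. Everything here is an assumption made for illustration: the item bank is hypothetical, the next item is chosen as the one closest to the current estimate (which maximises Rasch item information), and the stopping rule (SE at or below 0.5 logits) is a placeholder rather than the rule used in this study.

```python
import math
import random

def prob(theta, delta):
    """Rasch probability of endorsing a dichotomous item located at delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def estimate(responses, theta):
    """Update theta from (delta, x) pairs; returns (theta, SE)."""
    score = sum(x for _, x in responses)
    if score == 0 or score == len(responses):
        # No finite maximum-likelihood estimate yet: take a fixed step instead.
        step = 1.0 if score > 0 else -1.0
        return max(-6.0, min(6.0, theta + step)), float("inf")
    for _ in range(25):  # damped Newton-Raphson iterations
        ps = [prob(theta, d) for d, _ in responses]
        info = sum(p * (1.0 - p) for p in ps)
        grad = sum(x - p for (_, x), p in zip(responses, ps))
        theta += max(-1.0, min(1.0, grad / info))
        theta = max(-6.0, min(6.0, theta))
    info = sum(prob(theta, d) * (1.0 - prob(theta, d)) for d, _ in responses)
    return theta, 1.0 / math.sqrt(info)

def run_cat(true_theta, bank, se_target=0.5, seed=0):
    """Administer items adaptively until SE <= se_target or the bank is used up."""
    rng = random.Random(seed)
    remaining = list(bank)
    responses = []
    theta, se = 0.0, float("inf")
    while remaining and se > se_target:
        # The most informative Rasch item is the one closest to the current theta.
        item = min(remaining, key=lambda d: abs(d - theta))
        remaining.remove(item)
        x = 1 if rng.random() < prob(true_theta, item) else 0  # simulated answer
        responses.append((item, x))
        theta, se = estimate(responses, theta)
    return theta, se, len(responses)

# Hypothetical 30-item bank spread evenly from -2.9 to 2.9 logits
bank = [-2.9 + 0.2 * i for i in range(30)]
theta_hat, se, n_used = run_cat(true_theta=0.5, bank=bank)
print(f"estimate={theta_hat:.2f}, SE={se:.2f}, items used={n_used} of {len(bank)}")
```

A non-adaptive test would administer all 30 items regardless of the running estimate; the adaptive loop stops as soon as the precision target is met, which is the source of the reduction in response burden examined in this chapter.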

METHODS

Data Source

De-identified data from the QSkin Sun and Health study baseline questionnaire were

used [18]. This is a population-based cohort study of 43,794 men and women aged

40-69 years randomly sampled from the population of Queensland, Australia, in

2011 (Figure 4.1). We randomly partitioned the data into a calibration dataset (two-

thirds, n=29,314) and a validation dataset (one-third, n=14,480). In the calibration

dataset, 7213 participants had a history of skin cancer and 22,101 participants did not

(Figure 4.2).
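A random partition into calibration and validation datasets can be sketched as follows. The participant identifiers, seed, and function name are hypothetical. An exact two-thirds cut of 43,794 records gives 29,196, so the reported sizes (29,314 and 14,480) suggest a slightly different procedure, such as independent random assignment of each record; this sketch does not attempt to reproduce them.

```python
import random

def split_calibration_validation(ids, calibration_fraction=2 / 3, seed=2011):
    """Shuffle identifiers and split them into calibration and validation sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * calibration_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical identifiers standing in for the 43,794 cohort participants
calibration_ids, validation_ids = split_calibration_validation(range(43794))
print(len(calibration_ids), len(validation_ids))
```

Keeping the validation records out of the calibration step means that item parameters are later checked on participants who did not influence their estimation.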

Approval for this study was obtained from the QIMR Berghofer Medical Research

Institute Human Research Ethics Committee (approval #P1309). Participants joined

the study by completing consent forms and the survey and returning them in a reply-

paid envelope. Participants completed two consent forms. The first consent form

covered the use of information provided in the survey and permission for data linkage to

cancer registries, pathology laboratories, and public hospital databases. The second

consent form gave permission for data linkage to Medicare Australia (Australia’s

universal national health insurance scheme) to ascertain whether or not participants

had developed skin cancer.


Figure 4.1: Sample selection flowchart

The baseline questionnaire consisted of 46 items and was answered by all QSkin

participants. All items were examined using the Rasch Partial Credit Model (PCM)

[19] (Figure 4.2). For optimal fit, the Rasch model requires a unidimensional

measurement with criteria of Infit and Outfit mean square errors of each item <1.5

[20]. PCM allows for items to have a variable number of thresholds and step

difficulties in contrast to the more commonly used Rating Scale Model (RSM)

[8,9,21], which requires all items to use the same response categories.

For item invariance, the item estimation should be independent of the subgroups of

individuals completing the questions and should work equally across populations

[22]. Items not demonstrating invariance are commonly referred to as exhibiting

differential item functioning (DIF) [23,24] or item bias. The chi-square test used for

detecting DIF was computed from a comparison of the observed overall performance

of each trait group on the item with its expected performance [25]. Its probability

(e.g., P<.05) reports the statistical probability of observing a chi-square value at least this large when

the data fit the Rasch model. We used WINSTEPS [26] to detect items above the

thresholds for DIF.

In addition, the category structure for each of the items in the skin cancer item bank

should display monotonically increasing thresholds following Linacre’s

guidelines [27] to improve the utility of the resulting measures.


Determining a Cut-Off Point of Skin Cancer Risk

Traditionally in clinical practice, researchers use the C-statistic, or area under the
receiver operating characteristic (ROC) curve, which is obtained by plotting the true
positive rate (sensitivity) against the false positive rate (1 - specificity) at various
threshold settings [28]. In this study, we plotted the normal distributions of the two
samples (cases with and without previous skin cancers), whose means and standard
deviations were known, together with the ROC curve (Figure 4.3). The plot displays
the cut-off point, the area under the ROC curve, and a vertical bar marking the
cut-off point. WINSTEPS software [26] was used to estimate the means and standard
deviations of cases with and without previous skin cancers, and the cut-off point of
skin cancer risk with maximal sensitivity and specificity was then determined in MS
Excel (Figure 4.3). Providing the cut-off point in graphical form makes the results
clear and easy for readers or clinicians to interpret.
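The cut-off search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the MS Excel procedure used in the study: it assumes that "maximal sensitivity and specificity" means maximising Youden's J (sensitivity + specificity - 1), and the function name `binormal_cutoff` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def binormal_cutoff(mean_neg, sd_neg, mean_pos, sd_pos, steps=2000):
    """Scan candidate cut-offs between two normal score distributions.

    Assumption: the 'best' cut-off maximises Youden's J
    (sensitivity + specificity - 1). Returns that cut-off, the sensitivity
    and specificity at that point, and the binormal AUC.
    """
    neg = NormalDist(mean_neg, sd_neg)  # persons without previous skin cancer
    pos = NormalDist(mean_pos, sd_pos)  # persons with previous skin cancer
    lo = min(mean_neg - 4 * sd_neg, mean_pos - 4 * sd_pos)
    hi = max(mean_neg + 4 * sd_neg, mean_pos + 4 * sd_pos)

    best = None
    for k in range(steps + 1):
        cut = lo + (hi - lo) * k / steps
        specificity = neg.cdf(cut)        # negatives scoring below the cut-off
        sensitivity = 1.0 - pos.cdf(cut)  # positives scoring above the cut-off
        youden = sensitivity + specificity - 1.0
        if best is None or youden > best["youden"]:
            best = {"cutoff": cut, "sensitivity": sensitivity,
                    "specificity": specificity, "youden": youden}

    # Closed-form area under the ROC curve for two normal distributions.
    z = (mean_pos - mean_neg) / math.hypot(sd_neg, sd_pos)
    best["auc"] = NormalDist().cdf(z)
    return best


result = binormal_cutoff(mean_neg=0.0, sd_neg=1.0, mean_pos=2.0, sd_pos=1.0)
print(result)
```

Because the sketch relies on a normal approximation and one particular optimality criterion, its output for a given pair of means and standard deviations need not coincide with values obtained by other procedures.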


Figure 4.2: Study simulation and CAT flowchart.

Mobile Computer Adaptive Testing Designed for Examining Personal Skin

Cancer Risk

The CAT item bank (fitting the Rasch model’s requirements regarding
unidimensionality, local independence, and monotonicity, as well as absence of DIF
by gender) was constructed from the parameters of all 31 items obtained from the
calibration using WINSTEPS [26].

To start the CAT, an initial item was selected randomly from the item bank. Using

this initial item, a provisional person measure was estimated by the expected a

posteriori (EAP) method [29] in an iterative Newton-Raphson procedure [9,30].


After each item was answered, EAP was recalculated, until the final score for the

person was determined by the maximum of the log-likelihood function before

terminating the CAT (Figure 4.2). The next item selection was based on the highest

Fisher information (i.e., item variance) of the remaining unanswered items

interacting with the provisional person measure.
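The item selection step can be illustrated with a short Python sketch. It is a hedged illustration rather than the Visual Basic module used in the study: for a Rasch PCM item, the Fisher information at a given theta equals the variance of the item score, and the next item is the unanswered one with the largest information at the provisional measure. The item bank, its step difficulties, and the function names are hypothetical.

```python
import math


def pcm_probs(theta, deltas):
    """Category probabilities (scores 0..m) of one PCM item.

    deltas holds the step difficulties delta_1..delta_m in logits.
    """
    cumulative = [0.0]
    for d in deltas:
        cumulative.append(cumulative[-1] + (theta - d))
    top = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def item_information(theta, deltas):
    """Fisher information of a PCM item = variance of the item score."""
    probs = pcm_probs(theta, deltas)
    expected = sum(k * p for k, p in enumerate(probs))
    return sum((k - expected) ** 2 * p for k, p in enumerate(probs))


def next_item(theta, bank, answered):
    """Pick the unanswered item with the highest information at theta."""
    candidates = [name for name in bank if name not in answered]
    return max(candidates, key=lambda name: item_information(theta, bank[name]))


# Hypothetical three-item bank (step difficulties in logits).
bank = {"A": [-2.0], "B": [0.0], "C": [2.0]}
print(next_item(0.1, bank, answered=set()))
print(next_item(0.1, bank, answered={"B"}))
```

Selecting purely on maximum information is the simplest rule; it is also the source of the item over-exposure drawback noted later in the Strengths and Limitations section.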

Two termination rules were set. The first was a standard error of measurement
(SEM) threshold of 0.47 required for stopping the CAT. This SEM was set based on
the internal consistency of the calibration sample (Cronbach alpha=.78), where
$SEM = SD \times \sqrt{1 - \text{reliability}}$ and SD is the person standard
deviation of the derivation sample of 29,314 cases. The person SE of the estimated
measure was computed from the information (item variances) of the items finished
in the CAT, $SE = 1/\sqrt{\sum_i \text{information}(i)}$, where i refers to the
CAT items responded to by a person [31]. The second termination rule was that each
person must answer at least 10 items, based on a simulation study on the data bank,
to attain a minimal average person reliability at a desired level (e.g., 0.78) [32].
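As a worked illustration of the two termination rules, the following Python sketch computes the SEM from a person SD and a reliability, computes the person SE from the information of the answered items, and checks whether the CAT may stop. It assumes that both rules must hold at the same time; the function names and the example information values are hypothetical.

```python
import math


def sem_threshold(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def person_se(item_informations):
    """Person SE = 1 / sqrt(sum of the information of the answered items)."""
    total = sum(item_informations)
    if total <= 0:
        raise ValueError("total information must be positive")
    return 1.0 / math.sqrt(total)


def should_stop(item_informations, sem=0.47, min_items=10):
    """Assumed rule: stop once at least min_items were answered AND SE <= SEM."""
    if len(item_informations) < min_items:
        return False
    return person_se(item_informations) <= sem


# With a person SD of about 1 logit and alpha = .78, the SEM is about 0.47.
print(round(sem_threshold(1.0, 0.78), 3))
print(should_stop([0.5] * 10))  # ten items, each contributing 0.5 information
```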

Simulation to Compare Efficiency and Precision of Computer Adaptive Testing

and Non-Adaptive Testing

Using the item parameters generated from the derivation cohort, 1000 cases

following a normal distribution (mean logit 0, SD logit 1) were simulated [33-35]

using three Rasch models (i.e., dichotomous, 5-point RSM, and PCM) with three

respective fixed-item scenarios (i.e., 10, 20, and 30 items; see Tables 4.1-4.3).
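A minimal sketch of the data generation for the dichotomous case is given below. It assumes person measures drawn from a normal distribution with mean 0 and SD 1 logit, and responses generated from the dichotomous Rasch model; the item difficulties, seed, and function name are hypothetical and do not reproduce the calibrated item bank.

```python
import math
import random


def simulate_rasch(n_persons, difficulties, seed=0):
    """Simulate dichotomous Rasch responses.

    Person measures are drawn from N(0, 1); the probability of scoring 1 on
    an item with difficulty b is exp(theta - b) / (1 + exp(theta - b)).
    """
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    responses = []
    for theta in thetas:
        row = []
        for b in difficulties:
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            row.append(1 if rng.random() < p else 0)
        responses.append(row)
    return thetas, responses


thetas, responses = simulate_rasch(1000, difficulties=[-1.0, 0.0, 1.0])
endorsement = [sum(row[j] for row in responses) / len(responses) for j in range(3)]
print(endorsement)  # easier items should be endorsed more often
```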

Table 4.1: 10, 20, or 30 items in static NAT format.

            Dichotomous         RSM                 PCM
Datasets    Mean      SE        Mean      SE        Mean      SE
10 items    -0.007    0.829     0.03      0.414     -0.179    0.398
20 items    -0.008    0.555     0.02      0.289     -0.19     0.272
30 items    0.045     0.439     -0.039    0.235     -0.084    0.224
CAT         -0.021    0.613     0.021     0.361     -0.154    0.32


Table 4.2: Precision of CAT.

            Dichotomous           RSM                   PCM
Precision   Diff. (%)a  Corr.b    Diff. (%)a  Corr.b    Diff. (%)a  Corr.b
10 items    0.40        0.863     0.30        0.952     0.00        0.931
20 items    0.00        0.957     0.00        0.988     0.00        0.986
CAT         0.13        0.925     0.05        0.958     0.10        0.946

a Diff. (%): Different number ratio compared to the 30-item dataset.
b Corr.: Correlation coefficient of person theta to NAT.

Table 4.3: Efficiency of CAT.

            Dichotomous               RSM                       PCM
Efficiency  CAT item length   %a      CAT item length   %a      CAT item length   %a
CAT         15.55             48.20   10                66.70   10.13             67.32

a Efficiency = 1 - CIL/30, where CIL is the CAT item length.

To allow testing of dichotomous and 5-point rating scale Rasch models, all item (or

step) difficulties were converted from the calibrated results of the PCM. The overall

difficulty for each item was designated to be the respective threshold of the

dichotomous scale. In contrast, the step difficulties of the 5-point RSM [21] ranged
from -2 to 2 in 1.0-logit increments, added to the overall difficulty of the
respective item as in the PCM.

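We calculated the comparative efficiency and precision for CAT and NAT by
varying the number of items presented (10, 20, and 30 items) and by testing the
difference in precision and efficiency compared to answering all available 31
items. Independent t tests were used to count the proportion of persons whose
estimates differed (the different number ratio, expected to be less than 5%),
using the following formula [36]:

$$t = \frac{|\theta_{cat} - \theta_{30}|}{\sqrt{SE_{cat}^{2} + SE_{30}^{2}}}$$

where the subscripts cat and 30 denote the person estimate and its SE under CAT and
under the 30-item dataset, respectively.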
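The t statistic and the different number ratio can be sketched as follows. This is an illustrative sketch under stated assumptions: a person is counted as "different" when t exceeds a two-sided critical value of 1.96, which is an assumed threshold, and the function names and example values are hypothetical.

```python
import math


def t_statistic(theta_cat, se_cat, theta_ref, se_ref):
    """t = |theta_cat - theta_ref| / sqrt(SE_cat^2 + SE_ref^2)."""
    return abs(theta_cat - theta_ref) / math.sqrt(se_cat ** 2 + se_ref ** 2)


def different_number_ratio(cat_estimates, ref_estimates, critical=1.96):
    """Share of persons whose CAT and reference estimates differ.

    Each estimate is a (theta, SE) pair; critical=1.96 is an assumed
    two-sided 5% threshold.
    """
    flagged = 0
    for (t_cat, s_cat), (t_ref, s_ref) in zip(cat_estimates, ref_estimates):
        if t_statistic(t_cat, s_cat, t_ref, s_ref) > critical:
            flagged += 1
    return flagged / len(cat_estimates)


cat = [(1.0, 0.3), (0.1, 0.3)]   # hypothetical CAT estimates
ref = [(0.0, 0.4), (0.0, 0.4)]   # hypothetical 30-item estimates
print(different_number_ratio(cat, ref))
```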

In addition, a comparison of average person SEs achieved across all different

conditions was made to verify precision for CAT and NAT. We ran an author-

created Visual Basic for Applications module in MS Excel to conduct the simulation

study (Figure 4.2) and mobile CAT.


RESULTS

Determining a Cut-Off Point

The mean and SD of skin cancer risk for participants without skin cancer (mean -0.79,
SD 1.67) or with skin cancer (mean 2.29, SD 2.21) were calculated and used to

determine the optimal cut-off point at 0.88 logit with sensitivity at 0.79 and

specificity at 0.74. Using this cut-off, the area under the ROC curve was 0.88 (see

Figure 4.3).

Figure 4.3: Determining a cut-off point


Simulation to Compare Efficiency and Precision of Computer Adaptive Testing

and Non-Adaptive Testing

Using simulation data, we found that using more items yielded higher Cronbach

alpha scores (Figure 4.4). Dichotomous scales had the lowest Cronbach alpha and

dimension coefficient [37]. The PCM scales had the highest Cronbach alpha. The

RSM scales gained the highest dimension coefficient.

As shown in Figure 4.4, CAT gained a relatively smaller SE corresponding to item

length (i.e., compared to NAT, shorter CATs result in larger SE). At equivalent

precision, CAT reduces the response burden by 48.20%, 66.70%, and 66.20%,

respectively for dichotomous, RSM, and PCM models. See Figure 4.5.
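The reduction in response burden follows from the efficiency definition given under Table 4.3 (efficiency = 1 - CIL/30, with CIL the CAT item length). A small Python sketch of that arithmetic, using a hypothetical function name, is:

```python
def efficiency(cat_item_length, full_length=30):
    """Efficiency = 1 - CIL / full test length."""
    return 1.0 - cat_item_length / full_length


# CAT item lengths taken from Table 4.3.
for model, length in [("Dichotomous", 15.55), ("RSM", 10), ("PCM", 10.13)]:
    print(model, round(100 * efficiency(length), 1))
```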

Figure 4.4: Generated with 3 Rasch Models

Mobile Computer Adaptive Testing Evaluating Skin Cancer Risk

We developed a mobile CAT survey procedure (see QR code in Figure 4.2 and

Multimedia Appendix 1) to practically demonstrate the newly designed PCM-type

CAT app in action. The CAT process was demonstrated item by item and is shown at

the top of Figure 4.6. Person theta is the provisional ability estimated by the CAT

module. The standard error at the bottom of Figure 4.6 was calculated as
1/sqrt(Σ information(i)), where i refers to the CAT presented items

responded to by a person [31]. In addition, the residual at the top of Figure 4.6 was


the average of the last five changes between the pre- and post-response ability
estimates at each CAT step. The CAT stops if the residual value is <0.05. The “corr”
refers to the correlation coefficient between the CAT estimated measures and the
step series numbers, using the last 5 estimated theta values. The flatter the theta
trend, the higher the probability that the person measure has converged to a final
estimate.
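The two convergence diagnostics can be sketched in Python as below. The sketch assumes that the "residual" averages the absolute changes between consecutive theta estimates and that "corr" is a Pearson correlation between the last five theta values and their step numbers; both readings, the function names, and the example series are assumptions for illustration.

```python
import math


def residual(thetas, window=5):
    """Mean absolute change between consecutive estimates over the last steps."""
    changes = [abs(b - a) for a, b in zip(thetas, thetas[1:])]
    recent = changes[-window:]
    return sum(recent) / len(recent)


def trend_corr(thetas, window=5):
    """Pearson correlation between the last theta values and their step index."""
    ys = thetas[-window:]
    xs = list(range(len(ys)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a perfectly flat trend has no defined correlation
    return sxy / math.sqrt(sxx * syy)


# Hypothetical sequence of provisional person measures.
thetas = [0.0, 0.5, 0.8, 0.9, 0.95, 0.97, 0.98, 0.985]
print(residual(thetas) < 0.05, round(trend_corr(thetas), 2))
```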

Figure 4.5: Efficiency and precision of CAT compared to using 10, 20 or 30
items in static NAT format.

DISCUSSION

Principal Findings

We used two different approaches to measure risk of skin cancer: non-adaptive

testing and computer adaptive testing. Using data from a very large cohort of more

than 43,000 people, we were able to show that our scale was able to accurately

identify people at highest risk for skin cancer. On our risk scale, we identified a very

high discriminatory accuracy of 0.88 (i.e., the proportion of area under ROC curve)

using a cut-off of 0.88 logits (the higher, the worse). Using CAT results in a smaller

SE at high efficiency (fewer items answered), and therefore without compromising

test precision, reduces response burden by 48.20%, 66.70%, and 66.20% for

dichotomous, RSM, and PCM models, respectively. A prototype mobile online CAT

for evaluating skin cancer risk has been developed and could be used to assess skin

cancer risk at considerable reduction of respondent burden.

Consistent with the literature [8,9,30,34,35], the efficiency of CAT over NAT was

supported for this skin cancer risk scale. We confirm the PCM-type CAT (i.e.,

different from others by using simpler Rasch family models) requires significantly

fewer items to measure a person’s risk than NAT but does not compromise the


precision of measurement. This mobile assessment could be used to quickly estimate

a person’s skin cancer risk and educate them about the need for skin protection on a

personal level [38-40]. We confirm that participants with a history of skin cancer had

a higher mean score of responses than those without a history of skin cancer.

Figure 4.6: A graphical CAT report shown after each response (top) and the decrease
in standard error with increasing item length during the CAT process (bottom)

Implications

Patients’ health literacy (e.g., understanding their own skin cancer risk) is

increasingly recognised as a critical factor affecting patient-physician

communication and health outcomes [14]. Adults with below basic or basic health

literacy are more likely than adults with higher health literacy to get information

about health issues from multimedia graphical representation [17], rather than the

traditional newspapers, magazines, books, brochures, or pamphlets. A brief CAT

such as the one we developed could be used to inform people quickly about their skin

cancer risk and how to improve their sun protection behaviours.


This CAT module is a practical tool that can gather responses from patients

efficiently and precisely. The tool offers diagnostics that can help practitioners assess

whether responses are distorted or abnormal. For example, outfit mean-square values

of 2.0 or greater suggest an unusual response. In instances where responses do not fit

with the model’s requirement, they can be highlighted for suspected cheating,

careless responding, lucky guessing, creative responding, or random responding [41];

otherwise, one can take follow-up action [8,34,35] if the result shows a high cancer

risk. For example, if a person’s measure/risk is 1.0 logit (i.e., log odds), their

probability of developing skin cancer approaches 0.53 (= exp(1 - 0.88)/(1 + exp(1 - 0.88))).

Interested readers can run a test of the mobile CAT through the QR code shown in

Figure 4.2.
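The probability in the example above is a logistic transformation of the distance between the person measure and the cut-off. A minimal Python sketch of that calculation (the function name is hypothetical) is:

```python
import math


def risk_probability(measure, cutoff=0.88):
    """Logistic transform of (measure - cutoff), both in logits."""
    return 1.0 / (1.0 + math.exp(-(measure - cutoff)))


print(round(risk_probability(1.0), 2))  # person measure of 1.0 logit
```

A person located exactly at the cut-off obtains a probability of 0.5 under this transformation.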

A mobile online CAT could be used for evaluating skin cancer risk and might reduce

the item length in clinical settings. The CAT can be improved in the future by

expanding the item pool allowing use among more diverse samples. It must be noted

that (1) item overall (i.e., on average) and step (threshold) difficulties of the

questionnaire must be calibrated in advance using Rasch analysis or other item

response theory models before creating an item bank, (2) pictures used for the

subject or response categories for each question should be well prepared with a Web

link that can be shown simultaneously with the item appearing in the animation

module of CAT, and (3) the module can be used with many kinds of models based on

item response theory.

Strengths and Limitations

There are two major forms of standardised assessments in clinical settings [42]: (1) a

traditional self-administered questionnaire, and (2) a rapid short-form scale [43,44].

Each has its advantages and drawbacks. Traditional pencil-and-paper questionnaires

have a large respondent burden, often because they require patients to answer

questions that do not provide additional information about their risk of disease in

order to achieve adequate measurement precision [45]. CAT can target the optimal

question for a specific person and therefore end at an appropriate number of items

more economically according to the required SE (i.e., the criterion of person

reliability). However, along with the advantages offered by CAT, there are some

drawbacks as well, such as impossibility of estimating the ability in case of all


extreme responses, CAT algorithms requiring serious item calibration, several items

from the item bank being overexposed, and other test items not being used at all [46].

The strengths of this study include its very large sample size of more than 40,000

participants, permitting detailed analysis of the performance of questionnaire items

and the ability to further test the performance of the items in a validation dataset. We

simulated data by varying the types of models and item length to execute the CAT.

(Interested readers who wish to see the video demonstration or use the MS Excel-

type module can contact the corresponding author).

As with all forms of Web-based technology, advances in mobile health (mHealth)

and health communication technology are rapidly emerging [47]. Use of mobile

online CAT is promising and worth considering in many fields of health assessment,

similar to its prominent role in education and staff selection testing. However,

several issues should be considered more thoroughly in further studies. First, the
scale’s Cronbach alpha (.78 in the 29,314 cases studied), sensitivity (0.79), and
specificity (0.74) are slightly low. Second, the CAT module has a potential
limitation for people using languages other than English because the interface may
need to be modified for use in the real world. A multiple language interface should be

developed in the future. Third, the CAT graphical representation shown in Figure 4.6

might be confusing and difficult to interpret for people unfamiliar with CAT and may

need to be improved to become a standard part of CAT routine.

CONCLUSIONS

The PCM-type CAT for skin cancer risk can reduce respondents’ burden without

compromising measurement precision and increases endorsement efficiency. The

CAT module can be used for mobile phones and easy online assessment of patients’

disease risks. This is a novel and promising way to capture information about skin

cancer risk, for example while waiting outside physician consultation offices.

Authors’ Contributions

All authors read and approved the final manuscript. ND and T-WC developed the

study concept and design. MJ and CMO analysed and interpreted the data. ND, T-

WC, and DCW drafted the manuscript, and all authors provided critical revisions for

important intellectual content. The study was supervised by T-WC.


Conflicts of Interest

None declared

Abbreviations

BCC: basal cell carcinoma

CAT: computer adaptive testing

DIF: differential item functioning

NAT: non-adaptive testing

PCM: Partial Credit Model

ROC: receiver operating characteristic

RSM: Rating Scale Model

SCC: squamous cell carcinoma

SE: standard error

SEM: standard error of measurement


REFERENCES

1. Australian Institute of Health and Welfare & Australasian Association of

Cancer Registries. Cancer in Australia: an overview. Cancer series no. 74.

Cat. no. CAN 70. Canberra: AIHW; 2012.
URL: http://www.aihw.gov.au/WorkArea/DownloadAsset.aspx?id=60129542353 [accessed 2015-11-20]

2. Narayanan DL, Saladi RN, Fox JL. Review: Ultraviolet radiation and skin

cancer. International Journal of Dermatology. 2010; 49(9):978-986.

3. World Health Organization. Global Solar UV Index: A Practical Guide.

Geneva: World Health Organization; 2002.

URL:http://www.who.int/uv/publications/en/GlobalUVI.pdf [accessed

2015-11-20]

4. Linacre JM. True-score reliability or Rasch statistical validity? Rasch

Measurement Transactions. 1996; 9(4), 455.

5. Rasch G. Probabilistic models for some intelligence and attainment tests.

Chicago: University of Chicago Press; 1960.

6. Lerdal A, Kottorp A, Gay CL, Grov EK, Lee KA. Rasch analysis of the

Beck Depression Inventory-II in stroke survivors: A cross-sectional study.

Journal of Affective Disorders. 2014; 158:48-52.

7. Forkmann T, Boecker M, Wirtz M, et al. Development and validation of

the Rasch-based depression screening (DESC) using Rasch analysis and

structural equation modelling. Journal of Behavior Therapy and

Experimental Psychiatry. 2009; 40(3):468-478.

8. Sauer S, Ziegler M, Schmitt M. Rasch analysis of a simplified Beck

Depression Inventory. Personality and Individual Differences. 2013;

54(4):530–535.

9. Chien TW, Wang WC, Huang SY, Lai WP, Chow CJ. A Web-Based

Computerized Adaptive Testing (CAT) to Assess Patient Perception in

Hospitalization. J Med Internet Res. 2011; 13(3):e61.

10. Ma SC, Chien TW, Wang HH, Li YC, Yui MS. Applying computerized

adaptive testing to the Negative Acts Questionnaire-Revised: Rasch

analysis of workplace bullying. Journal of Medical Internet Research.

2014; 16(2):e50.

11. Djaja N, Youl P, Aitken J, Janda M. Evaluation of a skin self examination

attitude scale using an item response theory model approach. Health and

Quality of Life Outcomes. 2014; 12(6).

12. Bjorner JB, Kreiner S, Ware JE, Damsgaard MT, Bech P. Differential

Item Functioning in the Danish Translation of the SF-36. Journal of

Clinical Epidemiology. 1998; 51(11):1189-1202.

13. Ruo B, Choi SW, Baker DW, Grady KL, Cella D. Development and

Validation of a Computer Adaptive Test for Measuring Dyspnea in Heart

Failure. Journal of Cardiac Failure. 2010; 16(8):659-668.



14. Williams MV, Davis T, Parker RM, Weiss BD. The role of health literacy

in patient-physician communication. Fam Med. 2002; 34(5):383-389.

15. Lee HY, Rhee TG, Kim NK. Cancer literacy as a mediator for cancer

screening behaviour in Korean adults. Health Soc Care Community. 2015;

May 14.

16. Kim K, Han HR. Potential links between health literacy and cervical

cancer screening behaviors: a systematic review. Psychooncology. 2015;

Jun 18.

17. Cutilli CC, Bennett IM. Understanding the Health Literacy of America

Results of the National Assessment of Adult Literacy. Orthop Nurs. 2009;

28(1): 27–34.

18. Olsen CM, Green AC, Neale RE, et al. Cohort profile: The QSkin Sun and

Health Study. International Journal of Epidemiology. 2012; 41(4):929-929i.

19. Masters GN. A Rasch model for partial credit scoring. Psychometrika

1982; 47(2), 149-174.

20. Lai WP, Chien TW, Lin HJ, Su SB, Chang CH. A screening tool for

dengue fever in children. The Pediatric Infectious Disease Journal. 2013;

32(4):320-324.


Psychometrika. 1978; 43, 561-73.

22. Smith RM, Suh KK. Rasch fit statistics as a test of the invariance of item

parameter estimates. J Appl Meas 2003; 4(2):153-163.

23. Holland PW, Wainer H. Differential Item Functioning. Hillsdale. NJ:

Lawrence Erlbaum. 1993.

24. Tennant A, Pallant J. DIF matters: A practical approach to test if

Differential Item Functioning makes a difference. Rasch Measurement

Transactions. 2007; 20(4),1082-1084.

25. Linacre JM. RUMM2020 Item-Trait Chi-Square and Winsteps DIF Size.

Rasch Mea Trans. 2007; 21(1):1096.

26. Linacre JM. WINSTEPS. URL: http://www.winsteps.com/index.htm

[accessed 2014-03-27].

27. Linacre JM. Optimizing rating scale category effectiveness. J Appl Meas

2002; 3(1):85-106.

28. Stephan C, Wesseling S, Schink T, Jung K. Comparison of eight computer

programs for receiver-operating characteristic analysis. Clinical

Chemistry. Mar 2003; 49(3):433-439.

29. Bock RD, Aitkin M. Marginal maximum likelihood estimation of item

parameters: Application of an EM algorithm. Psychometrika. Dec 1981;

46(4):443-459.


30. Embretson SE, Reise SP. Item response theory for psychologists.

Lawrence Erlbaum Associates; 2000.

31. Linacre JM. Computer-Adaptive Tests (CAT), Standard Errors and

Stopping Rules. Rasch Measurement Transactions. 2006; 20(2):1062.

32. Hsueh IP, Chen JH, Wang CH, Hou WH, Hsieh CL. Development of a

computerized adaptive test for assessing activities of daily living in

outpatients with stroke. Physical Therapy. 2013; 93(5):681-693.

33. Linacre JM. How to Simulate Rasch Data. Rasch Measurement

Transactions. 2007; 21(3):1125.

34. Chien TW, Wu HM, Wang WC, Castillo RV, Chou W. Reduction in

patient burdens with graphical computerized adaptive testing on the ADL

scale: tool development and simulation. Health and Quality of Life

Outcomes. 2009; 7:39.

35. Wainer H, Dorans NJ, Flaugher R, Green BF, Mislevy RJ. Computerized

adaptive testing: A primer. Routledge; 2000.

36. Smith EV Jr. Detecting and evaluating the impact of multidimensionality

using item fit statistics and principal component analysis of residuals. J

Appl Meas. 2002; 3(2):205-231.

37. Chien T. Cronbach's alpha with the dimension coefficient to jointly assess

a scale's quality. Rasch Meas Trans. 2012; 26(3):1379

38. Robinson KJ, Gaber R, Hultgren B, et al. Skin Self-Examination

Education for Early Detection of Melanoma: A Randomized Controlled

Trial of Internet, Workbook, and In-Person Interventions. J Med Internet

Res. 2014; 16(1):e7.

39. Brady MS, Oliveria SA, Christos P, et al. Patterns of detection in patients

with cutaneous melanoma. Cancer. 2000; Jul 15;89(2):342-347.

40. Berwick M, Begg C, Fine J, Roush G, Barnhill R. Screening for

Cutaneous Melanoma by Skin Self-Examination. JNCI Journal of the

National Cancer Institute. 1996; Jan 03;88(1):17-23

41. Karabatsos G. Comparing the aberrant response detection performance of

thirty-six person-fit statistics. Applied Measurement in Education. 2003;

16(4), 277-298.

42. Eack SM, Singer JB, Greeno CG. Screening for anxiety and depression in

community mental health: the Beck Anxiety and Depression Inventories.

Community Mental Health Journal. 2008; 44(6):465-474.

43. Shear MK, Greeno C, Kang J, Ludewig D, et al. Diagnosis of

nonpsychotic patients in community clinics. The American Journal of

Psychiatry. 2000; 157(4):581-587.

44. Ramirez BM, Bostic JQ, Davies D, et al. Methods to improve diagnostic

accuracy in a community mental health setting. The American Journal of

Psychiatry. Oct 2000; 157(10):1599-1605.


45. De Beurs PD, de Vries LMA, de Groot HM, de Keijser J, Kerkhof JFMA.

Applying Computer Adaptive Testing to Optimize Online Assessment of

Suicidal Behavior: A Simulation Study. J Med Internet Res. 2014; 16(9):e207.

46. Antal M, Imre A. Computerized adaptive testing: implementation issues.

Acta Univ. Sapientiae, Informatica, 2, 2, 2010; 168–183

47. Mitchell JS, Godoy L, Shabazz K, Horn BI. Internet and Mobile

Technology Use Among Urban African American Parents: Survey Study

of a Clinical Population. Journal of Medical Internet Research. 2014; 16(1):e9.

Chapter 5: Diagnostic Discrimination of the Skin Cancer Risk (SCR) Scale: Application of Item Response

Theory

Diagnostic Discrimination of the Skin Cancer Risk (SCR) scale: Application of

Item Response Theory

Ngadiman Djaja1,2, Monika Janda1,2, Catherine M. Olsen3, David C. Whiteman1,2,3

1 School of Public Health and Social Work, Institute for Health and Biomedical

Innovation, Queensland University of Technology, Brisbane, Australia.

2 National Health and Medical Research Council Centre for Research Excellence in

Sun and Health (CRESH), Brisbane, Australia

3 QIMR Berghofer Medical Research Institute, Brisbane, Australia.

Citation

Djaja, N., M. Janda, C. M. Olsen and D. C. Whiteman (2015). Diagnostic

Discrimination of the Skin Cancer Risk (SCR) scale: Application of Item Response

Theory. International Outcome Measurement Conference. Chicago.


ABSTRACT

Aims:

Queensland, Australia has the world’s highest incidence of skin cancer. Self-

administered scales are commonly used to measure risk factors such as phenotype,

sun exposure and sun protection, or overall skin cancer risk (SCR). We sought to

develop new scales for measuring skin cancer risk and calibrate them using the Partial Credit Model (PCM).

Subjects:

Prospective skin cancer risk cohort of 43,794 men and women aged 40–69 years

randomly sampled from the population of Queensland, Australia.

Analysis:

Dimensionality of the scale and calibration of items were studied using the Partial

Credit Model. Receiver operating characteristic (ROC) curve analyses were used to

assess how well the final items predicted future development of skin cancer.

Results:

Four of twenty nine items had mean square values outside acceptable boundaries,

indicating item misfit. Item calibration found item measures between -2.800 and
+1.950 logits on the SCR scale. Diagnostic discrimination showed area under the
curve (AUC) statistics of .753 (p < .001), .530 (p < .001) and .487 (p = .093), for the
phenotype (PH), sun exposure (SE) and sun protection (SP) subscales, respectively.

Conclusion:

The results show unidimensional structure of each SCR subscale. Item calibration

shows they are distributed along the continuum. Only the PH subscale shows good

predictive discrimination.



INTRODUCTION

Queensland (Australia) has the highest rates of melanoma and other skin cancers in

the world [1, 2]. On average, two out of every three Australians will be treated for

skin cancer at some stage during their lives, and skin cancers form approximately 80

percent of all new cancers diagnosed [2]. In the USA, 76,100 new cases of
melanoma and 9,710 deaths were estimated for 2014 [3].

To appropriately stratify patients for management and counselling, doctors are

seeking tools to accurately estimate a person’s future risk of skin cancer. Many self-

administered questionnaires [4-6] have been developed. However, the measures often

do not appear to have been developed rigorously according to current psychometric

standards such as validity, reliability, errors of measurement, norms, or score

comparability [7]. The few measures which reported their psychometric properties

used classical test theory [8-11]. The purpose of this paper was to apply a Partial

Credit Model (PCM) to establish optimal scale composition and establish predictive

validity to contribute evidence for the value of measuring skin cancer risk (SCR) by

self-report.

METHODS

Participants

Data were obtained from the QSkin Sun and Health Study, a prospective cohort study of

43,794 men and women aged 40–69 years randomly sampled from the population of

Queensland, Australia in 2011 [12]. The primary aim of the QSkin study is to

improve understanding of skin cancer risk. The QSkin study was approved by the

human research ethics committee of the QIMR Berghofer Medical Research

Institute.

Instruments

The present study used 29 items of the SCR scale for measuring three subscales: (1)

Phenotype/PH (twelve items), (2) Sun Exposure/SE (eleven items), and (3) Sun

Protection/SP (six items). A partial credit score coded the response to each item, with
scores ranging from 0 (low risk) to 5 (high risk).



Procedure and Analysis

IRT analyses

ACER ConQuest software [13] was used to calibrate models and examine item-level
fit statistics for each of the SCR subscales separately; two-thirds of the study
population was used for item calibration. In this study the Rasch PCM [14] was used
because it is most appropriate for modelling items with more than two ordered
response categories.
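For reference, the Partial Credit Model can be written as below. The notation is the standard textbook form and is an assumption here, since the chapter does not state the symbols: $\theta_n$ is the measure of person $n$, $\delta_{ik}$ is the $k$-th step difficulty of item $i$, and $m_i$ is the highest score category of item $i$.

```latex
P(X_{ni} = x \mid \theta_n) =
  \frac{\exp\left( \sum_{k=0}^{x} (\theta_n - \delta_{ik}) \right)}
       {\sum_{h=0}^{m_i} \exp\left( \sum_{k=0}^{h} (\theta_n - \delta_{ik}) \right)},
  \qquad x = 0, 1, \dots, m_i,
  \qquad \text{with } \sum_{k=0}^{0} (\theta_n - \delta_{ik}) \equiv 0 .
```

Because each item has its own set of step difficulties, items with different numbers of response categories can be calibrated on the same logit scale.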

Diagnostic discrimination

To investigate the diagnostic discrimination of the SCR scale, we used receiver
operating characteristic (ROC) curve analysis. For this purpose, item parameters
obtained from the calibration sample served as anchor parameters to estimate the SCR
of the validation sample (one-third of the study population, 13,178 persons: 11,528
with no keratinocyte cancer (KC) and 1,650 (12.5%) with reported KC), and the
resulting risk score was evaluated for how well it predicted the development of a
new skin cancer.
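The area under the ROC curve used in this analysis equals the probability that a randomly chosen person with the outcome has a higher risk score than a randomly chosen person without it. The Python sketch below computes it directly from that definition; it is an illustration with hypothetical scores, not the software routine used in the study, and its pairwise loop would be slow for a sample of 13,178 persons, where a rank-based formula is preferable.

```python
def auc(scores_pos, scores_neg):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical risk scores (logits) for persons with and without KC.
with_kc = [0.9, 1.4, 0.2, 2.1]
without_kc = [-0.5, 0.3, -1.2, 0.8, 0.1]
print(auc(with_kc, without_kc))
```

An AUC of 0.5 corresponds to chance-level discrimination, which is the benchmark against which the subscale AUC values are read.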

RESULTS

Item fit

Unweighted and weighted mean square (MNSQ) statistics were calculated to

examine item fits. Adam and Kho [15] in Wilson [16] recommended that these values
should be within the tolerance bounds of 0.7-1.3. Table 5.1 shows the fit statistics
for all items. Three items (SE1, SE2 and SE3) in the Sun Exposure (SE) subscale and
one item (SP6) in the Sun Protection (SP) subscale showed misfit, and were removed
from further analyses.
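The screening rule can be sketched in Python using the weighted MNSQ values reported in Table 5.1 for the SE and SP subscales. The sketch assumes that the 0.7-1.3 bounds are applied to the weighted MNSQ, which matches the items marked as misfitting in the table; the function name is hypothetical.

```python
# Weighted MNSQ values for the SE and SP items, taken from Table 5.1.
weighted_mnsq = {
    "SE1": 1.45, "SE2": 1.37, "SE3": 1.31, "SE4": 0.98, "SE5": 0.89,
    "SE6": 0.83, "SE7": 0.86, "SE8": 0.93, "SE9": 0.84, "SE10": 0.72,
    "SE11": 0.76,
    "SP1": 0.88, "SP2": 0.84, "SP3": 0.90, "SP4": 0.85, "SP5": 1.02,
    "SP6": 1.39,
}


def misfitting_items(mnsq_by_item, lower=0.7, upper=1.3):
    """Return the items whose MNSQ falls outside the tolerance bounds."""
    return [item for item, value in mnsq_by_item.items()
            if value < lower or value > upper]


print(misfitting_items(weighted_mnsq))
```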


Table 5.1: Item parameter estimations and fit statistics of skin cancer risk (SCR) scale

Item Description Estimate Error Unweighted fit Weighted fit

MNSQ T MNSQ T

Phenotype (PH) scale

PH1 Sex -0.043 0.015 1.01 0.80 1.01 4.00

PH2 Skin colour -2.682 0.027 0.94 -5.50 0.96 -3.80

PH3 Skin burn ability -0.37 0.011 0.96 -4.20 0.96 -5.00

PH4 Skin tan -0.652 0.011 1.13 11.80 1.11 10.70

PH5 Eye colour -0.311 0.005 1.03 2.70 1.02 3.80

PH6 Hair colour 0.294 0.008 1.09 7.80 1.07 11.90

PH7 Freckles 0.49 0.01 0.96 -4.10 0.97 -3.50

PH8 Moles 0.48 0.014 1.01 0.80 1.01 1.10

PH9 Sunbeds use 0.977 0.022 1.10 9.40 1.02 0.60

PH10 Number of skin cancer cut off 0.76 0.014 0.90 -9.30 0.94 -5.30

PH11 Number of skin cancer frozen 0.291 0.006 0.91 -8.90 0.94 -5.20

PH12 Close blood have melanoma 0.767*

0.99 -1.30 0.99 -1.20

Sun exposure (SE) scale

SE1 Sunburn frequency when child 1.06 0.014 1.54 43.60 1.45a 33.30

SE2 Sunburn frequency when teenager 0.442 0.012 1.41 33.80 1.37 a 30.70

SE3 Sunburn frequency when adult 1.094 0.015 1.40 33.00 1.31 a 22.90

SE4 Outdoor duration – weekday – past year 0.585 0.009 0.99 -1.00 0.98 -2.00

SE5 Outdoor duration – weekday – age 10-19 -0.584 0.011 0.88 -11.40 0.89 -12.00

SE6 Outdoor duration – weekday – age 20-29 -0.036 0.009 0.81 -19.10 0.83 -19.70

SE7 Outdoor duration – weekday – age 30-39 0.248 0.009 0.83 -16.50 0.86 -16.20


SE8 Outdoor duration – weekend – past year -0.086 0.009 0.93 -6.70 0.93 -7.90

SE9 Outdoor duration – weekend – age 10-19 -1.272 0.014 0.81 -18.90 0.84 -15.10

SE10 Outdoor duration – weekend – age 20-29 -0.927 0.012 0.69 -32.30 0.72 -31.20

SE11 Outdoor duration – weekend – age 30-39 -0.526* 0.75 -25.80 0.76 -27.10

Sun Protection (SP) scale

SP1 SPF to face 0.487 0.017 0.83 -17.40 0.88 -16.70

SP2 SPF to hands -1.273 0.021 0.63 -40.50 0.84 -11.10

SP3 SPF to other body parts -2.227 0.027 0.59 -45.10 0.90 -4.50

SP4 Use of SPF 0.576 0.017 0.79 -21.00 0.85 -20.90

SP5 Sunscreen usage frequency last year 0.607 0.013 1.01 0.90 1.02 1.70

SP6 Hat usage frequency last year 1.830* 1.45 37.10 1.39a 34.20

a Items beyond acceptable boundaries.
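As an illustration of the screening rule described in the text, the following minimal Python sketch flags items whose weighted MNSQ falls outside the 0.7-1.3 tolerance bounds. It uses a subset of the weighted MNSQ values from Table 5.1. The function name `flag_misfit` and the choice of rows are assumptions made for illustration; this is not the software used in the original analysis.

```python
# Weighted MNSQ values copied from a subset of rows in Table 5.1.
weighted_mnsq = {
    "PH4": 1.11, "SE1": 1.45, "SE2": 1.37, "SE3": 1.31, "SE4": 0.98,
    "SE10": 0.72, "SP2": 0.84, "SP3": 0.90, "SP5": 1.02, "SP6": 1.39,
}

LOWER, UPPER = 0.7, 1.3  # tolerance bounds quoted in the text


def flag_misfit(mnsq_by_item, lower=LOWER, upper=UPPER):
    """Return the item codes whose MNSQ lies outside [lower, upper]."""
    return [item for item, value in mnsq_by_item.items()
            if value < lower or value > upper]


misfitting = flag_misfit(weighted_mnsq)
print(misfitting)
```

For this subset, the flagged items are the same four items (SE1, SE2, SE3 and SP6) that the text reports as removed.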



The item and person map in Figure 5.1 shows the calibration of the PH subscale items and the position of persons on the SCR continuum. The common logit scale is represented on the vertical line in the centre of the map. An “X” in the person column represents the position of a person on the skin cancer risk continuum; in this large dataset, each “X” represents a group of 103.6 persons.

Figure 5.1: Items person map of PH subscale.



Figure 5.2: Items person map of SE subscale.



Figure 5.3: Items person map of SP subscale.



Person measures on the common logit scale make it possible to identify the most likely response to each item in the SCR scale. For example, in Figure 5.4, Person P has a moderate PH score (PH = 0.5). For item PH3 (location -0.37), the most likely response of this person is option 3 (burn moderately), and for item PH10 (location 0.76) the most likely response is option 3 (2-10 skin cancers).

Figure 5.4: Example of most probable response for a person with skin cancer risk in

PH scale of 0.5 logits.

Category Probabilities Curves

The Category Probability Curves (CPC) represent the ability threshold parameters of the item steps (m). These curves provide information on the functioning of the response alternatives. The intersections between the curves (thresholds) define the limits of the “most probable response regions” on the scale continuum. As shown in Figure 5.5 (CPC of item PH3), every response category is the most probable in some section of the continuum, which indicates that the categories are functioning properly. For item PH3, response 0 is the most probable for persons located between -∞ and -1.99 logits, response 1 is the most probable for persons located between -1.99 and -0.15 logits, and so on.

Figure 5.5: Category probability curves of item PH3
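The most probable response regions for item PH3 can be reproduced with a short sketch of partial credit model category probabilities. The sketch assumes that the thresholds are the points where adjacent category curves intersect; the first two values (-1.99 and -0.15) are quoted in the text and the third (1.04) is read from the labels of Figure 5.5. The function names are hypothetical, and categories are numbered from 0.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Partial credit model probabilities for categories 0..m at location theta."""
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum of 0.
    exponents = [0.0]
    for delta in thresholds:
        exponents.append(exponents[-1] + (theta - delta))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def most_probable_category(theta, thresholds):
    """Index (0-based) of the category with the highest probability."""
    probs = pcm_category_probabilities(theta, thresholds)
    return max(range(len(probs)), key=probs.__getitem__)


PH3_THRESHOLDS = [-1.99, -0.15, 1.04]

for theta in (-3.0, -1.0, 0.5, 2.0):
    print(theta, most_probable_category(theta, PH3_THRESHOLDS))
```

Under these assumptions, a person located at 0.5 logits falls in the third region (category index 2), which corresponds to the third response option when options are counted from 1.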

Validation sample characteristics

Table 5.2: Skin cancer risk score for each subscale in validation sample

Subscale | No KC, M (SD) | KC, M (SD) | p value
Phenotype | -.429 (.149) | -.303 (.158) | p < 0.001
Sun Exposure | .350 (1.514) | .500 (1.489) | p < 0.001
Sun Protection | .253 (2.161) | .157 (2.172) | p = 0.095

[Figure 5.5 labels: x-axis, skin cancer risk; y-axis, probability; category thresholds 1 = -1.99, 2 = -0.15, 3 = 1.04; response regions R1 to R4]

Table 5.2 shows that people with KC had higher scores on the PH (p < 0.001) and SE (p < 0.001) subscales than people with no KC. Scores on the SP subscale were lower among people with KC, but this difference was not statistically significant (p = 0.095).


The optimal cut-off score for each of the three subscales was assessed using ROC analysis, with prediction of KC as the outcome variable. The area under the curve (AUC) for the PH, SE and SP subscales was .753 (p < .001), .530 (p < .001) and .487 (p = 0.093), respectively, indicating that the PH scale differentiated well between people who will or will not develop a new KC, whereas the SE and SP scales had little or no predictive ability.
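To make the AUC interpretation concrete, the sketch below computes the AUC as the proportion of case-control pairs in which the person with KC has the higher subscale score, with ties counted as one half. The scores are hypothetical values invented for illustration and are not data from this study; the function name `auc` is likewise an assumption.

```python
def auc(scores_cases, scores_controls):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    concordant = 0.0
    for case in scores_cases:
        for control in scores_controls:
            if case > control:
                concordant += 1.0
            elif case == control:
                concordant += 0.5
    return concordant / (len(scores_cases) * len(scores_controls))


# Hypothetical phenotype logit scores, invented for illustration only.
kc_scores = [-0.20, -0.31, -0.35, -0.10]
no_kc_scores = [-0.45, -0.40, -0.33, -0.50]

example_auc = auc(kc_scores, no_kc_scores)
print(round(example_auc, 4))
```

An AUC near 0.5, as reported for the SE and SP subscales, means that the score ranks a person with KC above a person without KC about as often as chance would.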

DISCUSSION

This study showed that an IRT-calibrated PH subscale has good ability to predict the development of future skin cancers, while less can be gained from the SP or SE subscales.

This study has some limitations. First, the study population consisted of people from the location with the highest incidence of skin cancer in the world; thus the calibrated instrument may not suit other populations. Second, the SP subscale only measures use of sunscreen, and not other ways of protecting oneself from the sun. In future studies, we aim to add more items, such as protective clothing and related protective behaviours.

Figure 5.6: ROC curve



REFERENCES

1. Queensland Cancer Registry, Cancer in Queensland: Incidence and

Mortality, 1982 to 2007. Cancer Council Queensland: Brisbane, Australia.

2010.

2. Australian Institute of Health and Welfare, Australian Cancer Incidence and

Mortality (ACIM) Books. Canberra: Australian Institute of Health and

Welfare. 2012.

3. Siegel R, DeSantis C, Jemal A. Colorectal cancer statistics, 2014. CA: a

cancer journal for clinicians. 2014; Mar 1;64(2):104-17.

4. Mackie R, Freudenberger T, Aitchison TC. Personal risk-factor chart for

cutaneous melanoma. The Lancet. 1989; Aug 26;334(8661):487-90.

5. Tacke J, Dietrich J, Steinebrunner B, Reifferscheid A. Assessment of a new

questionnaire for self-reported sun sensitivity in an occupational skin cancer

screening program. BMC dermatology. 2008; Oct 24;8(1):1.

6. Weinstock MA. Assessment of sun sensitivity by questionnaire: validity of

items and formulation of a prediction rule. Journal of clinical epidemiology.

1992; May 31;45(5):547-52.

7. American Educational Research Association, American Psychological

Association, and National Council on Measurement in Education, Standards

for educational and psychological testing. 1999; Amer Educational Research

Assn.

8. de Troya-Martin M, Blázquez-Sánchez N, Rivas-Ruiz F, et al. Validation of a

Spanish questionnaire to evaluate habits, attitudes, and understanding of

exposure to sunlight:“the beach questionnaire”. Actas Dermo-Sifiliográficas

(English Edition). 2009; Dec 31;100(7):586-95.

9. Tripp MK, Carvajal SC, McCormick LK, et al. Validity and reliability of the

parental sun protection scales. Health Education Research. 2003; Feb 1;

18(1):58-73.

10. Glanz K, McCarty F, Nehl EJ, et al. Validity of self-reported sunscreen use

by parents, children, and lifeguards. American journal of preventive medicine.

2009; Jan 31; 36(1):63-9.

11. Hedges T, Scriven A. Young park users’ attitudes and behaviour to sun

protection. Global health promotion. 2010; Dec 1; 17(4):24-31.

12. Olsen CM, et al. Cohort profile: the QSkin sun and health study. International

journal of epidemiology. 2012; Aug 1; 41(4):929-i.

13. Adams R, Wu M, Wilson M. ACER ConQuest 3.0.1. ACER: Melbourne,

Australia. 2013.

14. Masters GN. A Rasch model for partial credit scoring. Psychometrika. 1982

Jun 1; 47(2):149-74.



15. Adams RJ, Khoo ST. Quest: the interactive test analysis system. Melbourne: Australian Council for Educational Research. 1996.

16. Conrad KJ, Wilson M. Constructing measures: An item response modeling

approach. Erlbaum Associates Mahwah, NJ. Evaluation and Program

Planning. 2005; Nov 30; 28(4):433-4.

Chapter 6: Development and Psychometric Evaluation of Item Banks for the Assessment of Skin Cancer Risk

Using Item Response Theory 123

Development and Psychometric Evaluation of Item Banks for the Assessment of

Skin Cancer Risk Using Item Response Theory

Ngadiman Djaja1,3,4*, David C. Whiteman1,4,6, Philippa Youl1,4,5, Katherine M. White2,3, Michael Kimlin1,4,7, Monika Janda1,3,4

1 School of Public Health and Social Work, Faculty of Health, Queensland

University of Technology, Brisbane, Australia

2 School of Psychology and Counselling, Faculty of Health, Queensland University

of Technology, Brisbane, Australia

3 Institute of Health and Biomedical Innovation, Queensland University of

Technology, Brisbane, Australia

4 National Health and Medical Research Council Centre for Research Excellence in

Sun and Health, Institute of Health and Biomedical Innovation, Queensland

University of Technology, Brisbane, Australia

5 Cancer Council Queensland, Brisbane, Australia

6 QIMR Berghofer Medical Research Institute, Brisbane, Australia

7 Health Research Institute (HRI), The University of the Sunshine Coast, Australia


ABSTRACT

Objective: Accurate assessment of skin cancer risk using self-reported questionnaire items is important for epidemiological studies and appropriate targeting of interventions. We assessed the psychometric properties of previously used items for assessing skin cancer risk, and then evaluated those with good properties for their reliability and stability using item response theory.

Methods: A cohort of 1,177 participants aged 18-75 years living in Queensland,

Australia completed an online questionnaire between Winter and Spring 2015. The

questionnaire contained 51 items from a previously developed skin cancer risk item

bank. We assessed whether items measured risk on a unidimensional scale, and

whether item response categories represented increasing levels of risk. We examined

internal consistency using Cronbach’s alpha. To measure scale stability over time,

201 of these participants completed the questionnaire again within eight to ten

weeks. We measured the discriminative accuracy of the tool by calculating the area under the receiver operating characteristic curve (AUC) for correctly identifying people with previous self-reported melanoma or keratinocyte skin cancers.

Results: Three of 19 questions from the phenotype scale were removed due to misfit

with the model. All items from the sun exposure and sun protection subscales

showed good fit. Internal consistency was high (Cronbach alpha range: 0.73-0.89), as

was stability over time (retest coefficient: 0.74-0.95). Diagnostic discriminatory

accuracy was high for self-reported history of melanoma for the phenotype subscale

(AUC 0.72, 95% CI 0.65-0.78); moderate for the sun exposure scale (AUC 0.62,

95% CI 0.54-0.69); and low for the sun protection scale (AUC 0.36, 95% CI 0.29-

0.43). Similar discriminative ability scores were observed for self-reported non-

melanoma skin cancer (AUC phenotype scale 0.82, 95% CI 0.78-0.86; AUC sun

exposure scale 0.61, 95% CI 0.56-0.66; AUC sun protection scale 0.36, 95% CI,

0.31-0.41).

Conclusions: The newly derived skin cancer risk assessment scale performs well and could be used in classical paper-and-pencil or computer adaptive assessment. Due to its brevity and precision, it may be an attractive tool for clinicians or researchers seeking to measure personal skin cancer risk.



Keywords

Skin cancer, skin cancer risk, sun protection, sun exposure, Item Response Theory;

Partial Credit model, Rasch model, psychometrics



INTRODUCTION

Questionnaires (also commonly called surveys or scales) are frequently used in health research to gather information on health-related behaviours, especially those that are not easily observed directly. They are crucial for epidemiological studies, and are also commonly used to evaluate the outcomes of health interventions. Before a questionnaire can be used, however, its measurement properties (objectivity, dimensionality, reliability, validity, absence of differential item functioning, sensitivity to change, and discrimination between known groups) must be demonstrated [1]. The International Epidemiological Association European questionnaire group highlighted the need to improve the quality of questionnaires, especially those used to assess risk factors, given that they provide information essential for health policy planning [2, 3].

In skin cancer prevention research, several data collection methods besides questionnaires are commonly used. These include sun diaries to record time outdoors or clothing worn [4], skin swabbing to assess whether sunscreen has been applied [5], direct observation of sun exposure or sun protection behaviours [6], and ultraviolet radiation dosimeters [5, 7]. Compared with questionnaires, these methods are often more burdensome for participants and researchers and are usually much more costly. Although questionnaires are convenient and cost-effective, to date no standardised or commonly agreed questionnaire is available for measuring behaviours related to skin cancer risk, sun exposure and sun protection. Instead, many questionnaires used in intervention studies, population studies or randomised controlled trials [8] have been developed de novo. They differ in content and design, and many have not been formally validated [9]. Some previous efforts have aimed to provide a standard for measuring skin cancer related behaviours. For example, Glanz [10] proposed standardised core survey items for the measurement of sun exposure and sun protection practices in epidemiologic research. Similarly, the National Human Genome Research Institute [11] recommended a questionnaire to assess the main melanoma risk factors, such as family history, number of nevi, sun exposure, freckling tendency and skin type [12-14]. A recent systematic review analysed twenty-five risk models for the prediction of melanoma, with 144 possible risk factors identified [15]; only four validation articles were included in the synthesis. While all models demonstrated good discrimination, most did not assess psychometric properties.

Therefore, the objective of the current study was to develop a reliable and valid questionnaire to measure skin cancer risk using Item Response Theory (IRT) models. IRT is a modern psychometric approach that has rarely been applied to the development of skin cancer risk questionnaires, although it is widely used in other fields such as educational [16, 17] and patient-reported outcome assessment [18, 19]. IRT models have several advantages over classical test theory: they allow more precise estimates of the outcome of interest; they allow confirmation that the outcome measure is unidimensional and that the response scale is used consistently by participants; they make it possible to equate and link different scales that measure the same underlying construct [20]; they provide information about each individual item’s reliability, including whether an item is biased toward a certain group (called differential item functioning) [21]; and they allow the creation of computer adaptive, tailored questionnaire presentation modes that facilitate economical assessment [22-24]. Thus, we used data from previous studies to select the best performing questionnaire items to derive a new risk assessment scale (SunAus scale), and then tested the psychometric properties of this scale in a newly recruited sample of participants.

METHODS

Sample

This study was approved by the institutional ethics committee at Queensland

University of Technology (1200000553) and was undertaken in compliance with the

ethical guidelines of the National Health and Medical Research Council (NHMRC).

An online sample of 1,177 participants aged 18 years and older living in Queensland,

Australia was recruited during the southern hemisphere Winter (June-August) 2015.

During these months the ultraviolet radiation index commonly ranges between 4

(moderate) and 10 (very high) depending on latitude. Recruitment was conducted

through traditional and social media including email lists and a study Facebook page

(https://www.facebook.com/SunSurveyAustralia), Twitter, local radio and

newspaper, as well as an online research panel. Inclusion criteria were age ≥18 years

and currently living in Queensland. Exclusion criteria were: unable to access the

Internet, and problems with reading or understanding the English language. At the end of their baseline online survey, participants recruited via university, social media and local channels were invited to retake the survey within eight to ten weeks to examine test-retest reliability and stability over time.

Figure 6.1: Recruitment of participants.

Development of SunAus Scale

We started the development of the new scale by assessing the quality of items from

four previous studies conducted in Queensland which had captured information on

skin cancer risk factors using different items and instruments. The Melanoma

Screening Trial (MST) was a randomised trial of population screening for melanoma

in Queensland with a total of 3,110 participants (1,559 men and 1,551 women) [25].

The second study was the Queensland Cancer Risk Study (QCRS), a population-based study of 9,419 Queensland residents aged 20-75 years that aimed to describe the population prevalence of key cancer risk behaviours in Queensland [26]. The third study, QSkin Sun and Health (QSkin) [27], was a population-based cohort study of 43,794 men and women aged 40-69 years randomly sampled from the population of Queensland, Australia, in 2011. Lastly, we used data from the AusD Study (AusD), a population-based cross-sectional study (n = 1,002 participants) [28]. AusD was designed to assess vitamin D status and its determinants across a range of latitudes and seasons. It also aimed to identify the association between participants’ attitudes about vitamin D and their self-reported changes to sun-protection or exposure behaviours.

[Figure 6.1 flowchart: baseline SunAus participants, N = 1,177 (university, social media, local channel, n = 677; online panel, n = 500); test-retest within 8 weeks, n = 201]



From each of the study questionnaires, we grouped items into the following three subscales: phenotype, sun exposure behaviours, and sun protection behaviours. All items were assessed for their item response theory properties, including item fit and category disordering (non-sequential categories), and only items with good psychometric properties were retained for further development. An example of category disordering was the eye colour question, for which the source studies used different answer categories. The QSkin study offered the options blue, grey, green, hazel and brown, whereas the MST and QCRS offered the options blue or grey, green or hazel, and brown or black. An examination of the category thresholds showed that the analyst-assigned category order did not accord with the underlying latent construct and that the average measures for the categories were out of sequence [29]. This item showed no category disordering after we combined the responses into three categories: brown or black, green or hazel, blue or grey. More detailed information about disordered thresholds can be found in Andrich [30] and Adams [31]. After assessing item fit and category disordering, proposed items were presented for content validation to four content experts (two psychology researchers and two epidemiologists) with extensive experience in skin cancer research [32]. The purpose of the discussion was to eliminate redundant items and add new items not covered in previous studies. In total, 240 items were reviewed; 30 items were selected, 15 items were changed, and 8 items were added. Most of the changes were made to make items more specific (e.g., adding a ‘volunteer/unpaid’ response option to questions asking about main occupation) or to collapse answer categories (e.g., in the eye colour question, brown, black, green, hazel, blue and grey became brown or black, green or hazel, blue or grey). Overall, the new scale included 53 items measuring a broad range of possible determinants of skin cancer. An overview of the item content is given in Table 6.1, and a full description of the items is available in the appendix. These 53 items, as well as demographic questions (year of birth, sex, education, employment status, language usually spoken at home, and ethnicity), were administered online.
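The collapsing of eye colour categories described above amounts to a simple recoding step, sketched below in Python. The dictionary and function names are hypothetical, and the ordinal codes 0 to 2 (following the order brown or black, green or hazel, blue or grey given in the text) are an assumption made for illustration rather than the coding used in the study.

```python
# Mapping from single-colour answers to the three collapsed categories.
EYE_COLOUR_COLLAPSE = {
    "brown": "brown or black", "black": "brown or black",
    "green": "green or hazel", "hazel": "green or hazel",
    "blue": "blue or grey", "grey": "blue or grey",
}

# Assumed ordinal order of the collapsed categories (codes 0, 1, 2).
CATEGORY_ORDER = ["brown or black", "green or hazel", "blue or grey"]


def collapse_eye_colour(raw_answer):
    """Return (collapsed label, ordinal code) for a raw questionnaire answer."""
    cleaned = raw_answer.strip().lower()
    # Answers that already use the paired wording are kept unchanged.
    label = cleaned if cleaned in CATEGORY_ORDER else EYE_COLOUR_COLLAPSE[cleaned]
    return label, CATEGORY_ORDER.index(label)


print(collapse_eye_colour("Hazel"))
print(collapse_eye_colour("blue or grey"))
```

Recoding both answer formats onto the same three categories is what allows items from the different source studies to be calibrated together.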



Table 6.1: Overview of items measured on the SunAus Scale

Domain: Phenotype | Item ID | Domain: Sun Exposure | Item ID | Domain: Sun Protection | Item ID
Skin colour | 1 | Use of solarium | 9 | Sunscreen – face | 26
Skin type | 2, 3, 4 | Attempt to get a suntan | 11 | Sunscreen – other body parts | 27
Hair colour | 5 | Lifetime sunburn | 17 | Frequency of sunscreen use | 28
Eye colour | 6 | Last 12 months sunburn | 18 | Sunscreen use during last weekend | 29
Moles at 18 yrs | 7 | Lifetime sunburn – child | 19a | Wear a broad-brimmed hat | 30a
Freckles at 18 yrs | 8 | Lifetime sunburn – teenager | 19b | Wear a cap | 30b
Melanoma status | 12 | Lifetime sunburn – adult | 19c | Wear any other head covering | 30c
NMSC status | 13 | Main occupation – lifetime | 20 | Wear a shirt with long sleeves | 30d
Number of skin cancer that has been cut-off | 14 | Main occupation – current | 21 | Wear long trousers | 30e
Number of sunspots that has been frozen | 15 | Weekdays sun exposure | 22 | Wear sunglasses | 30f
Close blood relatives with skin cancer | 16 | Weekends sun exposure | 23 | Stay in the shade | 30g
Moles larger than 2mm | 31 | Weekdays sun exposure – 5 to 12 years | 24a | Use an umbrella | 30h
 | | Weekdays sun exposure – 13 to 19 years | 24b | |
 | | Weekdays sun exposure – 20 to 39 years | 24c | |
 | | Weekdays sun exposure – 40 to 65 years | 24d | |
 | | Weekdays sun exposure – After 65 years | 24e | |
Moles larger than 5mm | 32 | Weekends sun exposure – 5 to 12 years | 25a | Limit time in the sun during peak UV hours | 30i
 | | Weekends sun exposure – 13 to 19 years | 25b | |
 | | Weekends sun exposure – 20 to 39 years | 25c | |
 | | Weekends sun exposure – 40 to 65 years | 25d | |
 | | Weekends sun exposure – After 65 years | 25e | |
Ancestor – Father’s father | 33a | | | |
Ancestor – Father’s mother | 33b | | | |
Ancestor – Mother’s father | 33c | | | |
Ancestor – Mother’s mother | 33d | | | |

Data Analysis

To derive the optimal items for testing, data from all four existing studies were calibrated using the IRT model and misfitting items were removed, followed by expert content review to determine the final items. Figure 6.2 shows the data analysis process in this study.



Figure 6.2: Steps in data analysis

Item response theory

[Figure 6.2 flowchart: MST data, QCRS data, QSkin data and AusD data; IRT analysis of existing items; items grouped into 3 domains; focus group discussion and telephone interview; final items: 53 items; IRT analysis of new items; final items]

Item response theory comprises a series of probabilistic models that describe the relationship between a non-observable behaviour (also called a latent trait) and a person’s response to each questionnaire question (also called an item, hence the name Item Response Theory) (see [33-35] for further information regarding item response theory frameworks and estimation methods). The locations (thresholds) along the continuum of the latent trait were estimated for each item. A commonly used item response model, the Rasch Partial Credit Model (PCM) [36], was used for item calibration in our study because it allows each item on the scale to have its own set of category thresholds. ACER ConQuest [37], a Rasch-based item response theory software package, was used to evaluate the psychometric properties of our new scale. The IRT analysis steps were undertaken twice: the first analysis assessed the characteristics of existing items from the four studies, and the second analysis assessed the final items suggested by subject matter experts.
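For reference, the partial credit model is commonly written in the form below. The notation is an assumption introduced here for illustration and does not appear in the original text: θ_n is the location of person n on the latent trait, δ_ij is the j-th threshold of item i, and m_i is the highest response category of item i.

```latex
P(X_{ni} = x \mid \theta_n) =
  \frac{\exp\left[\sum_{j=0}^{x} (\theta_n - \delta_{ij})\right]}
       {\sum_{k=0}^{m_i} \exp\left[\sum_{j=0}^{k} (\theta_n - \delta_{ij})\right]},
  \qquad x = 0, 1, \dots, m_i,
  \qquad \text{with } \sum_{j=0}^{0} (\theta_n - \delta_{ij}) \equiv 0 .
```

Because the thresholds δ_ij carry the item index i, each item can have its own number of categories and its own threshold spacing, which is the property the text cites as the reason for choosing this model.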

The following data quality parameters were assessed:

Assessment of unidimensionality

We assessed unidimensionality to identify the presence of additional explanatory dimensions in the data, using fit statistics and principal components analysis (PCA) of the Rasch residuals [34]. Mean square fit statistics summarise, for each person and item, the Rasch residuals, that is, the responses that differ from what the Rasch model predicts. A large number of misfitting items indicates a multidimensional construct (a single theoretical concept that is measured by several related constructs). In the PCA of the Rasch residuals, unidimensionality was considered supported when the first latent dimension explained at least 50% of the total variance and the unexplained variance in the first contrast (factor) was less than 10% [38].
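The idea behind a PCA of residuals can be sketched as follows: correlate the item residuals across persons and examine the largest eigenvalue of that correlation matrix, since a large first eigenvalue suggests that some items share variance not captured by the Rasch dimension. The sketch below is a simplified pure-Python illustration using power iteration and a tiny invented residual table; it is not the procedure implemented in the software used for this study, and the share it reports is relative to residual variance only, not the total-variance percentages quoted in the text.

```python
import math


def correlation_matrix(rows):
    """Pearson correlations between the columns of a persons-by-items table."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centred = [[r[j] - means[j] for j in range(k)] for r in rows]
    norms = [math.sqrt(sum(c[j] ** 2 for c in centred)) for j in range(k)]
    return [[sum(c[a] * c[b] for c in centred) / (norms[a] * norms[b])
             for b in range(k)] for a in range(k)]


def largest_eigenvalue(matrix, iterations=500):
    """Power iteration; adequate for a small symmetric correlation matrix."""
    k = len(matrix)
    vector = [1.0 / math.sqrt(k)] * k
    value = 0.0
    for _ in range(iterations):
        product = [sum(matrix[a][b] * vector[b] for b in range(k))
                   for a in range(k)]
        value = math.sqrt(sum(p * p for p in product))
        vector = [p / value for p in product]
    return value


# Invented residuals: items 1 and 2 move together, item 3 is unrelated.
residuals = [[1, 1, 1], [-1, -1, 1], [1, 1, -1], [-1, -1, -1]]

corr = correlation_matrix(residuals)
first_eigenvalue = largest_eigenvalue(corr)
first_share = first_eigenvalue / len(corr)  # trace of a correlation matrix = k
print(round(first_eigenvalue, 6), round(first_share, 4))
```

In this toy example the first two items form a clear secondary cluster, so the first eigenvalue is well above 1, the value expected when residuals are unrelated noise.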

Assessment of item fit

To determine item fit, infit and outfit mean square (MNSQ) statistics were calculated. These fit statistics specify how well each item fits the Rasch partial credit model and therefore help to identify problematic items. Although there are no commonly agreed criteria for infit and outfit mean square values, Wilson [33] suggested that values between 0.75 and 1.33 indicate good fit. Mean square statistics less than 1.0 (also called overfit) show that the Rasch model predicts the data too well, causing summary statistics (e.g., reliability) to be inflated. On the other hand, mean square statistics greater than 1.0 (also called underfit) show unpredictability and un-modelled noise, indicating that there is another source of variance in the data.
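The two mean square statistics can be illustrated with a minimal sketch for a single dichotomous Rasch item; the partial credit case follows the same logic with category expectations and variances in place of p and p(1 - p). Outfit is the unweighted mean of squared standardised residuals, and infit weights the squared residuals by the model variance. The function names and the example responses are hypothetical.

```python
import math


def rasch_probability(theta, delta):
    """Probability of a positive response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def item_fit(responses, thetas, delta):
    """Return (infit, outfit) mean squares for one dichotomous item."""
    squared_residuals, variances = [], []
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, delta)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: mean of squared standardised residuals (sensitive to outliers).
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: information-weighted mean square.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


# One expected response and one highly unexpected response (invented data).
infit, outfit = item_fit(responses=[1, 0], thetas=[0.0, 3.0], delta=0.0)
print(round(infit, 3), round(outfit, 3))
```

The single unexpected response from a person located far from the item inflates outfit much more than infit, which is why the two statistics are usually reported side by side.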

Assessment of item difficulty, content coverage and item targeting

Item-person maps (also called Wright maps) were used to show item difficulty, content coverage and item targeting for the new scale [33]. Wright maps consist of two vertical histograms (see Figure 2). The left-hand histogram shows the distribution of the measured latent trait (e.g., skin cancer risk) of the participants, from most at risk (x’s located at the top left of the map) to least at risk (x’s located at the bottom left of the map). The right-hand histogram shows the distribution of the items, from the most difficult (high risk items/answer categories) at the top to the least difficult (low risk items) at the bottom. The Wright map can also be used to assess content coverage and item targeting of a scale by visually inspecting whether items are spread evenly along the latent trait.

Assessment of validity and reliability

For concurrent validation, correlations between scores from the newly developed questionnaire and two previously published questionnaires were calculated. The first previously published questionnaire measured sun exposure and sun protection habits (the sun protection habits scale) [10] and the second measured phenotypical risk of skin cancer (the 7-item skin cancer protocol from PhenX Measures) [11]. Both questionnaires were administered online. Although previous studies [5, 10, 39-42] have addressed validation, no comprehensive study has assessed the internal consistency, test-retest reliability, concurrent validity and criterion validity of either of these previously published questionnaires. We expected that the core measures of sun exposure and sun protection habits would correlate at least moderately with our new sun exposure and sun protection subscales. Likewise, the PhenX measures should correlate highly with our new phenotype subscale.

To assess reliability, we evaluated internal consistency by calculating Cronbach's alpha coefficient and person separation reliability. To test questionnaire stability over time, we used data from 201 participants who completed the questionnaire a second time after eight to ten weeks. The coefficient of test stability was calculated as the Pearson product-moment correlation between baseline and retest logits (the logit is the mathematical unit of Rasch measurement; values on this scale are termed locations rather than scores).
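The two reliability coefficients named above can be computed as in the following sketch. Cronbach's alpha is calculated from item and total score variances, and test stability as the Pearson correlation between baseline and retest logits. All data are hypothetical values invented for illustration, the function names are assumptions, and missing responses are not handled.

```python
import math


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def cronbach_alpha(responses):
    """responses: one list of item scores per person (no missing values)."""
    n_items = len(responses[0])
    item_variances = [sample_variance([person[j] for person in responses])
                      for j in range(n_items)]
    total_variance = sample_variance([sum(person) for person in responses])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical data, invented for illustration only.
item_responses = [[1, 2, 1], [2, 3, 2], [3, 3, 4], [0, 1, 1], [4, 4, 3]]
baseline_logits = [-0.8, -0.1, 0.4, 1.1, 1.9]
retest_logits = [-0.6, 0.0, 0.3, 1.3, 1.7]

alpha = cronbach_alpha(item_responses)
stability = pearson_r(baseline_logits, retest_logits)
print(round(alpha, 3), round(stability, 3))
```

Using logits rather than raw scores for the retest correlation keeps the stability coefficient on the same interval scale as the Rasch person measures.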

RESULTS

Characteristics of the study sample

Sociodemographic characteristics of the 1,177 study participants are presented in Table 6.2. Participants were aged between 18 and 75 years (median 37 years). Most participants were female (76.3%), born in Australia (73.0%), Caucasian (82.4%) and indoor workers (79.9%); 41.8% were university educated and about one third were full-time workers (32.4%).

The majority of participants reported medium or fair skin and brown or lighter hair. Over 60% reported green, hazel, blue or grey eyes.

Table 6.2: Characteristics of Study Participants (N=1,177) and 2011 Australia census

– Queensland (QLD) State only [43]

Characteristics Participants QLD population

No. % No. %

Sex

Female 898 76.3 2,184,519 50.4

Male 272 23.1 2,148,220 49.6

Missing 7 0.6

Age

Less than 25 years 340 28.9 1,463,625 33.8

25 - 34 years 292 24.8 587,406 13.6

35 - 44 years 211 17.9 620,750 14.3

45 - 54 years 195 16.6 590,886 13.6

More than 55 years 139 11.8 1,070,072 24.7

Median age, years (range) 37 18-75

Country of birth

Australia 859 73.0

Other Countries 312 26.5

Missing 6 0.5

Ethnic origin

Caucasian 970 82.4

Other 199 16.9

Missing 8 0.7

Mother’s father ethnicity

Asia and Middle East 187 15.9

S.Europe, E.Europe, N.Europe 201 17.1

Scotland and England 170 14.4

Australia and New Zealand 595 50.6

Missing 24 2.0

Mother’s mother ethnicity





Missing 22 1.9

Father’s father ethnicity







Missing 24 2.0

Father’s mother ethnicity





Missing 26 2.2

Highest qualification

No school certificate or other

qualification

22 1.9

School or intermediate certificate 76 6.5

Higher school or leaving certificate 289 24.6

Trade / apprenticeship 61 5.2

Certificate / diploma 228 19.4

University degree or higher 492 41.8

Missing 9 0.8

Employment status

Full-time worker 458 32.4

Part-time worker 297 21.0

Home duties 74 5.2

Unemployed 70 5.0

Student 410 29.0

Retired 69 4.9

Other 36 2.5

Indoor/outdoor work

Mainly indoors 941 79.9

Half indoors and half outdoors 191 16.2

Mainly outdoors 36 3.1

Missing 9 0.8

Self-reported skin colour

Black 8 0.7

Olive/Brown 153 13.0

Medium 323 27.4

Fair 691 58.7

Missing 2 0.2

Natural hair colour

Black 211 17.9

Brown 668 56.8

Blonde 221 18.8

Red 46 3.9

Missing 31 2.6

Eye colour

Brown or Black 434 36.9

Green or Hazel 364 30.9

Blue or Grey 377 32.0

Missing 2 0.2



A comparison of the SunAus study cohort with the 2011 Queensland census data [43] showed that SunAus participants were more likely than the Queensland population to be female (76.3% vs. 50.4%) and aged 25-34 years (24.8% vs. 13.6%).

Unidimensionality

The unidimensionality assumption was met for the phenotype and sun exposure subscales. The first factor explained 65.3% of the variance for the phenotype subscale (with 6.3% unexplained variance in the first contrast) and 50.5% for the sun exposure subscale (with 7.9% unexplained variance in the first contrast). However, the sun protection subscale did not meet the assumption of unidimensionality, as its first factor explained only 44.3% of the variance, with 10.5% unexplained variance in the first contrast.
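The decision rule applied to these results can be written out explicitly. The sketch below applies the two criteria stated in the methods (at least 50% of variance explained by the first dimension and less than 10% unexplained variance in the first contrast) to the percentages reported above; the function and variable names are hypothetical.

```python
# (variance explained by first dimension %, unexplained variance in first contrast %)
reported = {
    "phenotype": (65.3, 6.3),
    "sun exposure": (50.5, 7.9),
    "sun protection": (44.3, 10.5),
}


def meets_unidimensionality(explained, first_contrast,
                            min_explained=50.0, max_contrast=10.0):
    """Apply both criteria; each must hold for the assumption to be met."""
    return explained >= min_explained and first_contrast < max_contrast


results = {name: meets_unidimensionality(*values)
           for name, values in reported.items()}
print(results)
```

The sun protection subscale fails both criteria, whereas the sun exposure subscale passes the 50% criterion only narrowly.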

Assessment of item fit

Tables 6.3-6.5 show the item locations and item fit statistics of the skin cancer risk subscales. Two items from the phenotype subscale, SCR31 (outfit MNSQ = 1.55) and SCR32 (outfit MNSQ = 1.44), had fit indices > 1.33 and likely did not contribute to the scale’s ability to differentiate participants’ skin cancer risk [33]. These items asked about the number of moles on the left upper arm (SCR31 = larger than 2mm and SCR32 = larger than 5mm). After removing these items and recalibrating the scale, we found that item 7 (SCR07), “When you were 18 years age, how many moles did you have on your skin?”, had a fit index of 1.39, and it was removed as well. All other items in the three subscales showed good fit.



Table 6.3: Item parameter estimations and fit statistics of phenotype scale

Item order  Item code  Estimate  INFIT MNSQ  INFIT T  OUTFIT MNSQ  OUTFIT T

1 SCR01 -2.428 0.98 -0.5 1.02 0.4

2 SCR02 -1.388 0.97 -0.8 0.97 -0.8

3 SCR03 -0.442 1.03 0.7 1.01 0.2

4 SCR04 -0.191 1.12 2.8 1.10 2.6

5 SCR05 0.159 0.88 -2.9 0.88 -2.8

6 SCR06 -0.374 0.89 -2.6 0.92 2.7

7* SCR07 0.431 1.12 2.7 1.11 .4

8 SCR08 0.667 1.02 0.4 0.99 -0.3

9 SCR12 2.665 0.96 -1.0 0.99 -0.1

10 SCR13 1.788 0.81 -4.8 0.94 -0.8

11 SCR14 1.270 0.86 -3.4 0.97 -0.3

12 SCR15 0.982 1.10 2.3 1.06 0.6

13 SCR16 0.034 0.86 -3.4 0.91 -4.5

14* SCR31 0.301 2.85 30.4 1.55 9.1

15* SCR32 0.703 5.40 54.9 1.44 3.9

16 SCR33a -1.022 0.86 -3.5 0.81 -5.1

17 SCR33b -1.048 0.78 -5.8 0.78 -5.9

18 SCR33c -1.035 0.81 -4.8 0.79 -5.7

19 SCR33d -1.072 0.80 -5.2 0.76 -6.1

MNSQ (mean square) is the item goodness-of-fit statistic of the Rasch model. The Estimate column gives the item location in logits, expressed as the difference between the item measure for each item and the mean item measure for the 19 items.

* misfit item



Table 6.4: Item parameter estimations and fit statistics of sun exposure behaviours scale

Item Order  Item Code  Estimate  INFIT MNSQ  INFIT T  OUTFIT MNSQ  OUTFIT T

1 SCR9 2.038 1.20 4.6 1.04 0.5

2 SCR11 0.808 1.16 3.6 1.11 2.8

3 SCR17 -0.800 1.22 4.9 1.20 4.9

4 SCR18 1.409 1.05 1.3 1.13 1.7

5 SCR19a -0.008 1.14 3.1 1.13 3.4

6 SCR19b -0.716 0.99 -0.2 0.99 -0.1

7 SCR19c 0.377 1.07 1.5 1.06 1.3

8 SCR20 1.117 0.99 -0.2 0.97 -0.6

9 SCR21 1.466 0.99 -0.3 1.00 0.1

10 SCR22 0.651 0.97 -0.6 0.95 -1.3

11 SCR23 -0.640 0.97 -0.8 0.97 -0.8

12 SCR24a -1.089 1.05 1.2 1.05 1.3

13 SCR24b -0.946 0.90 -2.3 0.91 -2.6

14 SCR24c 0.095 0.82 -4.2 0.86 -3.7

15 SCR24d 0.377 0.84 -2.6 0.87 -2.3

16 SCR24e 0.018 0.89 -0.7 0.94 -0.5

17 SCR25a -1.656 1.01 0.3 1.00 -0.0

18 SCR25b -1.447 0.86 -3.5 0.86 -4.0

19 SCR25c -0.772 0.83 -4.1 0.84 -4.5

20 SCR25d -0.222 0.83 -2.7 0.84 -2.9

21 SCR25e -0.057 0.91 -0.5 0.95 -0.3






Table 6.5: Item parameter estimations and fit statistics of sun protection behaviours scale

Item Order  Item Code  Estimate  INFIT MNSQ  INFIT T  OUTFIT MNSQ  OUTFIT T

1 SCR26 0.084 0.91 -2.3 0.92 -4.2

2 SCR27 -1.432 0.74 -6.8 0.91 -1.5

3 SCR28 0.126 1.00 0.1 1.03 0.8

4 SCR29 -0.323 0.85 -3.8 0.90 -4.0

5 SCR30a -0.087 1.02 0.6 1.02 0.6

6 SCR30b -0.030 1.22 5.0 1.17 4.3

7 SCR30c -0.689 1.08 1.8 1.06 0.9

8 SCR30d -0.151 0.89 -2.7 0.90 -2.5

9 SCR30e 0.172 1.10 2.4 1.07 1.8

10 SCR30f 1.146 1.15 3.6 1.14 3.7

11 SCR30g 1.117 0.98 -0.5 0.98 -0.6

12 SCR30h -0.817 0.84 -4.1 0.91 -1.4

13 SCR30i 0.884 1.03 0.7 1.03 0.8




Assessment of item difficulty, content coverage and item targeting

The Wright map for the phenotype subscale (Figure 2) shows that respondents’ answers placed them between -3.5 and +2.5 logits (also called log-odds units). Participants with high skin cancer risk based on their phenotypic characteristics were located at around 0.90 logits, and those with the lowest skin cancer risk at around -2.50 logits. Few respondents were found at either extreme; most were located between -1.00 and +0.50 logits.

Inspection of the Wright map also shows no evidence of ceiling or floor effects, with all participants located between the lowest- and highest-risk items. This result means that the scale has good content coverage (spread) of the latent construct being measured, and that all items targeted the latent construct well. The easiest item is item 1.1 (SCR01): “How would you rate your natural skin colour on areas never exposed to the sun (on the underside of your arm)?”, located at > -3 logits. The most difficult item is item 9 (SCR12): “Have you ever been diagnosed with melanoma?”, located at > +2 logits. The Wright maps for the other two subscales are presented in the supplementary file and revealed similar patterns, although the sun protection behaviour subscale had fewer items. Items of both subscales were spread well across the latent construct continuum, suggesting that the content matched the distribution of the participants.



Figure 6.3: The Wright map for phenotype subscale.



The map represents the relationship between person risk and item difficulty measures in logits. Each “X” on the left side of the map represents 8.3 participants. The labels on the right side of the map show the item and the step, respectively. The Wright map shows Thurstonian thresholds for each of the items. The notation x.y indicates the y-th threshold of the x-th item. As an example, the red circle marks item 5 (hair colour), which has four answer categories (black, brown, blonde and red). Item 5 therefore has three thresholds (number of answer categories minus 1), and 5.1 denotes the first threshold of item 5.

Assessment of standard error of measurement

Figure 3 shows the standard error of measurement at each logit along the Rasch scale continuum. The phenotype subscale had low measurement error for respondents between 0 and +1 logits and high error for respondents below 0 logits. The sun exposure behaviour subscale had low measurement error for respondents at each logit except for those at -2.00 logits or lower (i.e., people with low exposure to the sun). For the sun protection behaviour subscale, the standard error of measurement was high for respondents above 1 logit. This result means that the new scale best measures people with moderate skin cancer risk.
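The pattern described above, low error where items are concentrated and high error towards the extremes, follows from the standard error being the inverse square root of the test information. The sketch below shows this for dichotomous Rasch items only; the item difficulties are hypothetical and are not the SunAus estimates, and the polytomous items used in this study would need the partial credit form of the information function.

```python
import math

def item_information(theta, b):
    """Fisher information of a dichotomous Rasch item at person location `theta`."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def standard_error(theta, difficulties):
    """Standard error of measurement: 1 / sqrt(sum of item information)."""
    info = sum(item_information(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(info)

# Hypothetical item difficulties (logits), clustered around the middle of the scale.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]

for theta in (-3.0, 0.0, 3.0):
    print(theta, round(standard_error(theta, difficulties), 2))
```

Under these assumptions the error is smallest near 0 logits, where the items sit, and grows for people located far from any item.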

Figure 6.4: Distribution of the standard error of measurement for each domain: a) phenotype, b) sun exposure, c) sun protection



Concurrent Validity

Concurrent validity was assessed by correlating each subscale’s person risk scores (person locations, called theta scores in item response theory) with the total scores obtained for the core measures of sun exposure and sun protection habits [10] and the PhenX measure [11]; the results are reported in Table 6. All three subscales of the skin cancer risk scale were significantly correlated with the measures of sun exposure and sun protection habits and the PhenX measure. The results also show that the phenotype subscale had the highest correlation (r=0.77, p<0.001) with the PhenX measure, and the sun exposure subscale had the lowest but still substantial correlation (r=0.52, p<0.001) with Glanz’s measure of sun exposure habits.

Table 6.6: Correlation between the SunAus scale, core measures of sun exposure and sun protection habits, and the PhenX measure

Comparison measure  SunAus subscale  r  N

PhenX  Phenotype  0.77*  877**

Core measures of sun exposure habits  Sun Exposure  0.52*  999**

Core measures of sun protection habits  Sun Protection  0.62*  978**

* p < .001 (2-tailed)

** Sample size varies between scales

Test Reliability

Internal consistency for all subscales was good, with Cronbach's alpha coefficients of 0.86, 0.88 and 0.73 for the phenotype, sun exposure and sun protection subscales, respectively. Person separation reliability showed similar indices for the three subscales (0.86, 0.82 and 0.72, respectively).

When we examined stability over time, we found high test-retest reliability at eight weeks after the first assessment for the phenotype (r=0.95, n=201, p<0.001), sun exposure behaviour (r=0.80, n=201, p<0.001) and sun protection behaviour (r=0.74, n=201, p<0.001) subscales.
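For readers unfamiliar with the internal consistency index reported above, the following sketch computes Cronbach's alpha from a small respondents-by-items score matrix using the usual formula, alpha = k/(k-1) × (1 - sum of item variances / variance of total scores). The response data are invented for illustration and are unrelated to the study sample.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)

# Hypothetical responses: 5 respondents x 4 items.
responses = [
    [0, 1, 1, 0],
    [1, 2, 2, 1],
    [2, 2, 3, 2],
    [3, 3, 3, 2],
    [1, 1, 2, 1],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Alpha approaches 1 when items rank respondents in the same order, which is why it is read as an index of how consistently the items measure one construct.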


The optimal cut-off scores for all three subscales were assessed using ROC analysis, with participants’ self-reported diagnosis of melanoma (“Have you ever been diagnosed with melanoma?”) or non-melanoma skin cancer (“Have you ever been diagnosed with other sorts of skin cancer (keratinocyte cancer, basal cell carcinoma, or squamous cell carcinoma)?”) as the outcome. The areas under the curve (AUC) for melanoma were 0.72 (95% CI, 0.65-0.78), 0.62 (95% CI, 0.54-0.69) and 0.36 (95% CI, 0.29-0.43) for the phenotype, sun exposure and sun protection subscales, respectively; for non-melanoma skin cancer they were 0.82 (95% CI, 0.78-0.86), 0.61 (95% CI, 0.56-0.66) and 0.36 (95% CI, 0.31-0.41), respectively.
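The AUC can be read as the probability that a randomly chosen person with the outcome has a higher score than a randomly chosen person without it. The sketch below computes it by direct pairwise comparison; the theta scores are hypothetical, and the study's own ROC analysis was presumably run in a statistical package rather than with code like this.

```python
def auc(case_scores, noncase_scores):
    """Probability that a random case outscores a random non-case (ties count 0.5)."""
    wins = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_scores) * len(noncase_scores))

# Hypothetical theta scores for people with and without a skin cancer history.
cases = [0.9, 0.4, 1.2, -0.1]
noncases = [-0.8, 0.1, -0.3, 0.5, -1.1]

print(auc(cases, noncases))
# Reversing the direction of the score mirrors the AUC around 0.5.
print(auc([-s for s in cases], [-s for s in noncases]))
```

On this reading, an AUC of 0.5 means no discrimination, and a value below 0.5 means that higher scores were, if anything, more common among people without the outcome.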

Figure 6.5: ROC Curves of outcome variables

Conversion of raw scores to Rasch-scaled scores (theta scores)

We have provided conversion tables for researchers who wish to use the SunAus scale and gain the interval-scoring benefits of Rasch analysis without performing the analysis themselves. The tables convert the raw (ordinal) SunAus scores to Rasch measurement estimates. These tables and the questionnaire can be obtained by contacting the corresponding author by email or by visiting our study website (www.sunaus.org).




DISCUSSION

To date, many different skin cancer risk scales have been developed in Australia and

worldwide, but to our knowledge, no prior instruments were developed by

systematically assessing the performance and discriminant ability of each individual

item within the scale. Identifying individuals at high risk of developing skin cancer is

important for future studies, as well as counselling and prevention, and may also aid

early detection of skin cancer [44, 45].

Our risk prediction model differs from previous models which focused mainly on

phenotypic factors and used logistic regression models to develop their prediction

algorithms [46-49]. We used item response theory to assess skin cancer risk and

validate it against commonly used measures of phenotype and sun exposure. This

method is becoming increasingly popular in clinical research and health outcome

research [50-54]. In contrast to questionnaires developed using classical

psychometric approaches in which reliability and validity are only calculated for the

whole scale, IRT allows the assessment of each item’s individual contribution. This

method can significantly improve the reliability and accuracy of measurement while

providing significant reductions in assessment time through implementation of

computer adaptive testing [55].

The analyses found that 50 items had good fit with the model, and provided adequate

estimates of the underlying latent construct. The analyses furthermore suggested that,

from a statistical perspective, the three items used to assess participants’ mole counts

and sizes did not fit within the underlying construct. The misfit may have occurred

due to the way the questions were constructed, or due to either random or systematic

misclassification. After removing the misfitting items, the three subscales provided reasonable content coverage for all respondents, as shown by the item-person maps, with moderate to high concurrent validity against previously validated questionnaires. Future iterations of the SunAus scale will need better items measuring moliness, as this is commonly seen as one of the most important risk factors for skin cancer [13, 14, 46, 48, 56].

Our assessment of test-retest reliability of the SunAus scale showed high stability over time, with coefficients of 0.95, 0.80 and 0.74 for the phenotype, sun exposure behaviour and sun protection behaviour subscales, respectively. Our study yielded similar reliabilities to the earlier Glanz study [40], despite using different methods to calculate the coefficient of reproducibility.

In terms of diagnostic discriminatory accuracy against participants’ self-reported history of skin cancer, the best performing subscale was the phenotype subscale (AUC 0.72), followed by the sun exposure behaviour subscale (AUC 0.62). The discriminatory accuracy of the phenotype subscale was higher than that reported by Cho [48], with an AUC of 0.62 (95% CI, 0.58-0.65), and slightly higher than that reported by Vuong [57], with an AUC of 0.70 (95% CI, 0.67-0.73). This result suggests that the phenotype and sun exposure behaviour subscales differentiate well between people who did or did not have melanoma or non-melanoma skin cancer in the past. In contrast, the sun protection behaviour subscale showed no discriminant ability, which may be caused by the lack of evidence of unidimensionality or may reflect a true independence between sun protection use and skin cancer, at least in a high-UVR environment such as Queensland. Further investigation of items that better fit the construct is needed.

While previous measures, including the QSkin questionnaire, have shown excellent reliability and predictive ability (Chapter 5), the SunAus scale had similar discriminatory performance to an existing model [58], and this study shows improvement in several aspects of the scale. First, the SunAus scale had more content coverage (more items) than previous measures. Second, using theta scores as a personal risk score has advantages over risk scores based on odds ratios [12], as it gives the exact location of each individual, and of each item, on the skin cancer risk continuum. Third, SunAus scale scores can be compared across studies through methods such as scale linking and equating, once common (anchor) items across studies are established. Finally, the SunAus scale can predict risk with smaller measurement error while using fewer items by incorporating computer adaptive testing approaches, which select the best performing items in sequence depending on a person’s responses to preceding items, as demonstrated in a previous study using QSkin data [55]. This approach can reduce respondents’ burden and be more economical to implement.
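The item-selection step of a computer adaptive test can be sketched as follows: given the current estimate of a person's location, administer the remaining item that is most informative at that location. The example assumes dichotomous Rasch items and a maximum-information rule, which is one common choice and not necessarily the rule used in the QSkin simulation; the item codes and difficulties are hypothetical.

```python
import math

def information(theta, b):
    """Fisher information of a dichotomous Rasch item at `theta`."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def next_item(theta, item_bank, administered):
    """Return the code of the not-yet-administered item with maximum information."""
    candidates = {code: b for code, b in item_bank.items() if code not in administered}
    return max(candidates, key=lambda code: information(theta, candidates[code]))

# Hypothetical item bank: item code -> difficulty in logits.
item_bank = {"ITEM_A": -1.0, "ITEM_B": 0.5, "ITEM_C": 2.0}

current_theta = 0.4  # provisional person estimate after earlier responses
first = next_item(current_theta, item_bank, administered=set())
second = next_item(current_theta, item_bank, administered={first})
print(first, second)
```

In a full adaptive test the person estimate would be updated after each response and the loop would stop once the standard error falls below a chosen threshold; both steps are omitted here for brevity.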

Our study had several limitations. First, we used self-reported skin cancer status as

the outcome variable to examine diagnostic discrimination accuracy of the scale.

This can be improved in the future by conducting prospective studies with objective



outcome data from clinics or cancer registries. Second, due to time and funding

limitations, we did not validate our scale against objective measures such as

dosimeter UV readings or sunscreen swabbing. Future research should integrate

these objective measures in the validation process. Third, the sun protection

behaviour domain showed lack of evidence of unidimensionality. This finding means

that there is at least one other dimension measured by the scale, which could not be

identified in this study. Thus, future studies will need to compare the subscales against objective outcomes such as skin colour spectrometry or sunscreen swabbing and determine better items that fit with the sun protection domain [42, 59]. Fourth, even though our sample size was large (n=1,177), it consisted of Queensland residents only. We expect that, with a larger sample from more diverse geographic locations, we would have observed larger variation in phenotype, sun exposure or sun protection behaviours. Lastly, although beyond the scope of this study, testing the

invariance of the scale factor structure between different groups would allow us to

further refine the scale and provide evidence of its applicability in different

populations.

CONCLUSIONS

This work presented a comprehensive set of research studies culminating in the

development of a new SunAus skin cancer risk scale measuring phenotype, sun

exposure behaviour and sun protection behaviours based on the best items selected

from previous scales. The scale can serve as the framework to develop an

international standard measurement tool for skin cancer risk assessment, and could

also be used to develop a computer adaptive test for use in research and public health

practice.

Competing interests:

The authors declare they have no competing interests.

Acknowledgements

Ngadiman Djaja is supported by the National Health and Medical Research Council

of Australia (NHMRC) CRESH PhD scholarship. The authors are deeply grateful for

the support by Associate Professor Peter Newcombe, Amanda Weaver and QUT

Media staff during data collection.



Supplementary content

Scoring of the Skin Cancer Risk Scale

A score conversion table can be used to convert a raw score to a Rasch (theta) score. A response in the lowest category scores 0, and each subsequent category scores one additional point up to the last category. The maximum score for each item depends on the number of categories in that item. To use the conversion table, sum the item scores within each scale and look up the corresponding theta score in the relevant table.
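The scoring procedure amounts to a sum followed by a table lookup. The sketch below shows this with a handful of rows copied from the phenotype conversion table (Table 6.7); the dictionary name, the function name and the example item scores are illustrative assumptions, and a real implementation would load the complete table.

```python
# Summed score -> (theta, standard error); a few rows copied from Table 6.7.
PHENOTYPE_TABLE = {
    0: (-6.21624, 1.73465),
    15: (-0.99712, 0.34378),
    23: (0.03283, 0.39660),
    30: (1.22125, 0.37517),
    43: (4.76171, 1.49072),
}

def raw_to_theta(item_scores, table):
    """Sum the item scores and look up the matching (theta, SE) pair."""
    raw = sum(item_scores)
    if raw not in table:
        raise KeyError(f"summed score {raw} is not in this conversion table excerpt")
    return table[raw]

# Hypothetical item scores for one respondent (lowest category of each item scores 0).
item_scores = [3, 4, 2, 5, 1]
theta, se = raw_to_theta(item_scores, PHENOTYPE_TABLE)
print(theta, se)
```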



Table 6.7: Supplement 1: Table for conversion of phenotype scale summed item scores to Rasch measures

Score Theta SE
0 -6.21624 1.73465
1 -4.66725 1.04120
2 -3.82164 0.80700
3 -3.26776 0.67548
4 -2.86424 0.58914
5 -2.55017 0.52733
6 -2.29540 0.48076
7 -2.08246 0.44491
8 -1.89986 0.41707
9 -1.73950 0.39544
10 -1.59558 0.37847
11 -1.46360 0.36564
12 -1.34021 0.35616
13 -1.22274 0.34956
14 -1.10901 0.34552
15 -0.99712 0.34378
16 -0.88533 0.34420
17 -0.77208 0.34668
18 -0.65566 0.35116
19 -0.53440 0.35754
20 -0.40656 0.36572
21 -0.27038 0.37533
22 -0.12428 0.38596
23 0.03283 0.39660
24 0.20112 0.40591
25 0.37924 0.41213
26 0.56326 0.41321
27 0.74632 0.40817
28 0.91992 0.39812
29 1.07829 0.38614
30 1.22125 0.37517
31 1.35226 0.36688
32 1.47564 0.36187
33 1.59523 0.36032
34 1.71454 0.36239
35 1.83616 0.36838
36 1.96309 0.37884
37 2.09878 0.39526
38 2.24931 0.42010
39 2.42589 0.45844
40 2.65208 0.51947
41 2.98134 0.62728
42 3.54202 0.84140
43 4.76171 1.49072



Table 6.8: Supplement 2: Table for conversion of sun exposure behaviour scale summed item scores to Rasch measures

Score Theta SE
0 -5.27595 1.47207
1 -4.11926 0.87104
2 -3.54991 0.69122
3 -3.15451 0.59780
4 -2.84381 0.53924
5 -2.58325 0.49877
6 -2.35588 0.46874
7 -2.15206 0.44553
8 -1.96578 0.42702
9 -1.79311 0.41187
10 -1.63131 0.39924
11 -1.47840 0.38851
12 -1.33292 0.37926
13 -1.19374 0.37130
14 -1.05998 0.36432
15 -0.93091 0.35820
16 -0.80593 0.35284
17 -0.68450 0.34817
18 -0.56613 0.34415
19 -0.45037 0.34075
20 -0.33677 0.33795
21 -0.22495 0.33571
22 -0.11446 0.33405
23 -0.00489 0.33294
24 0.10415 0.33236
25 0.21306 0.33230
26 0.32222 0.33270
27 0.43197 0.33354
28 0.54265 0.33474
29 0.65452 0.33624
30 0.76778 0.33793
31 0.88249 0.33971
32 0.99858 0.34156
33 1.11579 0.34341
34 1.23375 0.34539
35 1.35211 0.34779
36 1.47086 0.35104
37 1.59051 0.35581
38 1.71241 0.36286
39 1.83890 0.37314
40 1.97354 0.38791
41 2.12162 0.40891
42 2.29121 0.43880
43 2.49504 0.48180
44 2.75405 0.54554
45 3.10653 0.64666
46 3.64318 0.83571
47 4.77595 1.44204



Table 6.9: Supplement 3: Table for conversion of sun protection behaviour scale summed item scores to Rasch measures

Score Theta SE
0 -3.57570 1.42140
1 -2.45271 0.80219
2 -1.95418 0.61143
3 -1.64207 0.51348
4 -1.41526 0.45389
5 -1.23512 0.41363
6 -1.08340 0.38496
7 -0.95034 0.36376
8 -0.83016 0.34772
9 -0.71939 0.33520
10 -0.61554 0.32553
11 -0.51687 0.31819
12 -0.42221 0.31264
13 -0.33047 0.30872
14 -0.24074 0.30626
15 -0.15202 0.30524
16 -0.06346 0.30557
17 0.02586 0.30724
18 0.11692 0.31027
19 0.21070 0.31464
20 0.30823 0.32043
21 0.41061 0.32743
22 0.51871 0.33576
23 0.63339 0.34532
24 0.75536 0.35614
25 0.88526 0.36860
26 1.02403 0.38301
27 1.17340 0.40040
28 1.33656 0.42235
29 1.51930 0.45142
30 1.73192 0.49183
31 1.99349 0.55155
32 2.34286 0.64880
33 2.87682 0.83638
34 4.01705 1.44694



Each 'X' represents 7.9 cases

The labels for thresholds show the levels of item, and step, respectively

Figure 6.6: A1. Sun exposure behaviours scale item map



Each 'X' represents 7.5 cases

The labels for thresholds show the levels of item, and step, respectively

Figure 6.7: A2. Sun protection behaviours scale item map



REFERENCES

1. American Educational Research Association (AERA), American

Psychological Association (APA), and National Council on Measurement in

Education (NCME), The Standards for Educational and Psychological

Testing. 1999.

2. Olsen J. Epidemiology deserves better questionnaires. International Journal

of Epidemiology. 1998; Dec 1; 27(6):935.

3. Wilcox AJ. The quest for better questionnaires. American journal of

epidemiology. 1999; Dec 15; 150(12):1261-2.

4. Cargill J, et al. Validation of brief questionnaire measures of sun exposure

and skin pigmentation against detailed and objective measures including

vitamin D status. Photochemistry and photobiology. 2013; Jan 1; 89(1):219-

26.

5. O’Riordan DL, Glanz K, Gies P, Elliott T. A Pilot Study of the Validity of

Self‐reported Ultraviolet Radiation Exposure and Sun Protection Practices

Among Lifeguards, Parents and Children. Photochemistry and photobiology.

2008; May 1; 84(3):774-8.

6. Shoveller JA, Savoy DM, Roberts RE. Sun protection among parents and children at freshwater beaches. Canadian Journal of Public Health/Revue Canadienne de Santé Publique. 2002; Mar 1;146-8.

7. O’Riordan DL, Steffen AD, Lunde KB, Gies P. A day at the beach while on

tropical vacation: sun protection practices in a high-risk setting for UV

radiation exposure. Archives of dermatology. 2008; Nov 17; 144(11):1449-55.

8. Youl PH, Soyer HP, Baade PD, Marshall AL, Finch L, Janda M. Can skin

cancer prevention and early detection be improved via mobile phone text

messaging? A randomised, attention control trial. Preventive medicine. 2015;

Feb 28; 71:50-6.

9. Hillhouse J, Turrisi R, Jaccard J, Robinson J. Accuracy of self-reported sun

exposure and sun protection behavior. Prevention Science. 2012; Oct 1;

13(5):519-31.

10. Glanz K, et al. Measures of sun exposure and sun protection practices for

behavioral and epidemiologic research. Archives of Dermatology. 2008; Feb

1; 144(2):217-22.

11. National Human Genome Research Institute. PhenX Measure : Skin Cancer

2010 [cited 2015 1 June ]; Available from:

https://www.phenxtoolkit.org/toolkit_content/PDF/PX170601.pdf.

12. Quereux G, et al. Development of an individual score for melanoma risk.

European Journal of Cancer Prevention. 2011; May 1; 20(3):217-24.

13. Mar V, Wolfe R, Kelly JW. Predicting melanoma risk for the Australian

population. Australasian Journal of Dermatology. 2011; May 1; 52(2):109-

16.



14. Fortes C, et al. Identifying individuals at high risk of melanoma: a simple

tool. European Journal of Cancer Prevention. 2010; Sep 1; 19(5):393-400.

15. Usher-Smith JA, Emery J, Kassianos AP, Walter FM. Risk prediction models

for melanoma: a systematic review. Cancer Epidemiology Biomarkers &

Prevention. 2014; Jun 3:cebp-0295.

16. Schulz W. Validating Questionnaire Constructs in International Studies: Two

Examples from PISA 2000. 2003.

17. Gonzalez EJ, Galia J, Li I. Scaling methods and procedures for the TIMSS

2003 mathematics and science scales. TIMSS. 2003; 252-73.

18. Chakravarty EF, Bjorner JB, Fries JF. Improving patient reported outcomes

using item response theory and computerized adaptive testing. The Journal of

Rheumatology. 2007; Jun 1; 34(6):1426-31.

19. Flynn KE, Dombeck CB, DeWitt EM, Schulman KA, Weinfurt KP. Using

item banks to construct measures of patient reported outcomes in clinical

trials: investigator perceptions. Clinical Trials. 2008; Dec 1; 5(6):575-86.

20. Cook LL, Eignor DR. IRT equating methods. Educational measurement:

Issues and practice. 1991; Sep 1; 10(3):37-45.

21. Teresi JA. Different approaches to differential item functioning in health

applications: Advantages, disadvantages and some neglected topics. Medical

care. 2006; Nov 1; 44(11):S152-70.

22. Dodd BG, De Ayala RJ, Koch WR. Computerized adaptive testing with

polytomous items. Applied psychological measurement. 1995. 19(1): p. 5-22.

23. Elhan AH, Öztuna D, Kutlay Ş, Küçükdeveci AA, Tennant A. An initial

application of computerized adaptive testing (CAT) for measuring disability

in patients with low back pain. BMC Musculoskeletal Disorders. 2008; Dec

18;9(1):1.

24. Ware JE Jr, et al. Applications of computerized adaptive testing (CAT) to the

assessment of headache impact. Quality of Life Research. 2003; Dec

1;12(8):935-52.

25. Aitken JF, Elwood JM, Lowe JB, Firman DW, Balanda KP, Ring IT. A

randomised trial of population screening for melanoma. Journal of Medical

Screening. 2002; Mar 1; 9(1):33-7.

26. DiSipio T, et al. The Queensland cancer risk study: behavioural risk factor

results. Australian and New Zealand journal of public health. 2006; Aug 1;

30(4):375-82.

27. Olsen CM, et al. Cohort profile: the QSkin sun and health study. International

journal of epidemiology. 2012; Aug 1; 41(4):929-i.

28. Brodie AM, et al. The AusD Study: a population-based study of the

determinants of serum 25-hydroxyvitamin D concentration across a broad

latitude range. American journal of epidemiology. 2013; Mar 22; kws322.



29. Linacre JM. Category, step and threshold: definitions & disordering. Rasch

measurement transactions. 2001; 15(1):794.

30. Andrich D. An expanded derivation of the threshold structure of the

polytomous rasch model that dispels any “threshold disorder controversy”.

Educational and Psychological Measurement. 2013; Feb 1;73(1):78-124.

31. Adams RJ, Wu ML, Wilson M. The Rasch rating model and the disordered

threshold controversy. Educational and Psychological Measurement. 2012;

Aug 1; 72(4):547-73.

32. Collins D. Pretesting survey instruments: an overview of cognitive methods.

Quality of life research. 2003; May 1; 12(3):229-38.

33. Conrad KJ, Wilson M. Constructing measures: An item response modeling

approach. Erlbaum Associates Mahwah, NJ. Evaluation and Program

Planning. 2005; Nov 30; 28(4):433-4.

34. Bond T, Fox CM. Applying the Rasch model: Fundamental measurement in

the human sciences. Routledge. 2015; Jun 5.

35. de Ayala RJ. An introduction to polytomous item response theory models.

Measurement and evaluation in Counseling and Development. 1993; Jan.

25(4): p. 172.

36. Masters GN. A Rasch model for partial credit scoring. Psychometrika. 1982;

Jun 1; 47(2):149-74.

37. Wu M, et al, ACER ConQuest Version 2.0 manual : Generalised Item

Response Modelling Software. ACER Press: Victoria, Australia. 2007.

38. Linacre JM. A user’s guide to WINSTEPS MINISTEP Rasch-model computer

programs. Chicago IL: Winsteps. com. 2006.

39. Glanz K, et al. Validity of self-reported solar UVR exposure compared with

objectively measured UVR exposure. Cancer Epidemiology Biomarkers &

Prevention. 2010; Dec 1; 19(12):3005-12.

40. Glanz K, Schoenfeld E, Weinstock MA, Layi G, Kidd J, Shigaki DM.

Development and reliability of a brief skin cancer risk assessment tool.

Cancer detection and prevention. 2003; Dec 31; 27(4):311-5.

41. O'Riordan DL, et al. Validity of covering-up sun-protection habits:

association of observations and self-report. Journal of the American Academy

of Dermatology. 2009; May 31; 60(5):739-44.

42. Glanz K, et al. Validity of self-reported sunscreen use by parents, children,

and lifeguards. American journal of preventive medicine. 2009; Jan 31;

36(1):63-9.


Chapter 7: Discussion

Questionnaires have frequently been used as a preferred method of measuring skin cancer-related risk factors because they are widely considered to be efficient, convenient and economical. Most of these questionnaires, however, were developed using classical psychometric methods, and many do not appear to have been rigorously tested according to current psychometric standards. As a result, few previous studies could be found that investigated the psychometric properties of the questions to obtain precise estimates of the underlying latent trait (Table 1.1, pages 5-6). This thesis has provided a general introduction to item response theory (IRT), presented an IRT framework that could be used in skin cancer research, and aimed to familiarise clinicians and researchers in the skin cancer area with the use of IRT to develop questionnaires and test their suitability for the research question and target group. The main findings of each chapter of this thesis are summarised below, and their implications are discussed. Methodological considerations and suggestions for future studies are detailed in the last section of this discussion chapter.

SUMMARY OF THE MAIN FINDINGS

Following the introduction and presentation of the IRT framework in Chapter 1, a Rasch Rating Scale Model was applied in Chapter 2 (page 38) to demonstrate that it is possible to calibrate a skin self-examination attitude scale using an IRT approach.

This was one of the first studies worldwide to use IRT for skin cancer related attitude

assessment.

The results show the skin self-examination attitude scale is a brief, useful, and

reliable tool for assessing attitudes towards skin self-examination, thereby

legitimising its use in a population of men 50 years or older. It was also demonstrated that the scale requires additional items measuring positive skin self-examination attitudes to cover the full range of the latent construct.

In Chapter 3 (page 56), IRT was used to assess the potential impact of

self-reported sun protection and sun exposure behaviour change due to concern about

vitamin D on skin cancer risk at different latitudes in Australia. Visualising each

item’s location on the underlying latent construct of skin cancer risk enabled the

quantification of the impact that a potential change in people’s behaviours due to

concern about vitamin D deficiency may have on skin cancer risk.

In Chapter 4 (page 82), IRT was applied to assess whether it is possible to measure

phenotypes and behaviours related to skin cancer more efficiently. Using a computer

adaptive test simulation approach demonstrated that providing the questions in a

computer adaptive way can reduce participants’ burden by up to 66% compared to

non-adaptive testing, while retaining excellent measurement precision.

In Chapter 5 (page 105), data from a large prospective cohort study of skin cancer risk in Queensland (the QSkin Sun and Health Study) were evaluated. Questions measuring skin cancer risk factors such as phenotype, sun exposure behaviours and sun protection behaviours were calibrated using the Rasch partial credit model. The phenotype subscale was found to be a good predictor of future development of non-melanoma skin cancer (0.75); however, the sun exposure (0.53) and sun protection behaviour (0.49) scales had lower explanatory power.

Based on the best items from the studies used in Chapters 3-5, Chapter 6 (page 121) presented the development and evaluation of a new proposed scale for measuring skin cancer risk consisting of three subscales: phenotype, sun exposure behaviours and sun protection behaviours. The results show that the scale provides

adequate content coverage for all respondents. They also show moderate to high

correlation with existing validated questionnaires, moderate to high internal

consistency, and stability for all subscales. Only the phenotype (0.72) and sun

exposure behaviour (0.62) subscales differentiated well and moderately well,

respectively, between people who self-reported that they did or did not have non-

melanoma skin cancer in the past, while the sun protection scale (0.36) had little

association with skin cancer status.

DISCUSSION OF THE MAIN FINDINGS

The following section discusses how IRT was applied in this thesis.

7.2.1 Item Response Theory as a tool for evaluating psychometric properties of a questionnaire

Until now, in health research, classical test theory (CTT) has predominantly been

used as a method for evaluating the qualities of a questionnaire, and this also applies


to many skin cancer questionnaires.23,61,160,161 Two indicators are usually derived

through this approach: item-total correlation and Cronbach’s alpha reliability;25,40,162

some studies have also assessed the validity of self-reported summary scales against

objective measurements, such as sunscreen cotton swabbing,46,163 physician clinical

examination,63 or UVR dosimetry,24,164 by correlating the overall scale score with the

outcome, for example, sunscreen yes/no. However, these indicators do not provide

information about the discriminative value of each individual question. For example,

item-total correlation only allows the investigator to assess how strongly each item

correlates with the scale total score, and assumes each question contributes equally to

the total score. An item response theory approach allows a fuller evaluation of

the psychometric properties of a questionnaire, including how well each item

discriminates between people with different skin cancer risk, whether those with higher

risk use the relevant response categories of the item as expected, the amount of

information each item provides, its standard error, how well it fits the model, and

whether it is prone to differential item functioning (DIF).
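
For comparison, the two CTT indicators named above can be computed in a few lines. This is a minimal sketch rather than the analysis code used in this thesis; the function names and the small response matrix are hypothetical and serve only as an illustration.

```python
from statistics import mean, pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_variances = [pvariance([r[i] for r in rows]) for i in range(k)]
    total_variance = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(rows, i):
    """Correlation of item i with the sum of the remaining items."""
    item = [r[i] for r in rows]
    rest = [sum(r) - r[i] for r in rows]
    return pearson(item, rest)


# Hypothetical responses of five people to four Likert items (scored 1-5).
responses = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [4, 4, 5, 4],
    [5, 5, 4, 5],
]

alpha = cronbach_alpha(responses)
item_total = [corrected_item_total(responses, i) for i in range(4)]
```

Both statistics summarise an item only in relation to the total score of this particular sample; neither places the item on the same metric as the respondents.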

Content coverage and item targeting.

Item-person maps (see Figure 2.1 in Study 1, page 48, and Figures

5.1, 5.2, 5.3 in Study 4, pages 113-115) can show whether the scale covers a

wide range of content area, whether items have good content targeting (i.e., the items

are distributed evenly across the latent construct continuum), and whether there is

evidence of ceiling or floor effects. This can only be done using IRT, where the item

and person parameters are calibrated on the same metric.49,165 Using this approach in

Study 1 showed that more items were required to measure high skin self-examination

attitude, as the current eight item scale contains few items covering that area of the

latent trait.149

Item location

The item map and item location (b parameter) provide an estimate of each item’s

difficulty, or likelihood of being endorsed by people with different levels of the underlying trait. For

example, Study 1 showed that for item 3 “Checking my skin regularly is a priority

for me”, b = 0.54. This means that only people who are quite aware of the importance

of skin self-examination are likely to endorse this item, compared to item 1 “It

is important to check my skin for skin cancer even if I have no symptom”, which had

b = -0.58. This additional information about the items cannot be obtained using a

CTT approach.
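
To illustrate what the two item locations imply, the sketch below evaluates the dichotomous Rasch model at a few person locations. This is a simplification and an assumption made for illustration: Study 1 used the Rasch rating scale model for Likert responses, and the person locations (theta values) and function name here are hypothetical.

```python
import math


def endorsement_probability(theta, b):
    """Dichotomous Rasch model: probability of endorsing an item located at b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Item locations reported in Study 1 (logits).
b_item3 = 0.54   # "Checking my skin regularly is a priority for me"
b_item1 = -0.58  # "It is important to check my skin for skin cancer ..."

for theta in (-1.0, 0.0, 1.0):  # hypothetical person locations
    p3 = endorsement_probability(theta, b_item3)
    p1 = endorsement_probability(theta, b_item1)
    print(f"theta={theta:+.1f}  P(item 3)={p3:.2f}  P(item 1)={p1:.2f}")
```

At every person location the item with the higher b value is the less likely of the two to be endorsed, and a person located exactly at b has a probability of 0.5.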

Differential Item Functioning (DIF)

IRT provides a procedure to check whether items function differently for different

subgroups of the population, for example men and women. During instrument

development and validation, it is important to ensure that items are as unbiased as

possible.93,158,166 Where differential item functioning is present and cannot be

avoided, the analyst must then be aware and adjust the analyses accordingly. For

example, Study 1 (reference 149) found there was no significant DIF effect between the

intervention and control group, which means that all items were invariant across

groups.
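
One simple way to screen for DIF under a Rasch model is to calibrate an item separately in each subgroup and compare the two location estimates relative to their standard errors. The sketch below illustrates that idea only. It is not the procedure or software used in Study 1; the estimates are hypothetical, and the 1.96 cut-off is an assumed conventional threshold.

```python
import math


def dif_contrast(b_group1, se_group1, b_group2, se_group2):
    """Difference between two subgroup item locations and its z statistic."""
    contrast = b_group1 - b_group2
    z = contrast / math.sqrt(se_group1 ** 2 + se_group2 ** 2)
    return contrast, z


def flag_dif(z, critical=1.96):
    """Flag an item when |z| exceeds the chosen critical value (assumed 1.96)."""
    return abs(z) > critical


# Hypothetical location estimates (logits) for one item in two groups.
contrast, z = dif_contrast(0.60, 0.12, 0.48, 0.11)
item_shows_dif = flag_dif(z)
```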

7.2.2 Item Response Theory as a tool for developing a new questionnaire.

IRT focuses on the statistical analysis of individual items, in contrast to CTT, where

development efforts focus on the test as a whole, which is intended to be provided in

its entirety for measurement purposes.80,167,168 For each item, IRT provides item

specific information about reliability, as well as unique information called an item

characteristic curve (presented in detail on page 12-14, Chapter 1), which visualises

where along the latent trait continuum the item measures optimally and the amount

of information it provides.169

Information function and standard error

IRT also estimates this information for each item, which allows the investigator to

create targeted item banks using the most informative items for each person. This has

the advantage that each person may answer a completely different set of items, but

still provide an equally accurate estimate of skin cancer risk.
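
The link between item information and the standard error can be sketched as follows, assuming the dichotomous Rasch model, for which item information is P(1 - P) and the standard error is the inverse square root of the summed information. All item and person locations below are hypothetical.

```python
import math


def p_endorse(theta, b):
    """Dichotomous Rasch probability of endorsing an item located at b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Rasch item information P(1 - P); largest (0.25) when theta equals b."""
    p = p_endorse(theta, b)
    return p * (1.0 - p)


def standard_error(theta, locations):
    """Standard error of the trait estimate: 1 / sqrt(summed item information)."""
    return 1.0 / math.sqrt(sum(item_information(theta, b) for b in locations))


# Two hypothetical people answer different items targeted to their own level.
se_low = standard_error(-1.0, [-1.5, -1.0, -0.5])
se_high = standard_error(1.0, [0.5, 1.0, 1.5])

# The same three items are less precise for a person they are not targeted to.
se_mistargeted = standard_error(-1.0, [0.5, 1.0, 1.5])
```

In this sketch the two targeted item sets give the same standard error even though they share no items, while the mistargeted set gives a larger one.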

7.2.3 Use of Computer Adaptive Test to reduce participants’ burden.

As described in the Introduction (pages 19-20), one major advantage of IRT is that it

allows the implementation of adaptive tests that tailor the difficulty of the test to each

individual participant, an advantage that has long been known to education,170-173 and

human resources selection testing,134,174 and is now increasingly applied to

health-related outcome measurement.135,175,176 The simulation modelling in Study 3 (reference 177)

(pages 94-95 Tables 4.1-3) demonstrated that CAT can be successfully applied to


reduce the length of the Skin Cancer Risk Scale by more than 60%. CAT was also

associated with smaller standard errors compared to non-adaptive testing. This thesis

argues that the use of CAT can improve accuracy and reduce the response burden

when assessing skin cancer risk. Interested readers can test an example of the CAT

used in this thesis at the publisher’s website:

(http://www.jmir.org/article/downloadSuppFile/4736/26665).
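
The logic of an adaptive test can be sketched in a few dozen lines. The example below is not the CAT engine used in Study 3. It assumes a dichotomous Rasch model, maximum-information item selection, an expected a posteriori (EAP) estimate under a standard normal prior, and a standard error stopping rule; the item bank, the simulated respondent, and all function names are hypothetical.

```python
import math

GRID = [g / 10.0 for g in range(-40, 41)]  # candidate theta values, -4 to 4 logits


def p_endorse(theta, b):
    """Dichotomous Rasch probability of endorsing an item located at b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Rasch item information P(1 - P)."""
    p = p_endorse(theta, b)
    return p * (1.0 - p)


def eap_estimate(responses):
    """Posterior mean and SD of theta under a standard normal prior.

    responses: list of (b, x) pairs with x = 1 (endorsed) or 0 (not endorsed).
    """
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalised N(0, 1) prior
        for b, x in responses:
            p = p_endorse(t, b)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    theta = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - theta) ** 2 * w for t, w in zip(GRID, weights)) / total
    return theta, math.sqrt(var)


def run_cat(bank, answer, se_target=0.5, max_items=None):
    """Administer items until the SE target or the item limit is reached.

    bank: list of item locations; answer: callable returning 0 or 1 for a location.
    Returns (theta estimate, standard error, number of items used).
    """
    remaining = list(bank)
    limit = len(bank) if max_items is None else max_items
    responses = []
    theta, se = 0.0, float("inf")
    while remaining and len(responses) < limit and se > se_target:
        # Pick the unused item that is most informative at the current estimate.
        b = max(remaining, key=lambda d: item_information(theta, d))
        remaining.remove(b)
        responses.append((b, answer(b)))
        theta, se = eap_estimate(responses)
    return theta, se, len(responses)


# Hypothetical bank of 25 item locations from -3 to 3 logits.
bank = [i * 0.25 - 3.0 for i in range(25)]


def respondent(b):
    """Hypothetical respondent who endorses every item located below 1.0."""
    return 1 if b < 1.0 else 0


theta_hat, se_hat, n_used = run_cat(bank, respondent, se_target=0.6)
```

Because each selected item is the one closest to the current estimate, the test can stop as soon as the chosen precision is reached instead of administering the whole bank.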

THE ASSESSMENT OF SKIN CANCER RISK.

Several self-administered questionnaires have been developed to calculate skin

cancer risk scores, and researchers commonly calculate relative risks or odds ratios

estimated by logistic regression as scoring methods.178-180 A systematic review listed

twenty-five risk models for the prediction of melanoma,181 with 144 possible risk

factors, including 18 different measures of the number of nevi and 26 measures of

UVR exposure. The number of nevi, freckles, history of sunburn, skin colour, and

hair colour were frequently included in the final risk estimation model. All models

had similar discrimination, with area under the curve (AUC) of approximately 0.70 –

0.80,181 which was also similar to the result in Study 3, using the phenotype subscale

score only, which achieved an AUC of 0.72 (page 146, Figure 6.5). A major

weakness of most previous studies has been that only internal validation has been

conducted (using the original development population data set). Only one study180

has been validated in an external population. More recently, two studies attempted to

externally validate the performance of their melanoma risk prediction models. Olsen

and colleagues182 assessed the discriminatory performance of six melanoma

prediction models18,178-180,183,184 by using two independent data sets from The

Epigene185 and the QSkin studies.186 The results showed high discriminatory

performance for the six models with AUC values ranging from 0.73 (95% CI 0.71-

0.75) to 0.93 (95% CI 0.92-0.95). Vuong et al187 developed a melanoma risk

prediction model using the Australian Melanoma Family study,188 and then validated

it externally using four independent population-based studies: the Western Australia

Melanoma Study,189 Leeds Melanoma Case-Control Study,190,191 Epigene-QSkin

study,185,186 and Swedish Women’s Lifestyle and Health Cohort Study.192,193 The

model included hair colour, nevus density, previous melanoma skin cancer, first-

degree family history of melanoma, and lifetime sunbed use. The results showed

high discriminatory performance with AUC statistic of 0.70 (95% CI, 0.67-0.73) and



0.63 (95% CI, 0.60-0.67) to 0.67 (95% CI, 0.65-0.70) for internal validation and

external validation, respectively.
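
The AUC values compared in this section can be read as the probability that a randomly chosen person with the outcome has a higher risk score than a randomly chosen person without it, with ties counted as one half. A minimal sketch of that calculation follows; the scores are hypothetical and the function name is illustrative.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical risk scores (logits) for people with and without skin cancer.
cases = [0.9, 0.4, 1.3, -0.2, 0.7]
controls = [0.1, -0.5, 0.6, -1.0, 0.3, -0.3]

example_auc = auc(cases, controls)
```

A value of 0.5 corresponds to no discrimination and a value of 1.0 to perfect separation of the two groups.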

Similar to the overarching aims of the risk prediction models discussed above, the

primary objective of this study was to apply IRT methods to create a reliable, valid,

and precise tool for assessing skin cancer risk. After extensive preparatory work

using data from several existing studies (page 128-129, Chapter 6), this research

selected 53 items for further testing. It was hypothesised that skin cancer risk could

be measured by three main subscales: phenotype, sun exposure behaviours, and sun

protection behaviours. People with high scores in those scales tend to have a high

probability of getting skin cancer. Each item in the subscales was calibrated using the

Rasch partial credit model and the participant’s risk score was estimated. Scores

from each subscale should be able to discriminate and predict whether the participant

has low or high skin cancer risk. Prospective non-melanoma skin cancer status (Study 4) or

self-reported past skin cancer status (Study 5) was used as the outcome variable. This research

increased the item bank, with good content coverage, from the 29 items of the QSkin

Study to 53 items. The new scales showed moderate to high correlation with existing

tools,54,194 and the discriminatory performance of the phenotype scale with AUC for

self-reported past melanoma of 0.72 (95% CI, 0.65-0.78) showed similar

discrimination compared to previous studies.181,182,187 While it is difficult to compare

results from different studies, as researchers have developed and used different

questions, this new scale has advantages compared to other questionnaires. The new

scale can be used to compare results from different studies through methods called

IRT linking and equating,130,195 and can also be used to create tailored assessment

using computer adaptive tests.177

Among the three subscales, the phenotype subscale was found to have the best

discriminatory performance (AUC = 0.72). In contrast, the sun protection behaviours

scale showed low discriminant performance (AUC = 0.36). This exemplifies a

considerable advantage of IRT allowing the user to determine which of the items

contributes most to the risk prediction score, and highlighting where further

development of suitable questions is required. While this was not attempted by the

previous studies mentioned above,182,187 it is likely a reason for the predominant use

of phenotype items in all risk prediction models.182 The low discriminant diagnostic

power of the sun protection items may be due to:


a) Recall effect: Few studies have examined recall bias in self-reported melanoma

risk factors;196-198 all of them asked about sun exposure recall,199-201 and none

of them investigated sun protection recall bias, for example, reporting more

sun protection methods than actually applied. The recall effect on sun

protection behaviours is suspected in the current study, as all self-report

measures are subject to recall errors.202,203 Future studies should investigate

how accurately people can remember their sun protection behaviours and

compare it to observations or objective measures.

b) Social desirability bias: Similar to many constructs commonly investigated in

psychology, self-reported measures in this study may be subject to social

desirability.202,204,205 Participants may feel social pressure and report

favourable responses towards sun protection behaviours. Research into

sunscreen application has found that people apply less sunscreen than the

recommended amount of 2 mg/cm², and practise few sun protection

behaviours in day-to-day life.206,207 A study undertaken by Hall et al208 found

social norms supporting sun safety were associated with more sun protection

habits. The “Slip! Slop! Slap!” sun safety campaign may create social

desirability bias on people’s behaviour towards sun protection in Australia. It

is one of the most successful health campaigns in Australia’s history and was

launched by the Cancer Council in 1981.209 The objective of this campaign

was to reduce population exposure to sunlight and increase sun protection to

reduce the burden of skin cancer in Australia. The data suggest that

campaigns using the “Slip! Slop! Slap!” slogan continue to have a high level

of recall among adolescents.210 One explanation could be that this

campaign made people aware of the risk of sun exposure and the need to

protect themselves when outside,211 leading to people reporting changing

their behaviours, including wearing hats, sunscreen, and protective

clothing.212 The effectiveness of this campaign is also shown by nine cross-

sectional surveys from 1987 to 2002,213 that investigated weekend sun

protection and sunburn in Australia and their association with SunSmart

advertising. The studies found a trend of improvement of sun-protection

behaviours compared with the period prior to the launch of the campaign.

Another online survey by the melanoma genetics consortium (GenoMEL)


that consisted of 12 countries (Australia, Germany, Israel, Italy, Latvia, the

Netherlands, Poland, Slovenia, Spain, Sweden, the United Kingdom, and the

United States) reported that Australians had the highest use of sun protection

compared with all other countries.214 Despite this, sunburn prevalence is still

high, comparable to rates 20 years ago, indicating that people may over-report

their sun protection use. If the social desirability bias is high in the current

study, this would mean that these questions cannot be used to discriminate

between people at high and low risk of skin cancer. These hypotheses require further

examination and more studies are needed.

c) All previous risk prediction models for melanoma182,215,216 have focused

mainly on phenotypic factors, such as freckles, number of nevi, hair colour

and skin colour; none included sun protection behaviours. Furthermore, the

validation studies of self-reported sun protection behaviours commonly

assessed criterion validity only, mainly comparing against objective

sunscreen presence,46,217,218 observation or sun-related diaries.219,220 None of

these studies have examined the predictive validity of their measure using

skin cancer status as the outcome variable. It is possible that the low

discriminatory performance of the sun protection subscale in the current

study was caused by a lack of association between non-phenotypic

factors and skin cancer risk. The association may also be mediated

by other variables such as lifestyle,221 beliefs, knowledge, attitude,222-225 and

latitude not measured in this thesis;169 further investigation is required to

confirm this hypothesis.

METHODOLOGICAL CONSIDERATIONS AND FUTURE STUDIES

Relatively few studies in public health, especially in skin cancer research, have used

IRT approaches to develop and evaluate the psychometric properties of their

questionnaires. An important focus of this thesis was to investigate whether using

such approaches would provide benefit for public health researchers, practitioners,

and the general public. This has been partially fulfilled with the finding that using a

skin cancer risk scale in a computer adaptive test mode could be used with high

precision to identify people at risk, and could reduce the response burden. The

research also demonstrated that further work is required to improve the precision of

sun protection behaviours measurement.


7.4.1 Limitations of the research

1. Sample representativeness: Studies 1 to 4 used data from existing

studies, most with large and reasonably representative samples.

Although Study 5 had a large number of participants (n=1,177), it

collected its data based on a convenience sample and most of the

participants were from the Queensland metropolitan area. Therefore,

results may not be representative of the general adult Australian

population, which was 49.4% male and 50.6% female, whereas this study had

more females (76.3%) than males (23.1%). A few studies investigated the

effect of gender and found that women respond to web surveys at higher

rates than men.226,227 This phenomenon has been previously reported by Sax

et al.226 Future studies will need to replicate the findings of this thesis, and

obtain a diverse sample to test the measurement invariance of the scale

factor structure between different groups. This future work would refine

the scale and provide evidence of its applicability in a representative

population. It is also recommended that an international sample be

recruited, so that these items can be tested for global use.

2. The effects of item sequence (position) in the questionnaire: A number

of studies showed that item difficulty (b parameters) can be influenced by

the order of items in a questionnaire (or the effect of changing item

position) in personality and educational assessment.228-232 In this study,

items were always presented in the following order: phenotype items, sun

exposure behaviours items, and sun protection behaviour items. Effects of

order of presentation, or questionnaire length, especially fatigue, rather

than the items themselves, may have led to the lower discriminative ability

of the sun protection behaviours items. The question of whether subscales

positioned later in a study are answered more uniformly than those

positioned near the beginning was investigated by Galesic, who provided

evidence for such a phenomenon.233 Future studies need to consider effects

of the positioning of items, both for online and paper pencil presentation of

questionnaires.

3. Differences in presentation to locally recruited and online panel: In

this study, a number of items were presented simultaneously on each page


for locally recruited participants, as opposed to the online panel in which

each item was presented one at a time (to ensure data synchronisation with

the third-party panel database). No previous study has investigated the

effect that such a difference in online presentation may have on

participants’ answer patterns. Results from studies that assess the

comparability of a paper and computer version of a questionnaire are

inconsistent,234-237 with some affirming234,235,237 and others denying236 a

difference in answer.

4. Advanced Item Response Theory models: As this study aimed to

demonstrate that IRT is useful in skin cancer risk assessment broadly, the

most commonly used models in IRT were applied so that they would also

be easily accessible for other researchers interested in using this approach.

Psychometricians are increasingly using more advanced statistics,

including multidimensional238-241 and multilevel (hierarchical)

models.242,243 These methods allow for the inclusion of interaction terms

between multiple risk factors and other variables, such as group and time-

specific effects in the model, and should be applied to skin cancer risk self-

reported data in future studies.

5. Longitudinal study design: The cross-sectional study design in Study 5

did not allow for analysis of various aspects of item performance, such as

item parameter stability, item parameter drift, or predictive validity for

future development of melanoma or non-melanoma skin cancers. A

longitudinal study is required to provide more detail into these various

aspects of measurement issues in the future. Furthermore, a longitudinal

study would allow assessment of sensitivity to change of the skin cancer

risk scale compared to an external anchor, such as objectively measured skin

cancer.

6. The use of objective measurements for validation: A strength of this

study was the availability of prospective skin cancer diagnosis, which was

used in Study 4 to determine the predictive value of the three QSkin

phenotype, sun protection, and sun exposure subscales. Several previous

studies have compared self-reported UVR exposure and sun protection

behaviours scales with objective methods such as UVR dosimeters27,164,244-246

and sunscreen cotton swab results;163 however, this was not conducted in

the current study due to time and budget limitations. Future iterations of

the AusSun scale can be improved by incorporating objective methods and

obtaining prospective clinical skin cancer status from the cancer registry or

Medicare link database.

7.4.2 Suggestions for Future Implementation

This dissertation concludes with suggestions for future implementation of IRT

approaches in skin cancer-related research.

1. Given the significant reduction in response burden through the CAT

presented in Chapter 4, the development of a native app is suggested

(which is installed directly onto the smart phone and can work, in most

cases, without internet connection) for Android and iOS, instead of using

web-based apps for skin cancer risk assessment. The benefits of a native

app are that it can work independently of the browser, work much faster

than a web application by optimising the power of the processor, has the

ability to connect to various wearable devices such as a Fitbit®, and can

access the hardware of the mobile phone, such as the light sensor and GPS.

Nowadays, almost all people have access to a smartphone, in addition to

personal computers. Mobile apps may attract more people to use the scale

and could then be linked with objective data, such as real-time location

[through the internal Global Positioning System (GPS)] and UVR data, that

could be incorporated in more precise prediction models in the future.

2. Extension of the Computer Adaptive Testing module to other languages.

While the present study tested CAT with good success, the current CAT

module has the limitation of being unsuitable for people using languages

other than English. A multilingual interface could be added in the future to

overcome this limitation.

3. Dissemination of the Item Response Theory framework among skin cancer

researchers. Many people in skin cancer research are not trained in item

response theory; it would therefore be beneficial to familiarise them with

this framework so that they could consider using these modern

psychometric approaches in their own research.


Despite its limitations, this thesis could be a first step towards an international skin

cancer risk item bank. This work was presented and discussed with international

experts at the 3rd International Conference on UV and Skin Cancer Prevention held in

Melbourne in 2015, during a pre-conference workshop about surveillance of skin

cancer risk factors. The workshop was attended by world experts in skin cancer

research, including representatives from the National Cancer Institute, Centers for

Disease Control and Prevention, and Cancer Council Victoria. It was initiated to

highlight the need for greater standardisation of questionnaires used in skin cancer

research, as current practice prohibits comparison of findings across countries. Each

country, and even most studies or population-based surveys currently use slightly

different questions. Many of these questionnaires are used for historical reasons, and

cannot easily be changed, as the performance of individual items and their

contribution to the overall risk estimates is unknown. This lack of standardisation

makes it difficult to compare results across countries. One way to overcome this

problem is to create an International Item Bank for Skin Cancer Risk of items with

known reliability, risk estimates, and discrimination. IRT methods could therefore be

used to link various questionnaires so that their overlap and unique measurement

properties can be explored. This has been achieved successfully in educational

assessment (such as the Trends in International Mathematics and Science Study,247-249

the Progress in International Reading Literacy Study, and Programme for

International Student Assessment)250-252 and work is underway to provide item

databanks in patient outcome assessment (NIH-PROMIS).253-255 This study can be

the first step in developing an international item bank for skin cancer assessment.
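
As an illustration of how anchor items allow separately calibrated questionnaires to be placed on one scale, the sketch below applies a mean-shift linking constant, one simple option when both questionnaires are calibrated with a Rasch model. It is an assumption made for illustration, not a method applied in this thesis; the item locations and names are hypothetical.

```python
from statistics import mean


def mean_shift_constant(anchor_b_base, anchor_b_new):
    """Constant that moves the new calibration onto the base scale."""
    return mean(anchor_b_base) - mean(anchor_b_new)


def link_to_base(b_new, shift):
    """Express new-questionnaire item locations on the base scale."""
    return {item: b + shift for item, b in b_new.items()}


# Hypothetical locations (logits) of three anchor items in two calibrations.
anchors_base = [-0.50, 0.10, 0.70]
anchors_new = [-0.80, -0.20, 0.40]

shift = mean_shift_constant(anchors_base, anchors_new)

# Hypothetical items that appear only in the new questionnaire.
unique_new = {"item_A": -1.0, "item_B": 0.25}
unique_on_base_scale = link_to_base(unique_new, shift)
```

Once the unique items are expressed on the base scale, they can be added to a common item bank alongside the anchor items.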

Similar to the NIH-PROMIS’ roadmap,256 this thesis proposes a roadmap for the

development of an International Skin Cancer Risk Item Bank in Figure 7.1. The most

important step to accomplish this goal is to explore culturally, ethnically, and

linguistically diverse perceptions of skin cancer risk, sun exposure behaviours, and

sun protection behaviours. Studies have shown different patterns of sun exposure and

protection behaviours between cultures;257-260 it is therefore very important to have

an item bank that is linguistically equivalent, culturally relevant, and

psychometrically sound. It should be provided in multiple languages, developing a

standardised skin cancer risk assessment and allowing conduct of cross-cultural skin

cancer research. The 53 items of the SunAus scale developed in the present research

could be the first to be entered into the item database.


Figure 7.1: Roadmap for an International Skin Cancer Risk Item Bank

[Figure 7.1 shows four stages: (1) Items from participating country: item writing

(anchor and unique items) and item mapping; (2) Field test: questionnaire

administered to a large representative sample in each country; (3) International

Item Bank: IRT-calibrated item bank reviewed for reliability, validity, and

sensitivity; (4) CAT SKIN: CAT version of the International Item Bank for Skin

Cancer Risk.]

* This work needs to be done by an international community/consortium

CONCLUSION

This thesis explored a new approach to skin cancer-related measurement that is

rarely discussed in current health literature. The thesis provided empirical evidence

for the benefits of using IRT in questionnaire development and psychometric testing

in skin cancer research. The studies conducted in this thesis have therefore

demonstrated some of the advantages of IRT in various applications. The large

number of participants in this study and the successful implementation of the

phenotype subscale in discriminating people’s risk make the findings relevant to the

research community and also provide directions for future studies.

This new scale shows improvement compared to previous measures, such as wider

content coverage; being more precise, as it provides the exact location of individuals

and items on the skin cancer risk continuum; and being more economical, as it is able to

predict risk with fewer items. These findings offer not only the initial evidence of the

usefulness of IRT in analysing and developing skin cancer-related questionnaires, but

also demonstrate the potential of its application to reduce participant burden without

compromising measurement precision via the implementation of computer adaptive

testing.

Lastly, IRT is not a universal solution for every assessment problem and it does not

correct problems of biased items or failure to meet predictive ability. IRT is also not a

substitute for classical methods that were influential and remain important. However,

IRT is a valuable tool that can and should be used to increase the quality of

assessment in epidemiological research. More research is required to demonstrate a

greater impact of IRT, especially within the field of skin cancer research and

practice. The major findings and implications for practice should make contributions

to knowledge generation, clinical practice, and policy-related issues in the field of

skin cancer preventive initiatives. In conclusion, the new IRT-based skin cancer risk

scale appears to be a promising tool for the assessment of skin cancer risk and is

recommended for use in other studies.

References

1. Geller AC, Emmons K, Brooks DR, et al. Skin cancer prevention and

detection practices among siblings of patients with melanoma. Journal of the

American Academy of Dermatology. 2003; 49(4):631-638.

2. Hirst NG, Gordon LG, Scuffham PA, Green AC. Lifetime Cost-Effectiveness

of Skin Cancer Prevention through Promotion of Daily Sunscreen Use. Value

in Health. 2012; 15(2):261-268.

3. Kyrgidis A, Tzellos TG, Vahtsevanos K, Triaridis S. New Concepts for Basal

Cell Carcinoma. Demographic, Clinical, Histological Risk Factors, and

Biomarkers. A Systematic Review of Evidence Regarding Risk for Tumor

Development, Susceptibility for Second Primary and Recurrence. Journal of

Surgical Research. 2010; 159(1):545-556.

4. Sánchez G, Nova J, de la Hoz F. Risk Factors for Basal Cell Carcinoma: A

Study From the National Dermatology Center of Colombia. Actas

Dermo-Sifiliográficas (English Edition). 2012; 103(4):294-300.

5. Wright TI, Spencer JM, Flowers FP. Chemoprevention of nonmelanoma skin

cancer. Journal of the American Academy of Dermatology. 2006;

54(6):933-946.

6. Janda M, Kimlin M, Whiteman D, Aitken J, Neale R. Sun protection and low

levels of vitamin D: are people concerned? Cancer Causes Control.

2007; 18(9):1015-1019.

7. Van Der Pols JC, Russell A, Bauer U, Neale RE, Kimlin MG, Green AC.

Vitamin D status and skin cancer risk independent of time outdoors: 11-year

prospective study in an Australian community. Journal of Investigative

Dermatology. 2013; 133(3):637-641.

8. Janda M, Youl P, Bolz K, Niland C, Kimlin M. Knowledge about health

benefits of vitamin D in Queensland Australia. Preventive Medicine. 2010;

50(4):215-216.

9. Youl PH, Janda M, Kimlin M. Vitamin D and sun protection: The impact of

mixed public health messages in Australia. International Journal of Cancer.

2009; 124(8):1963-1970.

10. Jayaratne N, Russell A, van der Pols JC. Sun protection and vitamin D status

in an Australian subtropical community. Preventive Medicine. 2012;

55(2):146-150.

11. Garland CF, Garland FC, Gorham ED, Lipkin M, et al. The Role of Vitamin

D in Cancer Prevention. American Journal of Public Health. 2006;

96(2):252-261.

12. Hughes A, Hoffman J, Hoffman A. Vitamin D and sun exposure: To bare all

or cover up? Expert Review of Dermatology. 2012; 7(6):495-497.

13. Australian Institute of Health and Welfare. Australian Cancer Incidence and

Mortality (ACIM) Books Canberra: Australian Institute of Health and

Welfare. 2012.

14. McGrath J, Kimlin M, Saha S, Eyles D, Parisi A. Vitamin D insufficiency in

south-east Queensland. Medical Journal of Australia. 2001; 174(3):150.

15. Nowson CA, Margerison C. Vitamin D intake and vitamin D status of

Australians. Medical Journal of Australia. 2002; 177(3):149-152.

16. Marks R, Staples M, Giles GG. Trends in non-melanocytic skin cancer

treated in Australia: The second national survey. International Journal of

Cancer. 1993; 53(4):585-590.

17. Fransen M, Karahalios A, Sharma N, English DR, Giles GG, Sinclair RD.

Non-melanoma skin cancer in Australia. Medical Journal of Australia. 2012;

197(10):565-568.

18. MacKie R, Freudenberger T, Aitchison T. Personal risk-factor chart for

cutaneous melanoma. The Lancet. 1989; 334(8661):487-490.



screening program. BMC dermatology. 2008; 8(1):4.

20. Weinstock MA. Assessment of sun sensitivity by questionnaire: validity of

items and formulation of a prediction rule. Journal of clinical epidemiology.

1992; 45(5):547-552.



screening program. BMC Dermatology. 2008; 8(1):1-10.

22. Gillespie H, Watson T, Emery J, Lee A, Murchie P. A questionnaire to

measure melanoma risk, knowledge and protective behaviour: Assessing

content validity in a convenience sample of Scots and Australians. BMC

Medical Research Methodology. 2011; 11(1):123.

23. Morales-Sánchez MA, Peralta-Pedrero ML, Domínguez-Gómez MA.

Validation of a questionnaire to quantify the risk for skin cancer. Gaceta

Medica de Mexico. 2014; (150):409-419.

24. Humayun Q, Iqbal R, Azam I, Khan A, Siddiqui A, Baig-Ansari N.

Development and validation of sunlight exposure measurement questionnaire

(SEM-Q) for use in adult population residing in Pakistan. BMC Public

Health. 2012; 12(1):421.

25. de Troya-Martín M, Blázquez-Sánchez N, Rivas-Ruiz F, et al. Validation of a

Spanish Questionnaire to Evaluate Habits, Attitudes, and Understanding of

Exposure to Sunlight: “The Beach Questionnaire”. Actas

Dermo-Sifiliográficas (English Edition). 2009; 100(7):586-595.

26. Glanz K, Schoenfeld E, Weinstock MA, Layi G, Kidd J, Shigaki DM.

Development and reliability of a brief skin cancer risk assessment tool.

Cancer Detection and Prevention. 2003; 27(4):311-315.

27. Dwyer T, Blizzard L, Gies P, Ashbolt R, Roy C. Assessment of habitual sun

exposure in adolescents via questionnaire--a comparison with objective

measurement using polysulphone badges. Melanoma research. 1996;

6(3):231.

28. McCarty CA. Sunlight exposure assessment: can we accurately assess

vitamin D exposure from sunlight questionnaires? The American Journal of

Clinical Nutrition. 2008; 87(4):1097S-1101S.

29. Crescentini A, Zanolla G. The Evaluation of Mathematical Competency:

Elaboration of a Standardized Test in Ticino (Southern Switzerland).

Procedia - Social and Behavioral Sciences. 2014; 112:180-189.

30. Spooren AIF, Arnould C, Smeets RJEM, Bongers HMH, Seelen HAM.

Improvement of the Van Lieshout hand function test for Tetraplegia using a

Rasch analysis. Spinal Cord. 2013; 51(10):739-744.

31. El-Korashy AF. Applying the Rasch Model to the Selection of Items for a

Mental Ability Test. Educational and Psychological Measurement. 1995;

55(5):753-763.

32. Lee SH. Multidimensional item response theory: A SAS MDIRT macro and

empirical study of PIAT math test [Ph.D.]. Ann Arbor, The University of

Oklahoma; 2007.

33. Tripp MK, Carvajal SC, McCormick LK, et al. Validity and

reliability of the Parental Sun Protection Scales. Health Education Research.

2003; 18(1).

34. Day AK, Wilson C, Roberts RM, Hutchinson AD. The Skin Cancer and Sun

Knowledge (SCSK) Scale: Validity, Reliability, and Relationship to Sun-

Related Behaviors Among Young Western Adults. Health Education &

Behavior. 2014; 41(4):440-448.

35. Staples MP, Elwood M, Burton RC, Williams JL, et al. Non-melanoma skin

cancer in Australia: the 2002 national survey and trends since 1985. Medical

Journal of Australia. 2006; 184(1):6-10.

36. Diffey BL, Norridge Z. Reported sun exposure, attitudes to sun protection

and perceptions of skin cancer risk: a survey of visitors to Cancer Research

UK’s SunSmart campaign website. British Journal of Dermatology. 2009;

160(6):1292-1298.

37. Newman WG, Agro AD, Woodruff SI, Mayer JA. A survey of recreational

sun exposure of residents of San Diego, California. American journal of

preventive medicine. 1996.

38. Tempark T, Chatproedprai S, Wananukul S. Attitudes, knowledge, and

behaviors of secondary school adolescents regarding protection from sun

exposure: a survey in Bangkok, Thailand. Photodermatology,

Photoimmunology & Photomedicine. 2012; 28(4):200-206.

39. Falk M, Anderson CD. Measuring sun exposure habits and sun protection

behaviour using a comprehensive scoring instrument – An illustration of a

possible model based on Likert scale scorings and on estimation of readiness

to increase sun protection. Cancer Epidemiology. 2012; 36(4):e265-e269.

40. Jennings L, Karia PS, Jambusaria-Pahlajani A, Whalen FM, Schmults CD.

The Sun Exposure and Behaviour Inventory (SEBI): Validation of an

instrument to assess sun exposure and sun protective practices. Journal of the

European Academy of Dermatology and Venereology. 2013; 27(6):706-715.

41. Wild D, Eremenco S, Mear I, et al. Multinational Trials—Recommendations

on the Translations Required, Approaches to Using the Same Language in

Different Countries, and the Approaches to Support Pooling the Data: The

ISPOR Patient-Reported Outcomes Translation and Linguistic Validation

Good Research Practices Task Force Report. Value in Health. 2009;

12(4):430-440.

42. Gothwal VK, Bagga DK, Sumalini R. Rasch validation of the PHQ-9 in

people with visual impairment in South India. Journal of Affective Disorders.

2014; 167:171-177.

43. Lundgren-Nilsson Å, Dencker A, Jakobsson S, Taft C, Tennant A. Construct

Validity of the Swedish Version of the Revised Piper Fatigue Scale in an

Oncology Sample—A Rasch Analysis. Value in Health. 2014; 17(4):360-363.

44. Pilatti A, Read JP, Vera BdV, Caneto F, Garimaldi JA, Kahler CW. The

Spanish version of the Brief Young Adult Alcohol Consequences

Questionnaire (B-YAACQ): A Rasch Model analysis. Addictive Behaviors.

2014; 39(5):842-847.

45. American Educational Research Association, American Psychological

Association, National Council on Measurement in Education. Standards for

educational and psychological testing. Amer Educational Research Assn;

1999.

46. Glanz K, McCarty F, Nehl EJ, et al. Validity of Self-Reported Sunscreen Use

by Parents, Children, and Lifeguards. American Journal of Preventive

Medicine. 2009; 36(1):63-69.

47. Hedges T, Scriven A. Young park users' attitudes and behaviour to sun

protection. Global Health Promotion. 2010; 17(4):24-31,90,96.

48. Horsburgh-McLeod GF, Gray AR, Reeder AI, McGee R. Applying Item

Response Theory (IRT) to a suntan attitudes scale. Australasian

Epidemiologist. 2010; 17(1):40.

49. van der Linden WJ, Hambleton RK. Handbook of modern item response

theory. Springer Science & Business Media; 2013.

50. Coory M, Baade P, Aitken J, Smithers M, McLeod GRC, Ring I. Trends for

in situ and invasive melanoma in Queensland, Australia, 1982–2002. Cancer

Causes Control. 2006; 17(1):21-27.

51. Jones WO, Harman CR, Ng AK, Shaw JH. Incidence of malignant melanoma

in Auckland, New Zealand: highest rates in the world. World journal of

surgery. 1999; 23(7):732-735.

52. Green A, Siskind V. Geographical distribution of cutaneous melanoma in

Queensland. The Medical journal of Australia. 1983; 1(9):407.

53. Jennings L, Karia PS, Jambusaria-Pahlajani A, Whalen FM, Schmults CD.

The Sun Exposure and Behaviour Inventory (SEBI): validation of an

instrument to assess sun exposure and sun protective practices. Journal of the

European Academy of Dermatology and Venereology. 2012.

54. Glanz K, Yaroch AL, Dancel M, et al. Measures of sun exposure and sun

protection practices for behavioral and epidemiologic research. Archives of

Dermatology. 2008; 144(2):217-222.

55. Oliveria SA, Saraiya M, Geller AC, Heneghan MK, Jorgensen C. Sun

exposure and risk of melanoma. Archives Of Disease In Childhood. 2006;

91(2):131-138.

56. Armstrong BK. How sun exposure causes skin cancer: an epidemiological

perspective. Prevention of skin cancer: Springer; 2004:89-116.

57. Vu LH, van der Pols JC, Whiteman DC, Kimlin MG, Neale RE. Knowledge

and Attitudes about Vitamin D and Impact on Sun Protection Practices among

Urban Office Workers in Brisbane, Australia. Cancer Epidemiology

Biomarkers & Prevention. 2010; 19(7):1784-1789.

58. Dobbinson S, Wakefield M, Hill D, et al. Children's sun exposure and sun

protection: Prevalence in Australia and related parental factors. Journal of the

American Academy of Dermatology. 2012; 66(6):938-947.

59. Schofield PE, Freeman JL, Dixon HG, Borland R, Hill DJ. Trends in sun

protection behaviour among Australian young adults. Australian and New

Zealand Journal of Public Health. 2001; 25(1):62-65.

60. Bränström R, Kristjansson S, Ullen H, Brandberg Y. Stability of

questionnaire items measuring behaviours, attitudes and stages of change

related to sun exposure. Melanoma research. 2002; 12(5):513-519.

61. Detert H, Hedlund S, Anderson CD, et al. Validation of sun exposure and

protection index (SEPI) for estimation of sun habits. Cancer Epidemiology.

2015; 39(6):986-993.

62. Dusza SW, Oliveria SA, Geller AC, Marghoob AA, Halpern AC. Student-

parent agreement in self-reported sun behaviors. Journal of the American

Academy of Dermatology. 2005; 52(5):896-900.

63. Morze CJ, Olsen CM, Perry SL, et al. Good test-retest reproducibility for an

instrument to capture self-reported melanoma risk factors. Journal of Clinical

Epidemiology. 2012; 65(12):1329-1336.

64. Bond TG, Fox CM. Applying the Rasch model: Fundamental measurement in

the human sciences. Psychology Press; 2013.

65. Fries J, Bruce B, Cella D. The promise of PROMIS: using item response

theory to improve assessment of patient-reported outcomes. Clinical and

experimental rheumatology. 2005; 23(5):S53.

66. Schulz W. Validating Questionnaire Constructs in International Studies: Two

Examples from PISA 2000. Australian Council for Educational

Research, Melbourne, Australia; 2003.

67. Gonzalez EJ, Galia J, Li I. Scaling methods and procedures for the TIMSS

2003 mathematics and science scales. TIMSS. 2003:252-273.

68. Beck CT, Gable RK. Postpartum Depression Screening Scale: development

and psychometric testing. Nursing Research. 2000; 49(5):272-282.

69. Revicki DA, Chen WH, Frank L, Feltner D, Morlock R. Development and

Analysis of Item Response Theory-based Short-form Depression Severity

Scales Based on the HDRS and MADRS. Health Outcomes Research in

Medicine. 2010; 1(2):e111-e122.

70. Levine SZ, Rabinowitz J, Rizopoulos D. Recommendations to improve the

Positive and Negative Syndrome Scale (PANSS) based on item response

theory. Psychiatry Research. 2011; 188(3):446-452.

71. Spence R, Owens M, Goodyer I. Item response theory and validity of the

NEO-FFI in adolescents. Personality and Individual Differences. 2012;

53(6):801-807.

72. Wainer H, Wang X. Using a New Statistical Model for Testlets to Score

TOEFL. Journal of Educational Measurement. 2000; 37(3):203-220.

73. Kingston NM, Dorans NJ. The Feasibility of Using Item Response Theory as

a Psychometric Model for the GRE Aptitude Test. GRE Board Professional

report GREB No. 79-12P. ETS Research Report 82-12. 1982.

74. Kingston N. An Exploratory Study of the Applicability of Item Response

Theory Methods to the Graduate Management Admission Test. Distributed

by ERIC Clearinghouse [Washington, D.C.]; 1985.

<http://www.eric.ed.gov/contentdelivery/servlet/ERICServlet?accno=ED268141>

75. McKinley RL, Kingston NM. Exploring the use of IRT equating for the GRE

subject test in mathematics. Educational Testing Service; 1987.

76. Zara AR. Using Computerized Adaptive Testing to Evaluate Nurse

Competence for Licensure: Some History and Forward Look. Advances in

health sciences education. 1999; 4(1):39-48.

77. Downing SM. Item response theory: applications of modern test theory in

medical education. Medical Education. 2003; 37(8):739-745.

78. Traub RE. Classical test theory in historical perspective. Educational

Measurement: Issues and Practice. 1997; 16(4):8-14.

79. Anastasi A, Urbina S. Psychological testing (7th ed). New York, Upper

Saddle River: Macmillan; 1997.

80. Hambleton RK, Jones RW. Comparison of classical test theory and item

response theory and their applications to test development. Educational

measurement: issues and practice. 1993; 12(3):38-47.

81. Lord FM. Applications of item response theory to practical testing problems.

Routledge; 1980.

82. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of item response

theory. Newbury Park, California: Sage Publications; 1991.

83. Streiner DL. Measure for Measure: New Developments in Measurement and

Item Response Theory. Canadian Journal of Psychiatry. 2010; 55(3):180-

186.

84. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes

measurement in the 21st century. Medical care. 2000; 38(9 Suppl):II28.

85. Molenaar IW. Some Background for Item Response Theory and the Rasch

Model. In: Fischer GH, Molenaar IW, eds. Rasch Models: Foundations,

Recent Developments, and Applications. New York, NY: Springer New York;

1995:3-14.

86. Bock RD. A Brief History of Item Theory Response. Educational


87. Baker FB. The basics of item response theory. ERIC; 2001.

88. Yen WM. The Choice of Scale for Educational Measurement: An IRT

Perspective. Journal of Educational Measurement. 1986; 23(4):299-325.

89. Edelen MO, Reeve BB. Applying item response theory (IRT) modeling to

questionnaire development, evaluation, and refinement. Qual Life Res.

2007; 16(1):5-18.

90. Reeve BB, Fayers P. Applying item response theory modeling for evaluating

questionnaire item and scale properties. Assessing quality of life in clinical

trials: methods of practice. 2005; 2:55-73.

91. Kline T. Psychological testing: A practical approach to design and

evaluation. Sage; 2005.

92. Osterlind SJ, Everson HT. Differential Item Functioning. Thousand Oaks,

CA: SAGE Publications, Inc.; 2010.

93. Walker CM. What's the DIF? Why Differential Item Functioning Analyses

Are an Important Part of Instrument Development and Validation. Journal of

Psychoeducational Assessment. 2011.

94. Abroms L, Jorgensen CM, Southwell BG, Geller AC, Emmons KM. Gender

differences in young adults’ beliefs about sunscreen use. Health Education &

Behavior. 2003; 30(1):29-43.

95. Groenvold M, Bjorner JB, Klee MC, Kreiner S. Test for item bias in a quality

of life questionnaire. Journal of Clinical Epidemiology. 1995; 48(6):805-816.

96. Camilli G, Shepard LA. Methods for identifying biased test items. Sage; 1994.

97. Clauser BE, Mazor KM. Using Statistical Procedures to Identify

Differentially Functioning Test Items. Educational Measurement: Issues and

Practice. 1998; 17(1):31-44.

98. Douglas JA, Roussos LA, Stout W. Item-Bundle DIF Hypothesis Testing:

Identifying Suspect Bundles and Assessing Their Differential Functioning.

Journal of Educational Measurement. 1996; 33(4):465-484.

99. Allalouf A, Hambleton RK, Sireci SG. Identifying the Causes of DIF in

Translated Verbal Items. Journal of Educational Measurement. 1999;

36(3):185-198.

100. Novick MR. The axioms and principal results of classical test theory. Journal

of Mathematical Psychology. 1966; 3(1):1-18.

101. Fan X. Item Response Theory and Classical Test Theory: An Empirical

Comparison of their Item/Person Statistics. Educational and Psychological

Measurement. 1998; 58(3):357-381.

102. Algina J, Crocker L. Introduction to classical and modern test theory. New

York: Wadsworth Publishing; 1986.

103. Degreef E, Buggenhaut JV, eds. Trends in mathematical psychology.

Amsterdam, North-Holland 1984. Mathematical social sciences; No. 13.

104. De Ayala RJ. The theory and practice of item response theory. Guilford Press

New York; 2009.

105. Green JL, Camilli G, Elmore PB. Handbook of complementary methods in

education research. Routledge; 2012.

106. Hays RD, Morales LS, Reise SP. Item Response Theory and Health

Outcomes Measurement in the 21st Century. Medical care. 2000; 38(9

Suppl):II28-II42.

107. Hambleton RK, Swaminathan H. Item response theory: Principles and

applications. Springer Science & Business Media; 2013.

108. Van der Linden WJ, Glas CAW, Interuniversitair Centrum voor

Onderwijskundig Onderzoek. Computerized adaptive testing : theory and

practice. Dordrecht; Boston: Kluwer Academic; 2000.

109. Revicki DA, Cella DF. Health status assessment for the twenty-first century:

item response theory, item banking and computer adaptive testing. Qual Life

Res. 1997; 6(6):595-600.

110. Sygna K, Johansen S, Ruland CM. Recruitment challenges in clinical

research including cancer patients and caregivers. Trials. 2015; 16(1):1-9.

111. Ong AD, Van Dulmen MH. Oxford handbook of methods in positive

psychology. Oxford University Press New York; 2007.

112. Nering ML, Ostini R. Handbook of polytomous item response theory models.

Taylor & Francis; 2011.

113. Ostini R, Nering ML. Polytomous item response theory models. Sage; 2006.

114. Drasgow F, Levine MV, Tsien S, Williams B, Mead AD. Fitting Polytomous

Item Response Theory Models to Multiple-Choice Tests. Applied

Psychological Measurement. 1995; 19(2):143-166.

115. Kolen MJ, Zeng L, Hanson BA. Conditional Standard Errors of Measurement

for Scale Scores Using IRT. Journal of Educational Measurement. 1996;

33(2):129-140.

116. Friborg O, Martinussen M, Rosenvinge JH. Likert-based vs. semantic

differential-based scorings of positive psychological constructs: A

psychometric comparison of two versions of a scale measuring resilience.

Personality and Individual Differences. 2006; 40(5):873-884.

117. Prieto L, Alonso J, Lamarca R. Classical test theory versus Rasch analysis for

quality of life questionnaire reduction. Health and Quality of Life Outcomes.

2003; 1:27.

118. Downey RG, King CV. Missing Data in Likert Ratings: A Comparison of

Replacement Methods. The Journal of General Psychology. 1998;

125(2):175-191.

119. Schaeffer GA, Bridgeman B, Golub-Smith ML, Lewis C, Potenza MT,

Steffen M. Comparability of paper-and-pencil and computer adaptive test

scores on the GRE General Test. ETS Research Report Series. 1998.

120. Wainer H. CATs: Whither and whence. Psicológica: revista de metodología y

psicología experimental. 2000; 21(1):121-134.

121. Jette AM, Haley SM, Tao W, et al. Prospective evaluation of the AM-PAC-

CAT in outpatient rehabilitation settings. Physical Therapy. 2007; 87(4):385-

398.

122. Gardner W, Shear K, Kelleher K, et al. Computerized adaptive measurement

of depression: A simulation study. BMC Psychiatry. 2004; 4(1):13.

123. Wainer H, Dorans NJ, Green BF, et al. Computerized adaptive testing: A

primer. Lawrence Erlbaum Associates, Inc; 1990.

124. Thompson NA, Weiss DJ. A framework for the development of computerized

adaptive tests. Practical Assessment, Research, and Evaluation. 2011; 16(1).

125. Embretson SE, Reise SP. Item Response Theory. Hoboken: Taylor and

Francis; 2013: http://QUT.eblib.com.au/patron/FullRecord.aspx?p=1166563.

126. Boyd AM. Strategies for controlling testlet exposure rates in computerized

adaptive testing systems [3110732]. United States -- Texas, The University of

Texas at Austin; 2003.

127. Orlando M, Sherbourne CD, Thissen D. Summed-score linking using item

response theory: Application to depression measurement. Psychological

Assessment. 2000; 12(3):354-359.

128. Kolen MJ, Brennan RL. Test Equating, Scaling, and Linking. New York:

Springer; 2014.

129. Stocking M, Lord FM. Developing a common metric in item response theory.

Appl Psychol Meas. 1983; 7.

130. Kim SH, Cohen AS. A Comparison of Linking and Concurrent Calibration

Under Item Response Theory. Applied Psychological Measurement. 1998;

22(2):131-143.

131. Schalet BD, Rothrock NE, Hays RD, et al. Linking Physical and Mental

Health Summary Scores from the Veterans RAND 12-Item Health Survey

(VR-12) to the PROMIS® Global Health Scale. Journal of General Internal

Medicine. 2015; 30(10):1524-1530.

132. Schalet BD, Revicki DA, Cook KF, Krishnan E, Fries JF, Cella D.

Establishing a Common Metric for Physical Function: Linking the HAQ-DI

and SF-36 PF Subscale to PROMIS® Physical Function. Journal of General

Internal Medicine. 2015; 30(10):1517-1523.

133. Linn RL. Linking Results of Distinct Assessments. Applied Measurement in

Education. 1993; 6(1):83-102.

134. Kantrowitz TM, Dawson CR, Fetzer MS. Computer Adaptive Testing (CAT):

A Faster, Smarter, and More Secure Approach to Pre-Employment Testing.

Journal of Business and Psychology. 2011; 26(2):227-232.

135. Fayers PM. Applying item response theory and computer adaptive testing: the

challenges for health outcomes assessment. Qual Life Res. 2007; 16(1):187-

194.

136. Rebollo P, Castejon I, Cuervo J, et al. Validation of a computer-adaptive test

to evaluate generic health-related quality of life. Health and Quality of Life

Outcomes. 2010; 8(1):147.

137. Rao CR, Sinharay S. Handbook of statistics: Psychometrics. Vol 26: Elsevier;

2006.

138. Kline P. Handbook of psychological testing. Routledge; 2013.

139. Hambleton RK. Fundamentals of item response theory. Vol 2: Sage

publications; 1991.

140. Groth-Marnat G. Handbook of psychological assessment. John Wiley & Sons;

2009.

141. Hambleton RK. Test score validity and standard-setting methods. Criterion-

referenced measurement: The state of the art. 1980; 80:123.

142. Cizek GJ, Bunch MB. Standard setting: A guide to establishing and

evaluating performance standards on tests. SAGE Publications Ltd; 2007.

143. Sijtsma K, Hemker BT. A taxonomy of IRT models for ordering persons and

items using simple sum scores. Journal of Educational and Behavioral

Statistics. 2000; 25(4):391-415.

144. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of item response

theory. Newbury Park, California: Sage; 1991.

145. Wolfe F, Michaud K, Pincus T. Development and validation of the health

assessment questionnaire II: A revised version of the health assessment

questionnaire. Arthritis & Rheumatism. 2004; 50(10):3296-3305.

146. Lord F. A Theory of Test Scores (Psychometric Monograph No.7).

Richmond, VA: Psychometric Corporation; 1952.

147. Birnbaum A. Some latent trait models and their use in inferring an

examinee's ability. Statistical theories of mental test scores. 1968:395-479.

148. Andrich D. Application of a psychometric rating model to ordered categories

which are scored with successive integers. Applied Psychological

Measurement. 1978; 2:325-359.

149. Djaja N, Youl P, Aitken J, Janda M. Evaluation of a skin self examination

attitude scale using an item response theory model approach. Health and

Quality of Life Outcomes. 2014; 12(1):189.

150. Harris D. Comparison of 1-, 2-, and 3-Parameter IRT Models. Educational

Measurement: Issues and Practice. 1989; 8(1):35-41.

151. DeMars C. Item response theory. Oxford University Press, USA; 2010.

152. Nguyen T, Han HR, Kim M, Chan K. An Introduction to Item Response

Theory for Patient-Reported Outcome Measurement. Patient. 2014;

7(1):23-35.

153. Barnes LLB, Wise SL. The Utility of a Modified One-Parameter IRT Model

With Small Samples. Applied Measurement in Education. 1991;

4(2):143-157.

154. Embretson SE, Hershberger SL. The new rules of measurement: what every

psychologist and educator should know. Mahwah, NJ: Erlbaum Associates;

1999.

155. Paek I, Han KT. IRTPRO 2.1 for Windows (Item Response Theory for

Patient-Reported Outcomes). Applied Psychological Measurement. 2013;

37(3):242-252.

156. Guyer R, Thompson NA. User's manual for XCalibre 4.1 [computer

program]. St. Paul MN: Assessment Systems Corporation; 2011.

157. Teresi JA. Overview of quantitative measurement methods: Equivalence,

invariance, and differential item functioning in health applications. Medical

care. 2006; 44(11):S39-S49.

158. Downing SM, Haladyna TM. Handbook of test development. L. Erlbaum

Mahwah, NJ; 2006.

159. Wilson M. Constructing measures: An item response modeling approach.

Mahwah, New Jersey: Lawrence Erlbaum Associates; 2005.

160. Aygun O, Ergun A. Validity and Reliability of Sun Protection Behavior Scale

among Turkish Adolescent Population. Asian Nursing Research. 2015;

9(3):235-242.

161. Wu S, Ho SC, Lam TP, et al. Development and validation of a lifetime

exposure questionnaire for use among Chinese populations. Scientific

Reports. 2013; 3:2793.

162. Borschmann RD, Cottrell D. Developing the readiness to alter sun-protective

behaviour questionnaire (RASP-B). Cancer Epidemiology. 2009; 33(6):451-

462.

163. O'Riordan DL, Lunde KB, Steffen AD, Maddock JE. Validity of beachgoers'

self-report of their sun habits. Archives of Dermatology. 2006;

142(10):1304-1311.

164. Cargill J, Lucas RM, Gies P, et al. Validation of brief questionnaire measures

of sun exposure and skin pigmentation against detailed and objective

measures including vitamin D status. Photochemistry and Photobiology.

2013; 89(1):219-226.

165. Reid CA, Kolakowsky-Hayner SA, Lewis AN, Armstrong AJ. Modern

Psychometric Methodology: Applications of Item Response Theory.

Rehabilitation Counseling Bulletin. 2007; 50(3):177-188.

166. Hambleton RK. Good practices for identifying differential item functioning.

Medical care. 2006; 44(11):S182-S188.

167. Edelen MO, Reeve BB. Applying item response theory (IRT) modeling to

questionnaire development, evaluation, and refinement. Qual Life Res. 2007;

16:5-18.

168. Hambleton RK. Emergence of Item Response Modeling in Instrument

Development and Data Analysis. Medical Care. 2000; 38(9):II60-II65.

169. Djaja N, Janda M, Lucas RM, et al. Self-Reported Changes in Sun-Protection

Behaviours at different latitudes in Australia. Photochemistry and

Photobiology. 2016.

170. Weiss DJ, Kingsbury GG. Application of Computerized Adaptive Testing to

Educational Problems. Journal of Educational Measurement. 1984;

21(4):361-375.

171. Chalhoub-Deville M, Deville C. Computer Adaptive Testing in Second

Language Contexts. Annual Review of Applied Linguistics. 1999; 19:273-299.

172. Weiss DJ. Computerized adaptive testing for effective and efficient

measurement in counseling and education. Measurement and Evaluation in

Counseling and Development. 2004; 37(2):70.

173. Lilley M, Barker T, Britton C. The development and evaluation of a software

prototype for computer-adaptive testing. Computers & Education. 2004;

43(1–2):109-123.

174. Tippins NT, Beaty J, Drasgow F, et al. Unproctored Internet Testing in

Employment Settings. Personnel Psychology. 2006; 59(1):189-225.

175. Crins MHP, Roorda LD, Smits N, et al. Calibration and Validation of the

Dutch-Flemish PROMIS Pain Interference Item Bank in Patients with

Chronic Pain. PLoS ONE. 2015; 10(7):e0134094.

176. Dijkers MP. A computer adaptive testing simulation applied to the FIM

instrument motor component. Archives of Physical Medicine and

Rehabilitation. 2003; 84(3):384-393.

177. Djaja N, Janda M, Olsen CM, Whiteman DC, Chien TW. Estimating Skin

Cancer Risk: Evaluating Mobile Computer-Adaptive Testing. Journal of

Medical Internet Research. 2016; 18(1):e22.

178. Quereux G, Moyse D, Lequeux Y, et al. Development of an individual score

for melanoma risk. European Journal of Cancer Prevention. 2011; 20(3):217-

224.

179. Mar V, Wolfe R, Kelly JW. Predicting melanoma risk for the Australian

population. Australasian Journal of Dermatology. 2011; 52(2):109-116.

180. Fortes C, Mastroeni S, Bakos L, et al. Identifying individuals at high risk of

melanoma: a simple tool. European Journal of Cancer Prevention. 2010;

19(5):393-400.

181. Usher-Smith JA, Emery J, Kassianos AP, Walter FM. Risk Prediction Models

for Melanoma: A Systematic Review. Cancer Epidemiology Biomarkers &

Prevention. 2014; 23(8):1450-1463.

182. Olsen CM, Neale RE, Green AC, et al. Independent Validation of Six

Melanoma Risk Prediction Models. J Invest Dermatol. 2015.

183. Fears TR, Guerry D, Pfeiffer RM, et al. Identifying Individuals at High Risk

of Melanoma: A Practical Predictor of Absolute Risk. Journal of Clinical

Oncology. 2006; 24(22):3590-3596.

184. Cho E, Rosner BA, Feskanich D, Colditz GA. Risk Factors and Individual

Probabilities of Melanoma for Whites. Journal of Clinical Oncology. 2005;

23(12):2669-2675.

185. Kvaskoff M, Pandeya N, Green AC, et al. Site-Specific Determinants of

Cutaneous Melanoma: A Case–Case Comparison of Patients with Tumors

Arising on the Head or Trunk. Cancer Epidemiology Biomarkers &

Prevention. 2013; 22(12):2222-2231.

186. Olsen CM, Green AC, Neale RE, et al. Cohort profile: The QSkin Sun and

Health Study. International Journal of Epidemiology. 2012;

41(4):929-929i.

187. Vuong K, Armstrong BK, Weiderpass E, et al. Development and external

validation of a melanoma risk prediction model based on self-assessed risk

factors. JAMA Dermatology. 2016.

188. Cust AE, Schmid H, Maskiell JA, et al. Population-based, Case-Control-

Family Design to Investigate Genetic and Environmental Influences on

Melanoma Risk: Australian Melanoma Family Study. American Journal of

Epidemiology. 2009; 170(12):1541-1554.

189. English DR, Armstrong BK. Identifying people at high risk of cutaneous

malignant melanoma: results from a case-control study in Western Australia.

British medical journal (Clinical research ed.). 1988; 296(6632):1285.

190. Newton-Bishop JA, Chang YM, Elliott F, et al. Relationship between sun

exposure and melanoma risk for tumours in different body sites in a large

case-control study in a temperate climate. European Journal of Cancer.

2011; 47(5):732-741.

191. Newton-Bishop JA, Chang YM, Iles MM, et al. Melanocytic Nevi, Nevus

Genes, and Melanoma Risk in a Large Case-Control Study in the United

Kingdom. Cancer Epidemiology Biomarkers & Prevention. 2010;

19(8):2043-2054.

192. Veierød MB, Adami HO, Lund E, Armstrong BK, Weiderpass E. Sun and

Solarium Exposure and Melanoma Risk: Effects of Age, Pigmentary

Characteristics, and Nevi. Cancer Epidemiology Biomarkers & Prevention.

2010; 19(1):111-120.

193. Roswall N, Sandin S, Adami HO, Weiderpass E. Cohort Profile: The Swedish

Women’s Lifestyle and Health cohort. International Journal of

Epidemiology. June 10, 2015.

194. National Human Genome Research Institute. PhenX Measure: Skin Cancer

2010; https://www.phenxtoolkit.org/toolkit_content/PDF/PX170601.pdf.

Accessed 1 June 2015.

195. Mislevy RJ. Linking Educational Assessments: Concepts, Issues, Methods,

and Prospects. Princeton, NJ: Educational Testing Service; 1992.

196. Van Der Mei IAF, Blizzard L, Ponsonby AL, Dwyer T. Validity and

reliability of adult recall of past sun exposure in a case-control study of

multiple sclerosis. Cancer Epidemiology Biomarkers and Prevention. 2006;

15(8):1538-1544.

197. Cockburn M, Hamilton A, Mack T. Recall Bias in Self-reported Melanoma

Risk Factors. American Journal of Epidemiology. 2001;

153(10):1021-1026.

198. Weinstock MA, Colditz GA, Willett WC, Stampfer MJ, Rosner B, Speizer FE.

Recall (Report) Bias and Reliability in the Retrospective Assessment of

Melanoma Risk. American Journal of Epidemiology. 1991;

133(3):240-245.

199. van der Mei IAF, Blizzard L, Ponsonby AL, Dwyer T. Validity and

Reliability of Adult Recall of Past Sun Exposure in a Case-Control Study of

Multiple Sclerosis. Cancer Epidemiology Biomarkers & Prevention. 2006;

15(8):1538-1544.

200. Wu S, Ho SC, Lam TP, et al. Development and validation of a lifetime

exposure questionnaire for use among Chinese populations. Scientific reports.

2013; 3.

201. Rosso S, Miñarro R, Schraub S, Tumino R, Franceschi S, Zanetti R.

Reproducibility of skin characteristic measurements and reported sun

exposure history. International Journal of Epidemiology. 2002;

31(2):439-446.

202. Buller DB, Cokkinides V, Hall HI, et al. Prevalence of sunburn, sun

protection, and indoor tanning behaviors among Americans: Review from

national surveys and case studies of 3 states. Journal of the American

Academy of Dermatology. 2011; 65(5, Supplement 1):S114.e111-S114.e111.

203. Adams A, Soumerai S, Lomas J, Ross-Degnan D. Evidence of self-report bias

in assessing adherence to guidelines. International Journal for Quality in

Health Care. 1999; 11(3):187-192.

204. Manne S, Lessin S. Prevalence and Correlates of Sun Protection and Skin

Self-Examination Practices Among Cutaneous Malignant Melanoma

Survivors. J Behav Med. 2006; 29(5):419-434.

205. Weinstock MA, Risica PM, Martin RA, et al. Reliability of assessment and

circumstances of performance of thorough skin self-examination for the early

detection of melanoma in the Check-It-Out Project. Preventive Medicine.

2004; 38(6):761-765.

206. Stenberg C, Larkö O. Sunscreen application and its importance for the sun

protection factor. Archives of Dermatology. 1985; 121(11):1400-1402.

207. Stokes R, Diffey B. How well are sunscreen users protected?

Photodermatology, Photoimmunology & Photomedicine. 1997; 13(5-6):186-

188.

208. Hall DM, McCarty F, Elliott T, Glanz K. Lifeguards' sun protection habits

and sunburns: Association with sun-safe environments and skin cancer

prevention program participation. Archives of Dermatology. 2009;

145(2):139-144.

209. Montague M, Borland R, Sinclair C. Slip! Slop! Slap! and SunSmart, 1980-

2000: Skin Cancer Control and 20 Years of Population-Based Campaigning.

Health Education & Behavior. 2001; 28(3):290-305.

210. Smith BJ, Ferguson C, McKenzie J, Bauman A, Vita P. Impacts from

repeated mass media campaigns to promote sun protection in Australia.

Health Promotion International. 2002; 17(1):51-60.

211. Paul C, Tzelepis F, Girgis A, Parfitt N. The Slip Slop Slap years: Have they

had a lasting impact on today's adolescents? Health Promotion Journal of

Australia. 2003; 14(3):219-221.

212. Hill D, White V, Marks R, Borland R. Changes in sun-related attitudes and

behaviours, and reduced sunburn prevalence in a population at high risk of

melanoma. European Journal of Cancer Prevention. 1993; 2(6):447-456.

213. Dobbinson SJ, Wakefield MA, Jamsen KM, et al. Weekend Sun Protection

and Sunburn in Australia: Trends (1987–2002) and Association with

SunSmart Television Advertising. American Journal of Preventive Medicine.

2008; 34(2):94-101.

214. Bränström R, Kasparian NA, Chang YM, et al. Predictors of Sun Protection

Behaviors and Severe Sunburn in an International Online Study. Cancer

Epidemiology Biomarkers & Prevention. 2010;

19(9):2199-2210.

215. Usher-Smith JA, Emery J, Kassianos AP, Walter FM. Risk prediction models

for melanoma: A systematic review. Cancer Epidemiology Biomarkers &

Prevention. June 3, 2014.

216. Quéreux G, Nguyen JM, Volteau C, Lequeux Y, Dréno B. Creation and test

of a questionnaire for self-assessment of melanoma risk factors. European

Journal of Cancer Prevention. 2010; 19(1):48-54.

217. O'Riordan D, Glanz K, Gies P, Elliott T. A pilot study of the validity of self-

reported ultraviolet radiation exposure and sun protection practices among

lifeguards, parents and children. Photochem Photobiol. 2008; 84:774-778.

218. O'Riordan D, Lunde KB, Steffen AD, Maddock JE. Validity of beachgoers'

self-report of their sun habits. Archives of Dermatology. 2006; 142(10):1304-

1311.

219. O'Riordan DL, Nehl E, Gies P, et al. Validity of covering-up sun-protection

habits: Association of observations and self-report. Journal of the American

Academy of Dermatology. 2009; 60(5):739-744.

220. Oh SS, Mayer JA, Lewis EC, et al. Validating outdoor workers' self-report of

sun protection. Preventive Medicine. 2004; 39(4):798-803.

221. Santmyire BR, Feldman SR, Fleischer AB. Lifestyle high-risk behaviors and

demographics may predict the level of participation in sun-protection

behaviors and skin cancer primary prevention in the United States. Cancer.

2001; 92(5):1315-1324.

222. Bränström R, Brandberg Y, Holm L, Sjöberg L, Ullen H. Beliefs, knowledge

and attitudes as predictors of sunbathing habits and use of sun protection

among Swedish adolescents. European Journal of Cancer Prevention. 2001;

10(4):337-345.

223. Stone VB, Parker V, Quarterman M, Lee C. The relationship between skin

cancer knowledge and preventive behaviors used by parents. Dermatology

Nursing. 1999; 11(6):411.

224. Jackson KM, Aiken LS. A psychosocial model of sun protection and

sunbathing in young women: The impact of health beliefs, attitudes, norms,

and self-efficacy for sun protection. Health Psychology. 2000; 19(5):469-478.

225. Kim BH, Glanz K, Nehl EJ. Vitamin D beliefs and associations with

sunburns, sun exposure, and sun protection. International Journal of

Environmental Research and Public Health. 2012; 9(7):2386-2395.

226. Sax LJ, Gilmartin SK, Bryant AN. Assessing Response Rates and

Nonresponse Bias in Web and Paper Surveys. Research in Higher Education.

2003; 44(4):409-432.

227. Underwood D, Kim H, Matier M. To Mail or To Web: Comparisons of

Survey Response Rates and Respondent Characteristics. Paper presented at:

40th Annual Forum of the Association for Institutional Research; 21-24 May 2000;

Cincinnati, OH.

228. Ortner TM. On changing the position of items in personality questionnaires:

Analysing effects of item sequence using IRT. Psychology Science. 2004;

46(4):466-476.

229. Kingston NM, Dorans NJ. The effect of the position of an item within a test

on item responding behavior: An analysis based on item response theory. ETS

Research Report Series. 1982; 1982(1):i-26.

230. Meyers JL, Miller GE, Way WD. Item position and item difficulty change in

an IRT-based common item equating design. Applied Measurement in

Education. 2008; 22(1):38-60.

231. Hohensinn C, Kubinger KD, Reif M, Holocher-Ertl S, Khorramdel L, Frebort

M. Examining item-position effects in large-scale assessment using the

Linear Logistic Test Model. Psychology Science. 2008; 50(3):391.

232. Hambleton RK, Traub RE. The Effects of Item Order on Test Performance

and Stress. The Journal of Experimental Education. 1974; 43(1):40-46.

233. Galesic M, Bosnjak M. Effects of Questionnaire Length on Participation and

Indicators of Response Quality in a Web Survey. Public Opinion Quarterly.

2009; 73(2):349-360.

234. Choi IC, Kim KS, Boo J. Comparability of a paper-based language test and a

computer-based language test. Language Testing. 2003; 20(3):295-320.

235. Lee GL, Weerakoon P. The role of computer-aided assessment in health

professional education: a comparison of student performance in computer-

based and paper-and-pen multiple-choice tests. Medical Teacher. 2001;

23(2):152-157.

236. Clariana R, Wallace P. Paper–based versus computer–based assessment: key

factors associated with the test mode effect. British Journal of Educational

Technology. 2002; 33(5):593-602.

237. Miller ET, Neal DJ, Roberts LJ, et al. Test-retest reliability of alcohol

measures: is there a difference between internet-based assessment and

traditional methods? Psychology of Addictive Behaviors. 2002; 16(1):56.

238. Ackerman TA, Gierl MJ, Walker CM. Using multidimensional item response

theory to evaluate educational and psychological tests. Educational


239. McDonald RP. A basis for multidimensional item response theory. Applied

Psychological Measurement. 2000; 24(2):99-114.

240. Ackerman T. Graphical representation of multidimensional item response

theory analyses. Applied Psychological Measurement. 1996; 20(4):311-329.

241. Reckase M. Multidimensional item response theory. Vol 150: Springer; 2009.

242. Pastor DA. The use of multilevel item response theory modeling in applied

research: An illustration. Applied Measurement in Education. 2003;

16(3):223-243.

243. Van Nispen RMK, Dirk L, Langelaan M, De Boer MR, Terwee CB, Van

Rens GH. Applying multilevel item response theory to vision-related quality

of life in Dutch visually impaired elderly. Optometry & Vision Science. 2007;

84(8):710-720.

244. Gies P, Glanz K, O'Riordan D, Elliott T, Nehl E. Measured occupational solar

UVR exposures of lifeguards in pool settings. American Journal of Industrial

Medicine. 2009; 52(8):645-653.

245. Thieden E, Philipsen PA, Wulf HC. Compliance and data reliability in sun

exposure studies with diaries and personal, electronic UV dosimeters.

Photodermatology, Photoimmunology & Photomedicine. 2006; 22(2):93-99.

246. Glanz K, Gies P, O'Riordan DL, et al. Validity of Self-reported Solar UVR

Exposure Compared with Objectively Measured UVR Exposure. Cancer

Epidemiology Biomarkers & Prevention. 2010;

19(12):3005-3012.

247. Mullis IV, Martin MO, Gonzalez EJ, Chrostowski SJ. TIMSS 2003

International Mathematics Report: Findings from IEA's Trends in

International Mathematics and Science Study at the Fourth and Eighth

Grades. ERIC; 2004.

248. Neidorf TS, Binkley M, Gattis K, Nohara D. Comparing Mathematics

Content in the National Assessment of Educational Progress (NAEP), Trends

in International Mathematics and Science Study (TIMSS), and Program for

International Student Assessment (PISA) 2003 Assessments. Technical

Report. NCES 2006-029. National Center for Education Statistics. 2006.

249. Reddy V. Cross‐national achievement studies: learning from South Africa's

participation in the Trends in International Mathematics and Science Study

(TIMSS). Compare: A Journal of Comparative and International Education.

2005; 35(1):63-77.

250. Mullis IV, Kennedy AM, Martin MO, Sainsbury M. PIRLS 2006 Assessment

Framework and Specifications: Progress in International Reading Literacy

Study. ERIC; 2004.

251. Martin MO, Mullis IV, Kennedy AM. Progress in International Reading

Literacy Study (PIRLS): PIRLS 2006 Technical Report. ERIC; 2007.

252. Goldstein H. International comparisons of student attainment: some issues

arising from the PISA study. Assessment in Education: Principles, Policy &

Practice. 2004; 11(3):319-330.

253. Bevans M, Ross A, Cella D. Patient-Reported Outcomes Measurement

Information System (PROMIS®): Efficient, Standardized Tools to Measure

Self-Reported Health and Quality of Life. Nursing Outlook. 2014;

62(5):339-345.

254. Schalet BD, Cook KF, Choi SW, Cella D. Establishing a Common Metric for

Self-Reported Anxiety: Linking the MASQ, PANAS, and GAD-7 to

PROMIS Anxiety. Journal of Anxiety Disorders. 2014; 28(1):88-96.

255. Riley WT, Pilkonis P, Cella D. Application of the National Institutes of

Health Patient-Reported Outcomes Measurement Information System

(PROMIS®) to Mental Health Research. The Journal of Mental Health Policy

and Economics. 2011; 14(4):201-208.

256. Cella D, Yount S, Rothrock N, et al. The Patient-Reported Outcomes

Measurement Information System (PROMIS): Progress of an NIH Roadmap

Cooperative Group During its First Two Years. Medical Care. 2007;

45(5):S3-S11.

257. Coups EJ, Stapleton JL, Hudson SV, et al. Linguistic acculturation and skin

cancer–related behaviors among Hispanics in the southern and western United

States. JAMA Dermatology. 2013; 149(6):679-686.

258. Coups EJ, Stapleton JL, Hudson SV, et al. Skin cancer surveillance behaviors

among US Hispanic adults. Journal of the American Academy of

Dermatology. 2013; 68(4):576-584.

259. Korta DZ, Saggar V, Wu TP, Sanchez M. Racial differences in skin cancer

awareness and surveillance practices at a public hospital dermatology clinic.

Journal of the American Academy of Dermatology. 2014; 70(2):312-317.

260. Harvey VM, Patel H, Sandhu S, Wallington SF, Hinds G. Social determinants

of racial and ethnic disparities in cutaneous melanoma outcomes. Cancer

Control. 2014; 21(4):343-349.

Appendices

Appendix – SunAUS Scale

