Poster: Test-Retest Reliability and Equivalence of PRO Measures



A literature review of the variance in ‘interval length’ between administrations for assessment of test-retest reliability and equivalence of PRO measures

Helen Anderson¹, Nuz Quadri¹, Diane Wild¹, Paul O’Donohoe², Willie Muehlhausen¹

¹ Oxford Outcomes, an ICON plc Company, Oxford, United Kingdom
² CRF Health, London, United Kingdom

www.oxfordoutcomes.com

Background

Repeatability, or test-retest reliability, is an important component of the psychometric validation of patient-reported outcome (PRO) measures, and is referred to in the FDA PRO guidance document (2009) as being a key indicator of an instrument’s validity. Equivalence testing is designed to evaluate the comparability of PRO scores between an electronic mode of administration and paper-and-pencil administration, or between different electronic platforms. Coons et al. (2009) recommend that when the original PRO has undergone a moderate change during its migration to an electronic platform, an equivalence study is required to ensure that the psychometric properties have not changed.

There are a number of related designs available for both test-retest reliability and equivalence studies. In test-retest studies the administration platform is the same on the first and second administration (paper or electronic), but in equivalence studies respondents will complete one administration on the original version (usually paper) and the other on an electronic platform.

Reliability is defined as the ratio of the variation of true scores to the variation of observed scores (Laenen et al., 2006), often measured by test-retest correlations. Discrepancies between the scores can occur due to transient or temporal error, which is error due to the repeated measurement of the same subject at different time points (Schmidt et al., 2003). Various factors can contribute towards this type of error: carryover effects such as memory and practice, the recall period used in the PRO, and the stability of the condition being measured. One of the design decisions to be made is the period of time between the first and second administration of the measure. A shorter interval runs the risk of potential memory or practice effects and a longer period runs the risk of the condition having changed between administrations.
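The ratio definition of reliability can be written out explicitly. The sketch below assumes the classical test theory decomposition of an observed score into a true score and an uncorrelated error term; the symbols X, T, E and the variance notation are illustrative assumptions and are not taken from the poster or from Laenen et al. (2006).

```latex
% Sketch under classical test theory assumptions (not stated in the poster):
% observed score X = true score T + error E, with T and E uncorrelated.
\[
  X = T + E, \qquad
  \text{reliability} \;=\; \frac{\sigma_T^2}{\sigma_X^2}
  \;=\; \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
\]
```

Under these assumptions, anything that adds to the error variance between administrations, such as transient error or a real change in the condition, lowers the ratio, which is why the choice of interval matters.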

There is very little literature addressing the issue of the appropriate length of interval required between two administrations (Marx et al., 2003). The FDA PRO guidance document states that “the time interval chosen depends on the variability of the state or experience being evaluated and on the potential for change in the condition or population over time that reflects actual change in the condition rather than variability in stable patients.”

The objective of this literature review was to determine what administration intervals are commonly used in the development and validation of PROs and to determine whether there is any pattern in terms of what is currently done based on the criteria described above.

Method

A literature search was conducted in PsycINFO, using the following search terms: ‘test retest reliability’, ‘equivalence testing’, ‘washout period’, ‘interval’.

The search was limited to the past 10 years (2003-2013) and to ‘English language’ articles, yielding a total of 554 abstracts.

Forty-six additional abstracts from a meta-analytic review of equivalence studies conducted by Gwaltney et al (2008) were included. A further 65 abstracts were included from a more recent meta-analysis (in press), resulting in a total of 665 abstracts.

The abstracts were reviewed by researchers, who extracted and collated the administration interval where available. Full papers were retrieved where required in order to obtain the interval used.

Abstracts were included if they were test-retest and/or equivalence studies and used a PRO measure. Studies were excluded if clinical outcome assessments other than PROs were used, if a cross-over design was not used, or if the interval was not clear from the full paper. Figure 1 below shows the numbers of abstracts reviewed from each source, and the final number of studies reviewed to extract the information.

Results

Of the 375 studies reviewed, 99 were equivalence studies and 276 were test-retest studies. The administration interval used varied widely, ranging from no interval (completed immediately) to a 7-year interval.

Administration intervals for test-retest studies ranged from 1 minute to 7 years. The most commonly used interval was 2 weeks (22%). Administration intervals for equivalence studies ranged from no interval to 1 month (with an outlier of a 6-month interval). The most commonly used interval was one hour or less (30%).

Information on the medical conditions that were investigated in the studies was also extracted. For the test-retest studies the most common conditions were mental health conditions (such as anxiety, depression, and bipolar disorder), fatigue, cancer, and pain. For the equivalence studies, the most common conditions were mental health, respiratory (such as asthma and chronic obstructive pulmonary disease (COPD)), arthritic conditions (such as rheumatoid arthritis and osteoarthritis), cancer, and pain.

In order to understand more about how the intervals were used across both types of studies, the intervals of three conditions were assessed more closely: pain, mental health and cancer. The interval used in these three conditions is provided in Figures 2 and 3.

Figure 2. Interval used in equivalence studies for pain, cancer and mental health

Figure 3. Interval used in test-retest studies for pain, cancer and mental health

The analysis of these three conditions shows that, although the same conditions are being investigated, the interval differs according to the type of study being conducted. The equivalence study intervals are shorter, with a modal interval of one hour or less, whereas the test-retest study intervals are longer, with a modal interval of two weeks to one month.
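The grouping of intervals into categories and the identification of a modal category can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the category labels are those shown in Figures 2 and 3, but the boundary handling (lower bounds inclusive except for the first category), the 30-day month, and the function names `categorize_interval` and `modal_category` are assumptions, and the example intervals are invented.

```python
from collections import Counter

# Units in hours. The 30-day month is an assumption for illustration.
HOUR = 1.0
DAY = 24 * HOUR
WEEK = 7 * DAY
MONTH = 30 * DAY

# (exclusive upper bound in hours, category label from Figures 2 and 3).
# Boundary handling is an assumption: an interval of exactly 2 weeks
# falls in "2 weeks to 1 month", exactly 2 months in "2 months or over".
UPPER_BOUNDS = [
    (DAY, "1 hour to 1 day"),
    (WEEK, "1 day to 1 week"),
    (2 * WEEK, "1 week to 2 weeks"),
    (MONTH, "2 weeks to 1 month"),
    (2 * MONTH, "1 to 2 months"),
]


def categorize_interval(hours):
    """Map an administration interval (in hours) to a category label."""
    if hours < 0:
        raise ValueError("interval must be non-negative")
    if hours <= HOUR:
        return "1 hour or less"
    for upper, label in UPPER_BOUNDS:
        if hours < upper:
            return label
    return "2 months or over"


def modal_category(intervals_in_hours):
    """Return the most frequent category among the given intervals."""
    counts = Counter(categorize_interval(h) for h in intervals_in_hours)
    return counts.most_common(1)[0][0]


# Hypothetical intervals in hours, invented for illustration only.
example_intervals = [0, 0.5, 1, 48, 3 * WEEK, 3 * WEEK]
print(modal_category(example_intervals))
```

Making the boundary rule explicit matters because a commonly used interval such as exactly 2 weeks sits on the edge between two categories, so the modal category can shift depending on which side it is assigned to.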

Figure 1. Flow chart of number of abstracts reviewed

Conclusion

There is no clear guidance on what interval is most appropriate to use in test-retest or equivalence studies, beyond the need to balance considerations of changes in health state and the need for patients to forget how they responded during the first completion.

This lack of guidance is reflected in the wide variety of intervals seen in the literature. Further complications are seen in the difference of interval lengths used for different types of studies (i.e. test-retest and equivalence) and also for different conditions. While the literature seems to indicate the use of different interval lengths for test-retest versus equivalence studies in the same conditions, the authors are not aware of any theoretical reason for why this might be justified.

These findings highlight the need for a considered discussion around the identification of appropriate interval length for test-retest and equivalence studies. Issues that need to be considered when selecting the most appropriate interval include: the stability of the condition, the complexity and length of the measure, and the recall period used in the measure.

References

Coons SJ, Gwaltney CJ, Hays RD, et al. (2009). Recommendations On Evidence Needed To Support Measurement Equivalence Between Electronic And Paper-Based Patient-Reported Outcome (PRO) Measures: ISPOR ePRO Good Research Practices Task Force Report. Value Health, 12, 419-429.

Gwaltney CJ, Shields AL, Shiffman S. (2008). Equivalence of electronic and paper-and-pencil administration of patient-reported outcomes measures: a meta-analytic review. Value Health, 11, 322-333.

Laenen A, Vangeneugden T, Geys H, et al. (2006). Generalized reliability estimation using repeated measurements. British Journal of Mathematical and Statistical Psychology, 59, 113-131.

Marx RG, Menezes A, Horovitz L, et al. (2003). A comparison of two time intervals for test-retest reliability of health status instruments. Journal of Clinical Epidemiology, 56(8), 730-735.

Schmidt FL, Le H, Ilies R. (2003). Beyond Alpha: an empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual differences constructs. Psychological Methods, 8, 206-224.

US Food and Drug Administration: Final Guidance for Industry (2009). Patient-reported outcome measures: Use in medical product development to support labelling claims.

Figure 1 data. Number of abstracts reviewed by source: PsycINFO 554; Gwaltney et al 46; recent meta-analysis (in press) 65. Number of relevant studies, respectively: 282; 40; 53. Total number of studies reviewed: 376.

Figure 3 chart (test-retest studies). Vertical axis: frequency (0 to 14). Horizontal axis: interval length categories (1 hour or less; 1 hour to 1 day; 1 day to 1 week; 1 week to 2 weeks; 2 weeks to 1 month; 1 to 2 months; 2 months or over). Series: pain, mental health, cancer.

Figure 2 chart (equivalence studies). Vertical axis: frequency (0 to 7). Horizontal axis: interval length categories (1 hour or less; 1 hour to 1 day; 1 day to 1 week; 1 week to 2 weeks; 2 weeks to 1 month; 1 to 2 months; 2 months or over). Series: pain, mental health, cancer.