14 AJR:184, January 2005
AJR 2005;184:14–19
0361-803X/05/1840514
© American Roentgen Ray Society
Weinstein et al. · Clinical Evaluation of Diagnostic Tests
Fundamentals of Clinical Research for Radiologists
Susan Weinstein1
Nancy A. Obuchowski2
Michael L. Lieber2
Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.
This is the 13th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).
Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.
Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.
1Department of Radiology, University of Pennsylvania Medical Center, Philadelphia, PA 19104. Address correspondence to S. Weinstein.
2Departments of Biostatistics and Epidemiology and Radiology, The Cleveland Clinic Foundation, Cleveland, OH 44195.
The evaluation of the accuracy of diagnostic tests and the appropriate interpretation of test results are the focus of much of radiology and its research. In this article, we first will review the basic definitions of diagnostic test accuracy, including a brief introduction to receiver operating characteristic (ROC) curves. Then we will evaluate how diagnostic tests can be used to address clinical questions such as "Should this patient undergo this diagnostic test?" and, after ordering the test and seeing the test result, "What is the likelihood that this patient has the disease?" We will finish with a discussion of some important concepts for designing research studies that estimate or compare diagnostic test accuracy.
Defining Diagnostic Test Accuracy
Sensitivity and Specificity
There are two basic measures of the inherent accuracy of a diagnostic test: sensitivity and specificity. They are equally important, and one should never be reported without the other. Sensitivity is the probability of a positive test result (that is, the test indicates the presence of disease) for a patient with the disease. Specificity, on the other hand, is the probability of a negative test result (that is, the test does not indicate the presence of disease) for a patient without the disease. We use the term "disease" here loosely to mean the condition (e.g., breast cancer, deep venous thrombosis, intracranial aneurysm) that the diagnostic test is supposed to detect. We calculate the test's specificity based on patients without this condition, but these patients often have other diseases.
Table 1 summarizes the definitions of sensitivity and specificity [1]. The table rows give the results of the diagnostic test, as either positive for the disease of interest or negative for the disease of interest. The columns indicate the true disease status, as either disease present or disease absent. True-positives (TPs) are those patients with the disease who test positive. True-negatives (TNs) are those without the disease who test negative. False-negatives (FNs) are those with the disease for whom the test falsely indicates the disease is not present. False-positives (FPs) are those without the disease for whom the test falsely indicates the presence of disease. Sensitivity, then, is the probability of a TP among patients with the disease (TPs + FNs). Specificity is the probability of a TN among patients without the disease (TNs + FPs).
Consider the following example. Carpenter et al. [2] evaluated the diagnostic accuracy of MR venography (MRV) to detect deep venous thrombosis (DVT). They performed MRV in a group of 85 patients who presented with clinical symptoms of DVT. The patients also underwent contrast venography, which is an invasive procedure considered to provide an unequivocal diagnosis for DVT (the so-called gold standard test or standard of reference). Of a total of 101 venous systems evaluated, 27 had DVT by contrast venography. All 27 cases were detected on MRV; thus, the sensitivity of MRV was 27/27, or 100%. Of 74 venous systems without DVT, as confirmed by contrast venography, three tested positive on MRV (that is, three FPs). The specificity of MRV was 71/74, or 96% (Table 2).
Combining Multiple Tests
Few diagnostic tests are both highly sensitive and highly specific. For this reason, patients sometimes are diagnosed using two or more tests. These tests may be performed ei-
are, however, an infinite number of ROC curves that could pass through point A, two of which are depicted by dashed curves in Figure 2. Some of these ROC curves could be superior to the ROC curve for the T–K ratio for most FPRs and others inferior. Based on the single sensitivity and specificity reported by the investigator, we cannot determine whether the T–A ratio is superior or inferior in relation to the T–K ratio. However, if we had been given the ROC curves of both the T–A and T–K ratios, then we could compare these two diagnostic tests and determine, for any range of FPRs, which test is preferred.

This example illustrates the importance of ROC curves and why they have become the state-of-the-art method for describing the diagnostic accuracy of a test. In a future module in this series, Obuchowski [6] provides a detailed account of ROC curves, including constructing smooth ROC curves, estimating various summary measures of accuracy derived from them, finding the optimal cutoff on the ROC curve for a particular clinical application, and identifying available software.
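For readers who want to experiment, an empirical ROC curve of the kind plotted in Figure 1 can be constructed by sweeping a cutoff across the observed test values. The sketch below uses hypothetical enhancement-ratio scores, not the data of Herts et al. [4]:

```python
def empirical_roc(scores_diseased, scores_healthy):
    """Return (FPR, sensitivity) pairs obtained by sweeping the cutoff
    over every observed score; a case is called positive when its
    score >= cutoff. Connecting the points yields the empirical ROC curve."""
    cutoffs = sorted(set(scores_diseased) | set(scores_healthy), reverse=True)
    points = [(0.0, 0.0)]  # the strictest cutoff calls nothing positive
    for c in cutoffs:
        tpr = sum(s >= c for s in scores_diseased) / len(scores_diseased)
        fpr = sum(s >= c for s in scores_healthy) / len(scores_healthy)
        points.append((fpr, tpr))
    return points

# Hypothetical scores for 5 diseased and 5 healthy subjects:
diseased = [0.42, 0.38, 0.31, 0.25, 0.11]
healthy = [0.30, 0.12, 0.08, 0.05, 0.02]
for fpr, tpr in empirical_roc(diseased, healthy):
    print(f"FPR = {fpr:.1f}, sensitivity = {tpr:.1f}")
```

Each cutoff yields one (FPR, sensitivity) pair, exactly as in Table 3; the curve always runs from (0, 0) to (1, 1).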
Interpretation of Diagnostic Tests
Calculating the Positive and Negative Predictive Values

Clinicians are faced each day with the challenge of deciding appropriate management for patients, based at least in part on the results of less-than-perfect diagnostic tests. These clinicians need answers to the following questions: "What is the likelihood that this patient has the disease when the test result is positive?" and "What is the likelihood that this patient does not have the disease when the test result is negative?" The answers to these questions are known as the positive and negative predictive values, respectively. We illustrate these with the following example.
The lemon sign has been described as an important indicator of spina bifida. Nyberg et al. [7] describe the sensitivity and specificity of the lemon sign in the detection of spina bifida in a high-risk population (elevated maternal serum α-fetoprotein level, suspected hydrocephalus or neural tube defect, or family history of neural tube defect). A portion of their data is summarized in Table 4.
Spina bifida occurred in 6.1% (14/229) of the sample; that is, the sample prevalence was 6.1%. The lemon sign was seen in 92.9% (13/14) of the fetuses with spina bifida (92.9% sensitivity), and was absent in 98.6% (212/215) of the fetuses without spina bifida (98.6% specificity).
We also can calculate the positive and negative predictive values of the lemon sign from the available data. The positive predictive value (PPV) is the probability that the fetus has spina bifida when the lemon sign is present. The PPV is calculated as follows:
PPV = TP / (TP + FP) = 13 / (13 + 3) × 100% = 81.3%  (1)
The PPV differs from sensitivity. While the PPV tells us the probability of spina bifida following detection of the lemon sign (that probability is 0.813, or 81.3%), the sensitivity tells us the probability that the lemon sign will be present among fetuses with spina bifida (that probability is 0.929, or 92.9%). PPV helps the clinician decide how to treat the patient after the diagnostic test comes back positive. Sensitivity, on the other hand, is a property of the diagnostic test and helps the clinician decide which test to use.
The corollary to the PPV is the negative predictive value (NPV), that is, the probability that spina bifida will not be present when the lemon sign is absent. The NPV is calculated as follows:
NPV = TN / (TN + FN) = 212 / (212 + 1) × 100% = 99.5%  (2)
If the lemon sign is absent, there is a 99.5% chance that the fetus will not have spina bifida. The NPV is different from the test's specificity. Specificity tells us the probability that the lemon sign will be absent among fetuses without spina bifida (that probability is 0.986, or 98.6%).
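Equations 1 and 2 can be checked directly from the Table 4 counts. The short Python sketch below is our own illustration:

```python
def ppv(tp, fp):
    """P(disease | positive test), from observed counts."""
    return tp / (tp + fp)

def npv(tn, fn):
    """P(no disease | negative test), from observed counts."""
    return tn / (tn + fn)

# Lemon-sign data of Nyberg et al. [7], Table 4:
print(ppv(tp=13, fp=3))   # 13/16 = 0.8125, the 81.3% of equation 1
print(npv(tn=212, fn=1))  # 212/213, the 99.5% of equation 2
```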
Fig. 1. Ten pairs of sensitivity and specificity as calculated in Table 3. The y-axis is the sensitivity and the x-axis is 1 minus the specificity, or the false-positive rate (FPR). The receiver operating characteristic (ROC) curve is created by connecting the points with line segments. (Plot annotations: cutoff = 0.11, cutoff = 0.38, perfect test, chance diagonal.)

Fig. 2. Single cutoff point (labeled A) in relation to the receiver operating characteristic (ROC) curve for the T–K (tumor enhancement to normal kidney enhancement) ratio. (Plot annotations: superior ROC curve, T–K ratio ROC curve, inferior ROC curve, points A and O.)
The PPV and NPV can also be calculated from Bayes' theorem. Bayes' theorem allows us to compute the PPV and NPV from estimates of the test's sensitivity and specificity, and the probability of the disease before the test is applied. The latter is referred to as the pretest probability and is based on the patient's previous medical history, previous and recent exposures, current signs and symptoms, and results of other screening and diagnostic tests performed. When this information is unknown, or when calculating the PPV or NPV for a population, the prevalence of the disease in the population is used as the pretest probability. The PPV and NPV, then, are called posttest probabilities (also, revised or posterior probabilities), and represent the probability of the disease after the test result is known.

Let p denote the pretest probability of disease, and SE and SP the sensitivity and specificity of the diagnostic test. Recalling the expression for a conditional probability (see module 10 [8]),
PPV = P(disease | + test) = (SE × p) / [SE × p + (1 − SP) × (1 − p)]  (3)

NPV = P(no disease | − test) = [SP × (1 − p)] / [SP × (1 − p) + (1 − SE) × p]  (4)
Thus, the posttest probability of disease for any patient can be calculated if one knows the accuracy of the test and the patient's pretest probability of disease.
The PPV and NPV can vary markedly, depending on the patient's pretest probability, or the prevalence of disease in the population. In the Nyberg et al. [7] study, the prevalence rate of spina bifida in their high-risk sample was 6.1%. In the general population, however, the prevalence of spina bifida is much lower, about 0.1%. Filly [9] studied the predictive ability of the lemon sign in the general population. He assumed that the sensitivity of the lemon sign was 90.0% and the specificity was 98.6% (very similar to those in Nyberg's small study, 92.9% and 98.6%, respectively). In a sample of 10,000 fetuses from a low-risk population (see Table 5), Filly showed that the positive predictive value is only 6%. This is in contrast to the PPV of 81.3% in the Nyberg study. The drastic difference in PPVs is due to the different prevalence rates of spina bifida in the two samples, 6.1% in Nyberg's and 0.1% in Filly's. Thus, while a high-risk fetus with a lemon sign may have an 81% chance of having spina bifida, a low-risk fetus with a lemon sign has a "94% chance of being perfectly normal" [9]. This example illustrates the importance of reporting the pretest probability or prevalence rate of disease whenever one presents a PPV or NPV.
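The prevalence dependence above follows directly from equations 3 and 4. As an illustration (ours, not the authors'), the same sensitivity and specificity yield very different PPVs at the two prevalences:

```python
def ppv_bayes(se, sp, p):
    """Positive predictive value from Bayes' theorem (equation 3)."""
    return (se * p) / (se * p + (1 - sp) * (1 - p))

def npv_bayes(se, sp, p):
    """Negative predictive value from Bayes' theorem (equation 4)."""
    return (sp * (1 - p)) / (sp * (1 - p) + (1 - se) * p)

# High-risk sample of Nyberg et al. [7]: prevalence 6.1%
print(ppv_bayes(se=0.929, sp=0.986, p=0.061))  # ~0.81
# General population (Filly [9]): prevalence 0.1%, assumed SE = 90%
print(ppv_bayes(se=0.90, sp=0.986, p=0.001))   # ~0.06
```

The only input that changes between the two calls is the pretest probability p, yet the PPV drops from about 81% to about 6%.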
Rationale for Ordering a Diagnostic Test
The previous section described how clinicians can use the results of a diagnostic test to plan a patient's management. Let's back up a bit in the clinical decision-making process and look at the rationale for ordering a diagnostic test.
In the simplest scenario (ignoring monetary costs, insurance reimbursement rates, etc.), there are three pieces of information that a clinician needs to determine whether a diagnostic test should or should not be ordered:
1. From the patient's previous medical history, previous and recent exposures, current signs and symptoms, and results of other screening and diagnostic tests performed, what is the probability that this patient has the disease (that is, the pretest probability)?
2. How accurate (sensitivity and specificity) is the diagnostic test being considered?
3. Could the results of this test affect the patient's management?
In the previous section, we saw how the pretest probability and the test's sensitivity and specificity fit into Bayes' theorem to tell us the posttest probability of disease. We also saw, even for a very accurate test, how the PPV can be quite low when the pretest probability is low. The clinician ordering a test needs to consider how the patient will be managed if the test result is negative versus if the test result is positive. If the probability of disease will still be low after a positive test, then the test may have no impact on the patient's management.

An example is screening for intracranial aneurysms in the general population. The prevalence of aneurysms in the general population is low, maybe 1%. Even though magnetic resonance angiography (MRA) may have excellent accuracy, say 95% sensitivity and specificity, the PPV is still quite low, 0.16 (16%) from equation 3. Considering the nontrivial risks of invasive catheter angiography (which is the usual presurgical tool) [10], the clinician may decide that even after a positive MRA, the patient should not undergo catheter angiography. In this scenario, the clinician may decide not to order the MRA, given that its result, either positive or negative, will not impact the patient's management.
Designing Studies to Estimate and Compare Tests' Diagnostic Accuracy

As with all new medical devices, treatments, and procedures, the efficacy of diagnostic tests must be assessed in clinical studies. In the second module of this series, Jarvik [11] described six levels of diagnostic efficacy. Here, we will focus on the second level, which is the stage at which investigators assess the diagnostic accuracy of a test.
Phases in the Assessment of Diagnostic Test Accuracy

There typically are three phases to the assessment of a diagnostic test's accuracy [3]. The first is the exploratory phase. It usually is the first clinical study performed to assess the efficacy of a new diagnostic test. These tend to be small, inexpensive studies, typically involving 10 to 50 patients with and without the disease of interest. The patients selected for the study samples often are cases with classic, overt disease (for example, symptomatic lung cancer) and healthy volunteer controls. If the test results of these two populations do not differ, then it is not worth pursuing the diagnostic test further.

The second phase is the challenge phase. Here, we recognize that a diagnostic test's sensitivity and specificity can vary with the extent and stage of the disease, and the comorbidities present. Thus, in this phase we select patients with subtle or early disease, and
TABLE 4: Lemon Sign Versus Spinal Cord Defect in Fetuses Prior to 24 Weeks

Lemon Sign | Spina Bifida | No Spina Bifida | Total
+          | 13           | 3               | 16
−          | 1            | 212             | 213
Total      | 14           | 215             | 229

Note. SE = 92.9%, SP = 98.6%, PPV = 81.3%, NPV = 99.5%, prevalence = 6.1%.
TABLE 5: The PPV of the Lemon Sign in the General Population

Lemon Sign | Spina Bifida | No Spina Bifida | Total
+          | 9            | 140             | 149
−          | 1            | 9,850           | 9,851
Total      | 10           | 9,990           | 10,000

Note. SE = 90.0%, SP = 98.6%, PPV = 6.0%, NPV = 99.99%, prevalence = 0.1%.
with comorbidities that could interfere with the diagnostic test [12]. For example, in a study to assess the ability of MRI to detect lung cancer, the study patients might include those with small nodules (under 3 cm) and patients with nodules and interstitial disease. The controls might have diseases in the same anatomic location as the disease of interest, for example, interstitial disease but no nodules. These studies often include competing diagnostic tests to compare their accuracies with the test under evaluation. ROC curves are most often used to assess and compare the tests. If the diagnostic test shows good accuracy, then it can be considered for the third phase of assessment.

The third phase is the advanced phase. These studies often are multicenter studies involving large numbers of patients (100 or more). The patient sample should be representative of the target clinical population. For example, instead of selecting patients with known lung cancer and controls without cancer, we might recruit patients presenting to their primary care physician with a persistent cough or bloody sputum. Further testing and follow-up will determine which patients have lung cancer and which do not.

It is from this third phase that we obtain reliable estimates of a test's accuracy for the target clinical population. Estimates of accuracy from the exploratory phase usually are too optimistic because "the sickest of the sick are compared with the wellest of the well" [13]. In contrast, estimates of accuracy from the challenge phase often are too low because the patients are exceptionally difficult to diagnose.
Common Features of Diagnostic Test Accuracy Studies

The studies in the three phases differ in terms of their objectives, sampling of patients, and sample sizes. There are, however, some common features to all studies of diagnostic test accuracy, as summarized in Table 6. We elaborate here on a few important issues.

Studies of diagnostic test accuracy require subjects both with and without the disease of interest. If one of these populations is not represented in the study, then either sensitivity or specificity cannot be calculated. We stress that reporting one without reference to the other is uninformative and often misleading. The number of patients needed for diagnostic accuracy studies depends on the phase of the study, the clinical setting in which the test will be applied (for example, screening or diagnostic), and certain characteristics of the patients and the test itself (for example, does the test require interpretation by human observers?). Statistical methods are available for determining the appropriate sample size for diagnostic accuracy studies [3, 14].
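As a rough illustration of such a sample-size calculation (this simple normal-approximation formula is our own sketch; the rigorous methods are given in references [3, 14]), the number of diseased subjects needed to estimate sensitivity to within a chosen margin can be computed as follows:

```python
from math import ceil

def n_for_sensitivity(se_expected, margin, z=1.96):
    """Number of subjects WITH the disease needed to estimate sensitivity
    to within +/- margin at ~95% confidence, using the simple
    normal-approximation formula n = z^2 * SE * (1 - SE) / margin^2.
    An analogous calculation with the expected specificity gives the
    number of subjects WITHOUT the disease."""
    return ceil(z**2 * se_expected * (1 - se_expected) / margin**2)

# e.g., expecting ~90% sensitivity, estimated to within +/- 5%:
print(n_for_sensitivity(0.90, 0.05))  # 139 diseased subjects
```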
Studies of diagnostic test accuracy require a test or procedure for determining the true disease status of each patient. This test or procedure is called the gold standard (or standard of reference, or reference standard, particularly when there is no perfect gold standard). The gold, or reference, standard must be conducted and interpreted blinded to the diagnostic test results to avoid bias. Common standards of reference in radiology studies are surgery, pathology results, and clinical follow-up. For example, in the study of Carpenter et al. [2] of the accuracy of MR venography for detecting deep venous thrombosis, contrast venography was used as the reference standard. Sometimes a study uses more than one type of reference standard. For example, in a study to assess the accuracy of mammography, patients with a suspicious lesion on mammography might undergo core biopsy and/or surgery, whereas patients with a negative mammogram would need to be followed for 2 years either to confirm that the patient was cancer free or to detect missed cancers on follow-up screenings. Note that when using different reference standards for patients with positive and negative test results, it is important that all the reference standards be infallible, or nearly so. One form of workup bias occurs when patients with one test result undergo a less rigorous reference standard than patients with a different test result [3].

Determining the appropriate reference standard often is the most difficult part of designing a diagnostic accuracy study. Reference standards should be infallible, or nearly so. This is difficult, however, because even pathology is not infallible; it is an interpretive field relying on subjective assessment by human observers with varying skill levels. One such example is the reader variability in the pathologic interpretation of borderline intraductal breast carcinoma versus atypical ductal hyperplasia. While some pathologists may interpret the lesion as intraductal cancer, others may interpret the same lesion as atypical ductal hyperplasia. While often we have to accept that a reference standard is not perfect, it is important that it be nearly infallible. If the reference standard is not nearly infallible, then imperfect gold standard bias can lead to unreliable and misleading estimates of accuracy. Zhou et al. [3] discuss imperfect gold standard bias in detail and offer possible solutions.
In other situations, no reference standard is available (for example, epilepsy), or it is unethical to subject patients to the reference standard because it poses a risk (for example, an invasive test such as catheter angiography). In these situations, we at least can correlate the test results with other tests' findings and with clinical outcome, even if we cannot report the test's sensitivity and specificity.

It is never an option to omit from the calculation of sensitivity and specificity those patients without a diagnosis confirmed by a reference standard. Such studies yield erroneous estimates of test accuracy due to a form of workup bias called verification bias [17, 18].
TABLE 6: Common Features of Diagnostic Test Accuracy Studies

Feature | Explanation
Two samples of patients | One sample of patients with and one sample without the disease are needed to estimate both sensitivity and specificity.
Well-defined patient samples | Regardless of the sampling scheme used to obtain patients for the study, the characteristics of the study patients (e.g., age, gender, comorbidities, stage of disease) should be reported.
Well-defined diagnostic test | The diagnostic test must be clearly defined and applied in the same fashion to all study patients.
Gold standard/reference standard | The true disease status of each study patient must be determined by a test or procedure that is infallible, or nearly so.
Sample of interpreters | If the test relies on a trained observer to interpret it, then two or more such observers are needed to independently interpret the test [15].
Blinded interpretations | The gold standard should be conducted and interpreted blinded to the results of the diagnostic test, and the diagnostic test should be performed and interpreted blinded to the results of the gold standard.
Standard reporting of findings | The results of the study should be reported following published guidelines for the reporting of diagnostic test accuracy [16].
This is one of the most common types of bias in radiology studies [19], and it is counterintuitive. Investigators often believe they are getting more reliable estimates of accuracy by excluding cases in which the reference standard was not performed. If, however, the diagnostic test results were used in the decision of whether to perform the reference standard procedure, then verification bias most likely is present. For example, if the results of MR venography are used to determine which patients will undergo contrast venography, and if patients who did not undergo contrast venography are excluded from the calculations of the test's accuracy, then verification bias exists. Zhou et al. [3] discuss verification bias from a statistical standpoint and offer a variety of solutions.
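The inflation caused by verification bias can be illustrated with expected counts. The numbers below are hypothetical (ours, not from the article): 1,000 diseased patients, a true sensitivity of 80%, and a workup in which every test-positive patient receives the reference standard but only 10% of test-negative patients do.

```python
def observed_sensitivity(n_diseased, true_se, verify_pos=1.0, verify_neg=0.1):
    """Expected 'sensitivity' when only verified patients enter the analysis.
    Test-positive patients are verified with probability verify_pos and
    test-negative patients with probability verify_neg (expected counts,
    not a random simulation)."""
    tp = n_diseased * true_se * verify_pos        # verified true-positives
    fn = n_diseased * (1 - true_se) * verify_neg  # verified false-negatives
    return tp / (tp + fn)

# 1,000 diseased patients, true sensitivity 80%, but only 10% of
# test-negative patients undergo the reference standard:
print(f"{observed_sensitivity(1000, 0.80):.1%}")  # 97.6%, badly inflated
```

Because most of the unverified patients are test-negative, false-negatives are undercounted, and the naive estimate of sensitivity rises from the true 80% to almost 98%.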
Summary

We conclude with a summary of five key points in the clinical evaluation of diagnostic tests:
1. Sensitivity and specificity always should be reported together.
2. ROC curves allow a comprehensive assessment and comparison of diagnostic test accuracy.
3. PPV and NPV cannot be interpreted correctly without knowing the prevalence of disease in the study sample.
4. Patients who did not undergo the reference standard procedure should never be omitted from studies of diagnostic test accuracy.
5. Published guidelines should be followed when reporting the findings from studies of diagnostic test accuracy.
References
1. Gehlbach SH. Interpretation: sensitivity, specificity, and predictive value. In: Gehlbach SH, ed. Interpreting the medical literature. New York: McGraw-Hill, 1993:129–139
2. Carpenter JP, Holland GA, Baum RA, Owen RS, Carpenter JT, Cope C. Magnetic resonance venography for the detection of deep venous thrombosis: comparison with contrast venography and duplex Doppler ultrasonography. J Vasc Surg 1993;18:734–741
3. Zhou XH, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York: Wiley & Sons, 2002
4. Herts BR, Coll DM, Novick AC, Obuchowski N, Linnell G, Wirth SL, Baker ME. Enhancement characteristics of papillary renal neoplasms revealed on triphasic helical CT of the kidneys. AJR 2002;178:367–372
5. Metz CE. ROC methodology in radiological imaging. Invest Radiol 1986;21:720–733
6. Obuchowski NA. Receiver operating characteristic (ROC) analysis. AJR 2005 (in press)
7. Nyberg DA, Mack LA, Hirsch J, Mahony BS. Abnormalities of fetal cranial contour in sonographic detection of spina bifida: evaluation of the "lemon" sign. Radiology 1988;167:387–392
8. Joseph L, Reinhold C. Introduction to probability theory and sampling distributions. AJR 2003;180:917–923
9. Filly RA. The "lemon" sign: a clinical perspective. Radiology 1988;167:573–575
10. Levey AS, Pauker SG, Kassirer JP, et al. Occult intracranial aneurysms in polycystic kidney disease: when is cerebral arteriography indicated? N Engl J Med 1983;308:986–994
11. Jarvik JG. The research framework. AJR 2001;176:873–877
12. Ransohoff DJ, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926–930
13. Sox HC Jr, Blatt MA, Higgins MC, Marton KI. Medical decision making. Boston: Butterworth-Heinemann, 1988
14. Beam CA. Strategies for improving power in diagnostic radiology research. AJR 1992;159:631–637
15. Obuchowski NA. How many observers are needed in clinical studies of medical imaging? AJR 2004;182:867–869
16. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Acad Radiol 2003;10:664–669
17. Begg CB, McNeil BJ. Assessment of radiologic tests: control of bias and other design considerations. Radiology 1988;167:565–569
18. Black WC. How to evaluate the radiology literature. AJR 1990;154:17–22
19. Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research: getting better but still not good. JAMA 1995;274:645–651
The reader's attention is directed to earlier articles in the Fundamentals of Clinical Research series:
1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004