
Journal of Biomedical Informatics 62 (2016) 224–231


Using automatically extracted information from mammography reports for decision-support

Selen Bozkurt a, Francisco Gimenez b, Elizabeth S. Burnside c, Kemal H. Gulkesen a, Daniel L. Rubin b,*

a Akdeniz University Faculty of Medicine, Department of Biostatistics and Medical Informatics, Antalya, Turkey
b Department of Radiology and Medicine (Biomedical Informatics Research), Stanford University, Richard M. Lucas Center, 1201 Welch Road, Office P285, Stanford, CA 94305-5488, United States
c Department of Radiology, University of Wisconsin, Madison, WI, United States

* Corresponding author. E-mail address: [email protected] (D.L. Rubin). URL: http://rubin.web.stanford.edu/ (D.L. Rubin).

http://dx.doi.org/10.1016/j.jbi.2016.07.001
1532-0464/© 2016 Elsevier Inc. All rights reserved.


Article history:
Received 20 March 2016
Revised 22 June 2016
Accepted 2 July 2016
Available online 4 July 2016

Keywords:
Breast Imaging Reporting and Data System (BI-RADS)
Information extraction
Natural language processing
Decision support systems

Objective: To evaluate a system we developed that connects natural language processing (NLP) for information extraction from narrative text mammography reports with a Bayesian network for decision support about breast cancer diagnosis. The ultimate goal of this system is to provide decision support as part of the workflow of producing the radiology report.

Materials and methods: We built a system that uses an NLP information extraction system (which extracts BI-RADS descriptors and clinical information from mammography reports) to provide the necessary inputs to a Bayesian network (BN) decision support system (DSS) that estimates lesion malignancy from BI-RADS descriptors. We used this integrated system to predict diagnosis of breast cancer from radiology text reports and evaluated it with a reference standard of 300 mammography reports. We collected two different outputs from the DSS: (1) the probability of malignancy and (2) the BI-RADS final assessment category. Since NLP may produce imperfect inputs to the DSS, we compared the difference between using perfect ("reference standard") structured inputs to the DSS ("RS-DSS") vs NLP-derived inputs ("NLP-DSS") on the output of the DSS using the concordance correlation coefficient. We measured the classification accuracy of the BI-RADS final assessment category when using NLP-DSS, compared with the ground truth category established by the radiologist.

Results: The NLP-DSS and RS-DSS had closely matched probabilities, with a mean paired difference of 0.004 ± 0.025. The concordance correlation of these paired measures was 0.95. The accuracy of the NLP-DSS to predict the correct BI-RADS final assessment category was 97.58%.

Conclusion: The accuracy of the information extracted from mammography reports using the NLP system was sufficient to provide accurate DSS results. We believe our system could ultimately reduce the variation in practice in mammography related to assessment of malignant lesions and improve management decisions.


1. Introduction

Screening mammography is a key approach in the early detection of breast cancer [1]. It is limited by a risk of causing false-positive findings leading to potentially unnecessary follow-up imaging and biopsies [2]. In addition, it has been shown that large-scale screening is subject to variability due to the inherent subjective nature of evaluating mammograms [3–10]. There are several sources of radiologist variability, such as age, experience and training of the radiologist, the number of mammograms performed, time between mammograms, and the availability of previous studies [6,8,10–12]. The Mammography Quality Standards Act (MQSA) attempts to reduce the variability of mammography practice by requiring radiologists to report their outcomes [13]. Based on those outcomes, studies have been carried out to identify particular performance levels for mammography interpretations [13,14] as targets for improving the quality of medical practice. For instance, a positive predictive value (PPV) of biopsy recommendation less than 20% or greater than 40%, and a cancer detection rate less than 2.5 per 1000 interpretations, were metrics identified as indicating low performance [14].

In order to advance quality and to improve the overall performance of radiologists, decision support systems (DSS) that use quantitative decision-making methods [15,16] have been advocated. Despite their potential to enhance clinical practice, particularly in mammography [16–19], DSS are not widely adopted in the clinic. A review by Garg [20] summarized several key barriers to implementation of DSS, including failure of practitioners to use the system, poor usability or integration into practitioner workflow [21], or practitioner non-acceptance of the computer recommendations. In their review, studies in which users were automatically prompted to use the system had better performance compared with studies in which users were required to actively initiate the system to receive decision support [20]. A particular challenge is that most current DSS require a parallel workflow, in which the clinician enters pertinent findings into the system after creating a report that already documents the same findings [18,19,22]. This duplicative data entry activity is both time-inefficient and error-prone, and could explain some of the reasons that DSS are not yet widely used in mammography practice, despite the potential advantages.

A strategy for deploying DSS into the clinical workflow is to provide automated structured entry of the necessary data into the DSS. Although the need for this strategy is widely acknowledged [20,23,24], few DSS provide automated data entry. In particular, in mammography, the narrative radiology report is commonly the only format used for recording and communicating the imaging results to the referring physician or others. Even when auditing requirements lead many radiology practices to use structured reporting systems, narrative reports are invariably produced as the final product to preserve effective communication with referring physicians. This reduces the availability of machine-interpretable structured data for direct use by DSS. In addition, the high volume and fast pace of medical practice puts emphasis on efficiency, hindering efforts for more detailed structured capture of report data in routine clinical practice; in fact, narrative reporting using voice recognition greatly dominates radiology reporting [25].

Given that most radiology results are in a narrative, unstructured format, NLP tools that extract structured information from narrative radiology reports and directly feed that information into a DSS could close a critical gap hindering the translation of DSS into the clinical workflow [24,26,27], ultimately resulting in substantial reductions in variation in practice and improvement in the quality of care. Many prior works have shown the utility of NLP to extract information from text reports and to represent that information in a computable format suitable for DSS [27–31]. The structured output of these systems was used for automated classification to infer whether a variety of clinical conditions were present in the text [27,28,32–34]. The focus of these prior works was on automatic text classification, rather than accurate estimation of the probability of disease, such as that provided by a Bayesian network using inputs from the NLP. Accurate probability estimation is a critical task in mammography, in which decisions about patient management are made based on the probability of disease. In this study, we produce a DSS that estimates the probability of disease directly from NLP extraction of the pertinent information from mammography text reports. The ultimate goal is to provide decision support to radiologists during the routine workflow of dictating narrative reports.

A particular challenge to using NLP to provide the structured information needed by a DSS is that the performance of the latter depends on the accuracy of its inputs. Although many NLP systems have excellent performance in their analysis of unstructured text [26,27,34–37], no NLP system has perfect performance. It is thus imperative to understand the sensitivity of the DSS to imperfect input from the NLP system. However, to our knowledge, no prior studies have measured the effect of imperfect NLP information extraction on the performance of a DSS.

The first goal of our work is to evaluate whether an NLP system we developed has sufficient accuracy to create an NLP-driven mammography reporting decision support system (NLP-DSS) that (1) extracts from narrative mammography reports structured information about breast lesions and clinical information about the patient, and (2) inputs this structured information into a Bayesian network (BN) to provide decision support about the diagnosis based on the information in the radiology report, computed immediately after the radiologist completes the report. The second goal of our work is to assess the robustness of the performance of the DSS given imperfect NLP extraction from the input narrative text.

2. Materials and methods

2.1. System overview

The NLP-driven mammography reporting decision support system (NLP-DSS) consists of two main parts: (1) an NLP system for automatic annotation and extraction of data required for decision support in mammography reports and (2) a decision support model to provide decision support about the diagnosis based on the information in the radiology report. Since the NLP system and the decision support model are integrated into the NLP-DSS, this system can provide "real-time" decision support to radiologists, as soon as they have completed dictating their report, providing the probability of malignancy and enabling them to determine if their conclusions are consistent with the predicted diagnoses. Thus, this provides a workflow that could permit incorporating decision support into the radiology interpretation workflow (occurring concomitant with reporting), without requiring a parallel data entry process (Fig. 1).
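As a sketch of this two-part flow (our illustration, not the authors' code; the function names and canned values are hypothetical), the following Python fragment shows how the extraction and inference steps compose so that a completed report immediately yields one probability per lesion:

```python
from typing import Dict, List

def extract_lesion_frames(report_text: str) -> List[Dict[str, str]]:
    # Hypothetical stand-in for the NLP module: the real system returns one
    # frame of BI-RADS descriptors per lesion described in the report.
    # A canned frame is returned here purely to illustrate the data flow.
    return [{"abnormality": "mass", "shape": "oval", "margin": "obscured"}]

def probability_of_malignancy(frame: Dict[str, str]) -> float:
    # Hypothetical stand-in for the Bayesian network inference step.
    return 0.02  # placeholder value, not a real model output

def nlp_dss(report_text: str) -> List[float]:
    # Integrated NLP-DSS: free-text report in, one probability per lesion out.
    return [probability_of_malignancy(f) for f in extract_lesion_frames(report_text)]

print(nlp_dss("There is a 1 cm oval nodular density with an obscured margin ..."))
```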

2.1.1. Natural language processing system to extract data required for NLP-DSS

We previously developed an NLP system for automatic annotation and extraction of imaging observations that characterize breast lesions, the locations of the lesions, and other attributes of breast lesions described in mammography reports [38]. We used BI-RADS [39] to provide a controlled terminology for the terms used in mammography reports for describing named entities (imaging observations and locations of lesions). The BI-RADS terms are useful to unify the variety of terminological variants occurring in texts that describe these named entities; using terminologies (or ontologies) such as BI-RADS is a common approach for mapping textual descriptions to canonical meanings [40]. The BI-RADS terminology contains descriptors, which are specialized terms that describe breast density and lesion features (types of imaging observations). Since BI-RADS is not distributed in a structured format, we previously created a simple ontology structure of this terminology for our system ("BI-RADS ontology") [38]. Our NLP system takes as input a free text mammography report and produces as output a set of information frames summarizing each lesion described in the report and its attributes, with all terms being normalized to the BI-RADS terminology [38] (Fig. 2a). The system performed extraction of imaging observations with their modifiers from text reports with precision = 94.9%, recall = 90.9%, and F-measure = 92% [38].
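For illustration, the information frame for the sentence in Fig. 2a might be rendered as the following Python dictionary (our rendering; the field names are illustrative, not the system's actual schema):

```python
# One information frame per lesion, with all terms normalized to BI-RADS.
# Field names here are illustrative placeholders, not the system's schema.
frame = {
    "abnormality": "mass",     # "nodular density" normalized to BI-RADS "mass"
    "size": "1 cm",
    "shape": "oval",
    "margin": "obscured",
    "laterality": "right breast",
    "depth": "anterior",
}
```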

Since the decision model requires clinical information in addition to imaging observations, we extended our NLP system by adding a component to detect and extract personal and family history of breast cancer. This clinical information is reported in the history section of the mammography reports, so our module segmented this section of the report and extracted the information using a similar rule-based approach (Fig. 2b) to the one we described previously [38]. We used the ConText algorithm [41], which determines whether the clinical conditions mentioned in clinical reports are negated, hypothetical, historical, or experienced by someone other than the patient. In addition, to run the decision model, we combined associated lesions, such as "there is skin thickening associated with this mass", into a single abnormality with its characteristics. This context is needed in order to appropriately report these clinical variables as being observed or not observed in the decision support model.
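A minimal sketch of the kind of context classification that ConText performs on history-section sentences (our simplification with toy trigger terms; the published algorithm [41] uses curated trigger lists and scope rules):

```python
import re

# Toy trigger terms in the spirit of ConText; the real algorithm uses
# curated trigger lists and scope rules (Harkema et al. [41]).
NEGATION = re.compile(r"\b(no|denies|without)\b", re.I)
FAMILY = re.compile(r"\b(mother|sister|daughter|family history)\b", re.I)

def classify_history(sentence: str) -> dict:
    """Classify a history-section mention of breast cancer as
    negated/affirmed and as experienced by the patient or family."""
    return {
        "negated": bool(NEGATION.search(sentence)),
        "experiencer": "family" if FAMILY.search(sentence) else "patient",
    }

print(classify_history("Family history: mother with breast cancer."))
# -> {'negated': False, 'experiencer': 'family'}
print(classify_history("No personal history of breast cancer."))
# -> {'negated': True, 'experiencer': 'patient'}
```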

Fig. 1. Decision support tools developed and their relationship to the radiology workflow.

Fig. 2a. Output from NLP system produced from the input sentence, "There is a 1 cm oval nodular density with an obscured margin in the right breast in the anterior depth." The upper rows represent entities whose modifier values are shown in lower rows.

Fig. 2b. Output from NLP system on GATE NLP GUI.

2.1.2. Bayesian Network (BN) decision support model

We used a previously developed Bayesian network decision-support system for this study [42]. This BN ("diagnosis-prediction BN") contains a variable for each of the BI-RADS descriptors, as well as variables capturing the clinical variables of personal and family history of breast cancer. Given a set of BI-RADS descriptors describing a single lesion in a mammography report and the associated patient data provided in that report as input to the BN, the diagnosis-prediction BN outputs the probability of malignancy given the observed findings (Fig. 3). We chose this model for its strong performance in this diagnostic task (area under the curve (AUC) = 0.96, sensitivity = 90.0%, specificity = 93.0%) [42]. In addition to predicting the probability of lesion malignancy, we modified the original BN model to create a second BN model that predicts the BI-RADS final assessment category (hereafter referred to as "BI-RADS category"). Among the clinical variables in this new model (age, hormone therapy, personal and family history of breast cancer), very few mammography reports described the age (14 reports) and hormone treatment history (77 reports); thus, we removed these two variables from our decision model. As a pre-processing step, the continuous variable "lesion size" was categorized as small (size < 3 cm) or large (size ≥ 3 cm).
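As a toy illustration of how a network of this kind turns observed descriptors into a posterior probability, the sketch below applies Bayes' rule under a naive conditional-independence assumption with made-up parameters (the published BN [42] has a richer structure and data-derived probabilities); it also shows the lesion-size discretization used as a pre-processing step:

```python
def discretize_size(size_cm: float) -> str:
    # Pre-processing used by the model: small (< 3 cm) vs large (>= 3 cm).
    return "small" if size_cm < 3.0 else "large"

# Made-up illustrative parameters, NOT the published model's values.
PRIOR_MALIGNANT = 0.05
LIKELIHOODS = {  # (descriptor, state): (P(state | malignant), P(state | benign))
    ("margin", "spiculated"): (0.60, 0.05),
    ("margin", "obscured"):   (0.20, 0.15),
    ("shape", "irregular"):   (0.55, 0.10),
    ("shape", "oval"):        (0.15, 0.40),
}

def posterior_malignancy(findings: dict) -> float:
    """Naive-Bayes-style posterior over the observed descriptors."""
    p_m, p_b = PRIOR_MALIGNANT, 1.0 - PRIOR_MALIGNANT
    for key, value in findings.items():
        like_m, like_b = LIKELIHOODS[(key, value)]
        p_m *= like_m
        p_b *= like_b
    return p_m / (p_m + p_b)

findings = {"shape": "oval", "margin": "obscured"}
print(f"P(malignant | findings) = {posterior_malignancy(findings):.3f}")
```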

2.2. Integrating NLP and decision support

In order to integrate the NLP system and the BN model, we adapted our previous NLP system so that its output could be directly consumed by this Bayesian network model [18,22,42,43]. To do so, we built an interface to map the outputs of our NLP system to the BN by mapping the information frames output by our NLP system to the state variable inputs of the BN. Our NLP-DSS is thus an integrated system, taking free text radiology reports as input, and outputting the probability of malignancy for each lesion described in the mammography report (the probability of malignancy is specific to each lesion in the mammogram; it is possible to integrate the results of multiple lesions into a composite probability score, but in practice, each lesion is evaluated separately).
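A minimal sketch of such a mapping interface (all frame fields, node names, and states here are hypothetical placeholders, not the actual schemas of [38,42]):

```python
# Map (NLP frame field, value) pairs onto BN variable names and states.
# All names here are illustrative, not the actual schemas.
FRAME_TO_BN = {
    ("margin", "obscured"): ("MassMargin", "Obscured"),
    ("shape", "oval"): ("MassShape", "Oval"),
    ("personal_history", "present"): ("PersonalHistory", "Yes"),
}

def frame_to_evidence(frame: dict) -> dict:
    """Translate one NLP information frame into BN evidence,
    skipping fields the decision model does not use."""
    evidence = {}
    for field, value in frame.items():
        node_state = FRAME_TO_BN.get((field, value))
        if node_state is not None:
            node, state = node_state
            evidence[node] = state
    return evidence

print(frame_to_evidence({"margin": "obscured", "shape": "oval", "depth": "anterior"}))
# -> {'MassMargin': 'Obscured', 'MassShape': 'Oval'}
```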

2.3. Evaluation

2.3.1. Reference standard inputs for DSS

The output of the DSS depends on the accuracy of its inputs, and since our NLP system may produce imperfect inputs to the DSS, we developed a reference standard set of "ground truth" inputs for evaluating the DSS. We randomly selected 300 mammography reports from a report database from an academic radiology practice. A fellowship-trained, subspecialty-expert breast-imaging radiologist reviewed each of these 300 reports to determine the set of BI-RADS descriptors that described each breast lesion in the radiology reports; this served as our reference standard set of BI-RADS descriptors for testing the DSS (called the RS-DSS), to be compared with results obtained when using NLP for the input (NLP-DSS).

Fig. 3. Decision support system flowchart.

2.3.2. Evaluation of clinical history extraction

Since in this work we extended our previous NLP system to extract additional information (specifically, the clinical history), we evaluated the accuracy of this task in terms of precision and recall by comparing the clinical history information produced by our system with that determined by the radiologist who reviewed the reports. For this evaluation, we used the same 300 mammography reports which we previously used to evaluate our NLP system [38].
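For reference, precision and recall are computed in the standard way from true positive (TP), false positive (FP), and false negative (FN) counts; a small helper, shown with the personal-history counts reported in Section 3.1:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Counts from Section 3.1 for personal cancer history extraction:
# TP = 175, FP = 3, FN = 2.
p, r = precision_recall(tp=175, fp=3, fn=2)
# Precision computes to ~98.3% (as reported); recall computes to ~98.9%
# (reported in the paper as 98.8%).
print(f"precision = {p:.1%}, recall = {r:.1%}")
```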

2.3.3. Evaluation of NLP-DSS

The primary outcome for our evaluation of the NLP-DSS was whether the probability of malignancy and the BI-RADS category (which represents a categorization of the probability of malignancy) matched that produced by the same decision model run on the reference standard. Since the reference standard report set did not include any report classified as BI-RADS 6, our evaluation did not include any BI-RADS 6 cases. We assessed the agreement in two ways: (1) agreement in absolute value of the probability of malignancy and (2) agreement of the BI-RADS category. We assessed the latter because it is used for clinical decision making [44]. For assessing the probability of malignancy, the diagnosis-prediction BN was used as the decision model in the NLP-DSS, and for assessing the BI-RADS category, the BI-RADS category-prediction BN was used as the decision model.

We assessed the agreement in the probability of malignancy using the concordance correlation coefficient [45], comparing the probability of malignancy predicted by our system using free text reports and that determined by the decision model applied to the same cases using the reference standard data as input. A two-tailed p value less than 0.05 was considered statistically significant.
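Lin's concordance correlation coefficient for paired measurements x and y is 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²) [45]; a minimal sketch:

```python
import numpy as np

def concordance_ccc(x, y) -> float:
    """Lin's concordance correlation coefficient [45]:
    rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (ddof=0) moments."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                 # population variances
    cov = ((x - mx) * (y - my)).mean()        # population covariance
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# e.g., paired per-lesion probabilities of malignancy (NLP-DSS vs RS-DSS):
print(concordance_ccc([0.02, 0.10, 0.85], [0.03, 0.12, 0.80]))
```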

To assess the accuracy of the NLP-DSS in terms of the qualitative BI-RADS category, we created confusion matrices and calculated the percentage agreement in the BI-RADS categories between that predicted by our NLP-DSS and that determined by the decision model applied to the same cases in the reference standard. We conducted the evaluation in two ways: (1) personal and cancer history variables were included in the BN, and (2) personal and cancer history nodes were not included in the BN.
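Percentage agreement over BI-RADS categories is simply the diagonal mass of the confusion matrix; a minimal sketch:

```python
from collections import Counter

def category_agreement(predicted, reference):
    """Confusion counts and percent agreement between two
    sequences of BI-RADS category labels."""
    confusion = Counter(zip(reference, predicted))
    agree = sum(n for (ref, pred), n in confusion.items() if ref == pred)
    return confusion, agree / sum(confusion.values())

conf, acc = category_agreement(["B0", "B1-2", "B4"], ["B0", "B1-2", "B1-2"])
print(f"agreement = {acc:.1%}")  # 66.7% in this toy example
```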

3. Results

3.1. Cancer history extraction

Of the 300 reports in which we evaluated our patient cancer history extraction module, family cancer history was reported in 90 (30%) reports, while personal cancer history was reported in 177 (59%) reports. Among those, 88 of the family history (97.7%) and 175 of the personal cancer history (98.8%) extractions by our system were true positives. Two extractions were false negatives for both family and personal cancer histories (2.2% and 1.1%, respectively). In addition, two family cancer history (2.2%) and three personal cancer history (1.7%) mentions detected by our NLP system were false positives. For personal cancer history extraction, the precision of extracting the breast cancer history using our system was 98.3% and recall was 98.8%. For family history extraction, the precision and recall of extracting the breast cancer history using our system were 97.7% each. The accuracy of the remainder of the information extraction tasks of our NLP system has been reported previously [38].

3.2. NLP-DSS evaluation

Among the 300 mammography reports in our dataset, there were 702 different breast lesions. The probability of malignancy was calculated for each lesion based on descriptors and clinical values in the reference standard and based on the values of these variables derived from our NLP system. We thus had paired probabilities of malignancy for each of the 702 breast lesions. The concordance correlation between these two sets of probabilities was 0.952 (Fig. 4).

Fig. 4. Correlation between the probabilities calculated from NLP system output and reference data set.

Table 1
Evaluation of NLP-DSS based on accuracy of assigning the correct BI-RADS categories.

BI-RADS categories      BI-RADS categories assigned for Reference set
assigned for Test set   History nodes included        History nodes not included
                        B0   B1–2   B3   B4    B5     B0   B1–2   B3   B4    B5
BI-RADS 0               55   0      0    0     0      60   0      0    0     0
BI-RADS 1–2             3    484    3    3     0      3    487    3    3     0
BI-RADS 3               0    1      28   0     0      0    1      22   0     0
BI-RADS 4               2    1      0    119   0      3    0      0    117   0
BI-RADS 5               0    0      0    0     3      0    0      0    0     3
Total                   60   486    31   122   3      66   488    25   120   3

Table 2
Reasons for inconsistencies among BI-RADS categories (Reference vs. Test set).

Reference   Test    Reason                                                                               Frequency
B0          B1–2    Lymph node and "Stability" information of the lesion was not detected by NLP         1
B0          B1–2    Lymph node was not detected by NLP                                                   2
B0          B4      Lymph node and "Stability" information of the lesion was not detected by NLP         1
B0          B4      "Density" and "Shape" information of the lesion was not detected by NLP              1
B1–2        B3      Skin thickening and "Stability" information of the lesion was not detected by NLP    1
B1–2        B4      Skin lesion was not detected by NLP                                                  1
B3          B1–2    Calcification was not found by NLP                                                   3
B4          B1–2    Calcification was not found by NLP                                                   3


We also compared the BI-RADS categories for each of the 702 breast lesions derived from the reference standard and derived from the output of the NLP system. The results are summarized in Table 1. The accuracy of the BN outputs was 98.14% (history nodes included) and 98.15% (history nodes not included).
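(As a rough check against Table 1 with history nodes included, the diagonal cells sum to 55 + 484 + 28 + 119 + 3 = 689 of 702 lesions, i.e. approximately 98.1% agreement, consistent with the reported accuracies.)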

The common reasons for the inconsistencies are summarized in Table 2. Problems extracting lymph nodes, stability, and calcifications were the primary causes of disagreement between the RS-DSS and NLP-DSS.

4. Discussion

In this paper, we show that the output of an NLP information extraction system applied to mammography reports can provide the necessary inputs to a DSS, and the outputs of this DSS could ultimately guide decision making at the time of dictating the reports. Invoking decision support inference directly from narrative radiology reports could ultimately enable its deployment in the routine clinical workflow of report generation. The results of the evaluation of our NLP-DSS are promising; we found excellent agreement in DSS outputs (probability of malignancy and the BI-RADS category) when using the inputs derived from our NLP system and those from the reference standard.

A unique aspect of our work is that in addition to assessing the accuracy of the information extraction performed by our system, we assess the impact of the imperfect extraction by the NLP system on the decision support outputs (predictions of the probability of cancer and the BI-RADS category). We found that the decision support outputs when using our NLP system, compared with those when using hand-curated structured data (the reference standard), are accurate and highly correlated. To our knowledge, this is the first study to undertake an evaluation of the impact of imperfections in automated information extraction on the accuracy of DSS output.



The concept of using narrative text as the input to decision support by integrating NLP and decision support systems has been previously described [27,34,46–52]. In their review, Demner-Fushman et al. categorize NLP-DSS systems, including specialized systems dedicated to a specific task, a set of NLP modules run by a DSS system, and stand-alone systems/services that take clinical text as input and generate output to be used in DSS systems [24]. Under this framework, our NLP-DSS is a stand-alone system, developed for a specific task, that could be customized and extended for different tasks. The Demner-Fushman et al. review also points out that an NLP system could process clinical reports in real time, and the NLP output can then be used by a DSS to provide decision support during the clinical workflow [24]. In fact, our NLP-DSS, if integrated with a voice-recognition reporting application, provides an example scenario of this, in which decision support is provided immediately after completion of the radiology report to enable the radiologist to consider whether their assessment of the likelihood of malignancy of breast cancer in the patient concurs with that predicted by a decision model. With a similar approach in a different domain [52], Evans et al. developed an automated identification and predictive risk report for hospitalized heart failure patients, and the addition of NLP increased the identification of HF patients.

Since the narrative text of radiology reports lacks the structure and controlled language needed to directly support DSS, such as Bayesian models or other computer reasoning systems that require such inputs [53], NLP methods have been proposed to bridge the gap from unstructured narrative text to structured input for decision support [24,27,31]. A number of researchers have investigated NLP methods to automatically recognize and encode the concepts conveyed in medical texts [27,29,31,34–37,48,52,54–60]. Most of those NLP systems have been created as general-purpose systems that attempt to extract diseases and findings that are commonly discussed in medical texts. In some studies, the NLP applications have been integrated in both active and passive DSS, and specific information that was extracted was used for decision support [24,50–52,61]. Although NLP methods have been used to extract information to infer a variety of conditions in text reports [27,28,34], to our knowledge, ours is the first system that uses NLP information extraction to accurately estimate the probability of disease by extracting a multitude of radiology imaging features and using them in a BN to infer the probability of disease.

A particularly relevant prior NLP system is MedLEE, a semantically-driven NLP system originally designed for decision support applications in the domain of radiological reports of the chest, which was later extended for mammography reports [62,63]. MedLEE is a generalizable system that includes extensible lexicons and deep parsing, and could theoretically be extended to meet the needs of DSS for mammography. However, its development in mammography did not incorporate BI-RADS or the patient clinical variables required for inferring the probability of breast cancer in a DSS [62]. In addition, it was not publicly available for modification or extension.

Other related work has used BI-RADS as a knowledge resource for information extraction tasks [64–69], but not to the level of detailed recognition and extraction of each lesion and its characteristics that is needed for lesion-specific decision support, as we describe in this study. Similarly, Gao et al. focused on extraction of four mammographic findings (mass, calcification, asymmetry, and architectural distortion) with their laterality information; however, they did not include other imaging observations' characteristics [35]. A study by Sippo focused only on extraction of BI-RADS final assessment categories from mammography reports [68]. A study by Nassif used a simple and effective parser, based on regular grammar expressions, to extract BI-RADS terms from English free-text documents [65,66]. The Nassif group also constructed a parser to extract Portuguese BI-RADS features [67]. Although one of the aims of those studies was to extract information to use for decision purposes, none of the prior works were integrated with a DSS with an evaluation of the decision support outputs. We thus believe that our study is the first which integrates and evaluates NLP with a decision support model for mammography interpretation.

The performance of any integrated NLP-DSS system is constrained by the quality of the NLP of the input text. Although the NLP system used in our work has good overall accuracy, it is imperfect, and an additional novelty of our work is in assessing the impact of inaccurate NLP extraction on the decision support outputs from DSS. The concordance correlation between the probabilities produced by DSS using our NLP (NLP-DSS) and when using the reference standard inputs (RS-DSS) in 702 breast lesions was 0.95. In addition, the accuracy in the qualitative BI-RADS categories for the 702 breast lesions was 98% when comparing the RS-DSS and our NLP-DSS.

A limitation of using NLP for generating the inputs for DSS is that the information in narrative texts may be inconsistent or incomplete [37]. In fact, in our study, we found that fewer than half (138 of 300) of the mammography reports described personal and family history of breast cancer, and very few reports described the age and hormone treatment history. The decision support output of our system may, in fact, help the radiologist to recognize incomplete and inconsistent reports. An enhancement we could pursue with our NLP-DSS in the future is to examine the inputs to the decision model to get insights into the type of information in the radiology report that may be incomplete and inconsistent and provide feedback about that to the radiologist, e.g., the report lacks personal and/or family history of breast cancer.

There are many approaches to NLP of narrative texts, including simple keyword extraction, information extraction, and natural language understanding. For the task of decision support with mammography reports, we argue that extraction of information frames is needed, since mammograms may have multiple lesions and it is necessary to disambiguate and associate the descriptors of the various lesions. Fiszman et al. [49] showed that an NLP system to identify pneumonia performed better than simple keyword-based methods.

An advantage of our system is that it provides the necessary input for DSS directly from the text; the data it extracts from the radiology report can directly drive the DSS. Beyond potential use in clinical practice, our NLP-DSS may facilitate large-scale studies related to covariates of breast cancer risk by enabling the collection of structured descriptions of the characteristics of breast lesions at large scale during routine clinical practice, and enable improvement in risk prediction models by allowing them to better incorporate real-time information extraction. Beyond the utility of our methods for enabling decision support, methods such as ours could enable discovery by allowing researchers to tap into large historical collections of clinical report data. The information in mammography reports is a key component in diagnosing breast cancer, and automated extraction of imaging observations from a large repository of radiology reports in which the cancer outcomes are known could permit hypothesis generation about the clinical importance of particular imaging observations, or the most appropriate thresholds of probability for malignancy warranting biopsy.

Our work has several limitations. Although our NLP-DSS produced similar results to the reference standard in terms of recognizing and extracting BI-RADS descriptors for DSS based on the narrative text of mammography reports, the generalizability of our system among different institutions has not been tested. However, we could likely extend our system as needed to accommodate local variations in reporting conventions; mammography reports tend to follow a similar constrained language and content as part of good clinical practice. Future studies with larger and more diverse reports could be performed to confirm the accuracy and generalizability of our approach to reports from other institutions.

Another limitation of our system is that it does not extract age and hormone treatment history. This information was recorded in too few reports, but our methods could be extended to capture it by adding more rules to our NLP system in the future once we acquire more report examples. In addition, our system could detect the absence of such information and prompt the radiologist to provide this information as they dictate their reports, or, if suitable interfaces are available, the age and hormone treatment history could be extracted from the electronic medical record system.

Our NLP method is domain-specific and might not be generalizable to other applications within the field of radiology. We will be investigating extensions of our system to other types of radiology reports. We did not compare the output of our system directly to the actual physician inferences (e.g., to their assessed probability of malignancy). Radiologists do not report their estimated probability of malignancy in mammography reports, so we estimated this by using a BN applied to their imaging observations in our reference standard, and we compared those probabilities (as well as the BI-RADS categories derived from them) to those produced by our NLP-DSS.

Integrating an NLP system with decision support in a production-level system may be difficult. This was the case in the Antibiotic Assistant [49], and the authors needed to implement a simpler keyword-based approach to NLP. Although we have not yet attempted to deploy our NLP-DSS into the clinical workflow, we believe it will be feasible if our NLP system can be invoked as a service to the production reporting application. Commercial radiology reporting applications have voice-to-text functionality, and they show the text reports to the radiologists as they dictate them. Some vendors provide integration interfaces, which could enable us to consume the report text and send the output of the decision support application to the radiologist's display screen. The practicality of this approach will need to be explored in future work, however.

Notwithstanding the foregoing limitations and challenges, we believe there is potential for clinical utility of our system to improve radiologist practice by enabling DSS in conjunction with reporting, providing feedback about the probability of malignancy based on the content of their reports. The ultimate impact of this on actual patient outcomes will need to be assessed in future work.

5. Conclusion

Our study to create and evaluate an NLP-DSS showed that the NLP component of our system acquires sufficient information needed as inputs to a DSS to produce results that are consistent with those obtained when using inputs from a reference standard. This raises the potential of introducing the NLP-DSS into the mammography interpretation workflow, potentially enabling real-time decision support. The NLP system performs automated extraction of BI-RADS descriptors and certain patient clinical data from the reports dictated by radiologists. A Bayesian DSS uses the BI-RADS descriptors observed by the radiologist with the patient data to estimate the probability of malignancy and BI-RADS categories. Our system could ultimately reduce the variation in practice in mammography related to assessment of malignant lesions and improve management decisions. With further testing, the system may ultimately help to improve mammography practice and improve the quality of patient care.

Conflicts of interest

There is no conflict of interest.

Acknowledgment

The Scientific and Technological Research Council of Turkey (TUBITAK) supported this project (number 2214-A).

References

[1] K. Kerlikowske, D. Grady, S.M. Rubin, C. Sandrock, V.L. Ernster, Efficacy of screening mammography: a meta-analysis, JAMA 273 (2) (1995) 149–154.

[2] S. Jenks, Mammography screening still brings mixed advice, J. Natl. Cancer Inst. 107 (8) (2015). doi:10.1093/jnci/djv232.

[3] G. Ciccone, P. Vineis, A. Frigerio, N. Segnan, Inter-observer and intra-observer variability of mammogram interpretation: a field study, Eur. J. Cancer 28A (6–7) (1992) 1054–1058.

[4] L.E. Duijm, M.W. Louwman, J.H. Groenewoud, L.V. van de Poll-Franse, J. Fracheboud, J.W. Coebergh, Inter-observer variability in mammography screening and effect of type and number of readers on screening outcome, Br. J. Cancer 100 (6) (2009) 901–907. doi:10.1038/sj.bjc.6604954.

[5] C.A. Beam, P.M. Layde, D.C. Sullivan, Variability in the interpretation of screening mammograms by US radiologists: findings from a national sample, Arch. Int. Med. 156 (2) (1996) 209–213.

[6] W.E. Barlow, C. Chi, P.A. Carney, et al., Accuracy of screening mammography interpretation by characteristics of radiologists, J. Natl. Cancer Inst. 96 (24) (2004) 1840–1850. doi:10.1093/jnci/djh333.

[7] J.G. Elmore, S.L. Jackson, L. Abraham, et al., Variability in interpretive performance at screening mammography and radiologists' characteristics associated with accuracy, Radiology 253 (3) (2009) 641–651. doi:10.1148/radiol.2533082308.

[8] J.G. Elmore, D.L. Miglioretti, L.M. Reisch, et al., Screening mammograms by community radiologists: variability in false-positive rates, J. Natl. Cancer Inst. 94 (18) (2002) 1373–1380.

[9] K. Kerlikowske, D. Grady, J. Barclay, et al., Variability and accuracy in mammographic interpretation using the American College of Radiology Breast Imaging Reporting and Data System, J. Natl. Cancer Inst. 90 (23) (1998) 1801–1809.

[10] C.A. Beam, E.F. Conant, E.A. Sickles, Association of volume and volume-independent factors with accuracy in screening mammogram interpretation, J. Natl. Cancer Inst. 95 (4) (2003) 282–290.

[11] L. Esserman, H. Cowley, C. Eberle, et al., Improving the accuracy of mammography: volume and outcome relationships, J. Natl. Cancer Inst. 94 (5) (2002) 369–375.

[12] R. Clark, Re: Accuracy of screening mammography interpretation by characteristics of radiologists, J. Natl. Cancer Inst. 97 (12) (2005) 936. doi:10.1093/jnci/dji156.

[13] M.N. Linver, J.R. Osuch, R.J. Brenner, R.A. Smith, The mammography audit: a primer for the mammography quality standards act (MQSA), AJR Am. J. Roentgenol. 165 (1) (1995) 19–25. doi:10.2214/ajr.165.1.7785586.

[14] P.A. Carney, E.A. Sickles, B.S. Monsees, et al., Identifying minimally acceptable interpretive performance criteria for screening mammography, Radiology 255 (2) (2010) 354–361. doi:10.1148/radiol.10091636.

[15] D.L. Rubin, Informatics in radiology: measuring and improving quality in radiology: meeting the challenge with informatics, Radiographics 31 (6) (2011) 1511–1527. doi:10.1148/rg.316105207.

[16] K. Ganesan, R.U. Acharya, C.K. Chua, L.C. Min, B. Mathew, A.K. Thomas, Decision support system for breast cancer detection using mammograms, Proc. Inst. Mech. Eng. H 227 (7) (2013) 721–732. doi:10.1177/0954411913480669.

[17] S.M. Stivaros, A. Gledson, G. Nenadic, X.J. Zeng, J. Keane, A. Jackson, Decision support systems for clinical radiological practice – towards the next generation, Br. J. Radiol. 83 (995) (2010) 904–914. doi:10.1259/bjr/33620087.

[18] E. Burnside, D. Rubin, R. Shachter, A Bayesian network for mammography, Proc. AMIA Symp. (2000) 106–110.

[19] C.E. Kahn Jr., L.M. Roberts, K.A. Shaffer, P. Haddawy, Construction of a Bayesian network for mammographic diagnosis of breast cancer, Comput. Biol. Med. 27 (1) (1997) 19–29.

[20] A.X. Garg, N.K. Adhikari, H. McDonald, et al., Effects of computerized clinical decision support systems on practitioner performance and patient outcomes: a systematic review, JAMA 293 (10) (2005) 1223–1238. doi:10.1001/jama.293.10.1223.

[21] D. Lobach, et al., Enabling Health Care Decisionmaking Through Clinical Decision Support and Knowledge Management, US Agency for Healthcare Research and Quality, Duke University Evidence-based Practice Center, 2012.

[22] E.S. Burnside, D.L. Rubin, R.D. Shachter, Using a Bayesian network to predict the probability and type of breast cancer represented by microcalcifications on mammography, Stud. Health Technol. Inform. 107 (Pt 1) (2004) 13–17.

[23] M. Fitzgerald, N. Farrow, P. Scicluna, A. Murray, Y. Xiao, C.F. Mackenzie, Challenges to Real-Time Decision Support in Health Care (vol. 2: Culture and Redesign), 2008.

[24] D. Demner-Fushman, W.W. Chapman, C.J. McDonald, What can natural language processing do for clinical decision support?, J. Biomed. Inform. 42 (5) (2009) 760–772. doi:10.1016/j.jbi.2009.08.007.

[25] F. Schiavon, Radiological Reporting in Clinical Practice, Springer, New York, 2007.

[26] P.M. Nadkarni, L. Ohno-Machado, W.W. Chapman, Natural language processing: an introduction, J. Am. Med. Inform. Assoc. 18 (5) (2011) 544–551. doi:10.1136/amiajnl-2011-000464.

[27] E. Pons, L.M. Braun, M.G. Hunink, J.A. Kors, Natural language processing in radiology: a systematic review, Radiology 279 (2) (2016) 329–343. doi:10.1148/radiol.16142770.

[28] W.W. Chapman, M. Fizman, B.E. Chapman, P.J. Haug, A comparison of classification algorithms to automatically identify chest X-ray reports that support pneumonia, J. Biomed. Inform. 34 (1) (2001) 4–14. doi:10.1006/jbin.2001.1000.

[29] W.W. Yim, M. Yetisgen, W.P. Harris, S.W. Kwan, Natural language processing in oncology: a review, JAMA Oncol. 2 (6) (2016) 797–804. doi:10.1001/jamaoncol.2016.0213.

[30] M. Sevenster, J. Bozeman, A. Cowhy, W. Trost, A natural language processing pipeline for pairing measurements uniquely across free-text CT reports, J. Biomed. Inform. 53 (2015) 36–48. doi:10.1016/j.jbi.2014.08.015.

[31] A.D. Pham, A. Neveol, T. Lavergne, et al., Natural language processing of radiology reports for the detection of thromboembolic diseases and clinically relevant incidental findings, BMC Bioinform. 15 (2014) 266. doi:10.1186/1471-2105-15-266.

[32] A. Wilcox, G. Hripcsak, Classification algorithms applied to narrative reports, Proc. AMIA Symp. (1999) 455–459.

[33] A. Wilcox, G. Hripcsak, Medical text representations for inductive learning, Proc. AMIA Symp. (2000) 923–927.

[34] T. Cai, A.A. Giannopoulos, S. Yu, et al., Natural language processing technologies in radiology research and clinical applications, Radiographics 36 (1) (2016) 176–191. doi:10.1148/rg.2016150080.

[35] H. Gao, E.J.A. Bowles, D. Carrell, D.S. Buist, Using natural language processing to extract mammographic findings, J. Biomed. Inform. 54 (2015) 77–84.

[36] A.E. Wieneke, E.J. Bowles, D. Cronkite, et al., Validation of natural language processing to extract breast cancer pathology procedures and results, J. Pathol. Inform. 6 (2015).

[37] Z. Tian, S. Sun, T. Eguale, C.M. Rochefort, Automated extraction of VTE events from narrative radiology reports in electronic health records: a validation study, Med. Care (2015).

[38] S. Bozkurt, J.A. Lipson, U. Senol, D.L. Rubin, Automatic abstraction of imaging observations with their characteristics from mammography reports, J. Am. Med. Inform. Assoc. (2014). doi:10.1136/amiajnl-2014-003009.

[39] L. Liberman, J.H. Menell, Breast imaging reporting and data system (BI-RADS), Radiol. Clin. North Am. 40 (3) (2002) 409–430.

[40] O. Bodenreider, Biomedical ontologies in action: role in knowledge management, data integration and decision support, Yearb. Med. Inform. (2008) 67–79.

[41] H. Harkema, J.N. Dowling, T. Thornblade, W.W. Chapman, ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports, J. Biomed. Inform. 42 (5) (2009) 839–851.

[42] E.S. Burnside, J. Davis, J. Chhatwal, et al., Probabilistic computer model developed from clinical data in national mammography database format to classify mammographic findings, Radiology 251 (3) (2009) 663–672. doi:10.1148/radiol.2513081346.

[43] E.S. Burnside, D.L. Rubin, J.P. Fine, R.D. Shachter, G.A. Sisney, W.K. Leung, Bayesian network to predict breast cancer risk of mammographic microcalcifications and reduce number of benign biopsy results: initial experience, Radiology 240 (3) (2006) 666–673. doi:10.1148/radiol.2403051096.

[44] E.S. Burnside, E.A. Sickles, L.W. Bassett, et al., The ACR BI-RADS experience: learning from history, J. Am. Coll. Radiol. 6 (12) (2009) 851–860. doi:10.1016/j.jacr.2009.07.023.

[45] L.I. Lin, A concordance correlation coefficient to evaluate reproducibility, Biometrics 45 (1) (1989) 255–268.

[46] S. Doan, M. Conway, T.M. Phuong, L. Ohno-Machado, Natural language processing in biomedicine: a unified system architecture overview, Methods Mol. Biol. 1168 (2014) 275–294. doi:10.1007/978-1-4939-0847-9_16.

[47] Y. Ni, J. Wright, J. Perentesis, et al., Increasing the efficiency of trial-patient matching: automated clinical trial eligibility pre-screening for pediatric oncology patients, BMC Med. Inform. Decis. Mak. 15 (2015) 28. doi:10.1186/s12911-015-0149-3.

[48] S. Doan, C.K. Maehara, J.D. Chaparro, et al., Building a natural language processing tool to identify patients with high clinical suspicion for Kawasaki disease from emergency department notes, Acad. Emerg. Med. 23 (5) (2016) 628–636. doi:10.1111/acem.12925.

[49] M. Fiszman, W.W. Chapman, D. Aronsky, R.S. Evans, P.J. Haug, Automatic detection of acute bacterial pneumonia from chest X-ray reports, J. Am. Med. Inform. Assoc. 7 (6) (2000) 593–604.

[50] K.B. Wagholikar, K.L. MacLaughlin, M.R. Henry, et al., Clinical decision support with automated text processing for cervical cancer screening, J. Am. Med. Inform. Assoc. 19 (5) (2012) 833–839. doi:10.1136/amiajnl-2012-000820.

[51] K. Wagholikar, S. Sohn, S. Wu, et al., Workflow-based data reconciliation for clinical decision support: case of colorectal cancer screening and surveillance, AMIA Jt. Summits Transl. Sci. Proc. 2013 (2013) 269–273.

[52] R.S. Evans, J. Benuzillo, B.D. Horne, et al., Automated identification and predictive tools to help identify high-risk heart failure patients: pilot evaluation, J. Am. Med. Inform. Assoc. (2016). doi:10.1093/jamia/ocv197.

[53] R.A. Greenes, Clinical Decision Support: The Road to Broad Adoption, Elsevier Science, 2014.

[54] A.R. Aronson, F.M. Lang, An overview of MetaMap: historical perspective and recent advances, J. Am. Med. Inform. Assoc. 17 (3) (2010) 229–236. doi:10.1136/jamia.2009.002733.

[55] G.K. Savova, J.J. Masanz, P.V. Ogren, et al., Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications, J. Am. Med. Inform. Assoc. 17 (5) (2010) 507–513. doi:10.1136/jamia.2009.001560.

[56] Q.T. Zeng, S. Goryachev, S. Weiss, M. Sordo, S.N. Murphy, R. Lazarus, Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system, BMC Med. Inform. Decis. Mak. 6 (2006) 30. doi:10.1186/1472-6947-6-30.

[57] R.S. Crowley, M. Castine, K. Mitchell, G. Chavan, T. McSherry, M. Feldman, caTIES: a grid based system for coding and retrieval of surgical pathology reports and tissue specimens in support of translational research, J. Am. Med. Inform. Assoc. 17 (3) (2010) 253–264. doi:10.1136/jamia.2009.002295.

[58] E.A. Mendonca, J. Haas, L. Shagina, E. Larson, C. Friedman, Extracting information on pneumonia in infants using natural language processing of radiology reports, J. Biomed. Inform. 38 (4) (2005) 314–321. doi:10.1016/j.jbi.2005.02.003.

[59] D.L. Mowery, B.E. Chapman, M. Conway, et al., Extracting a stroke phenotype risk factor from Veteran Health Administration clinical reports: an information content analysis, J. Biomed. Semantics 7 (2016) 26. doi:10.1186/s13326-016-0065-1.

[60] D. Dligach, S. Bethard, L. Becker, T. Miller, G.K. Savova, Discovering body site and severity modifiers in clinical texts, J. Am. Med. Inform. Assoc. 21 (3) (2014) 448–454. doi:10.1136/amiajnl-2013-001766.

[61] S.M. Meystre, G.K. Savova, K.C. Kipper-Schuler, J.F. Hurdle, Extracting information from textual documents in the electronic health record: a review of recent research, Yearb. Med. Inform. (2008) 128–144.

[62] N.L. Jain, C. Friedman, Identification of findings suspicious for breast cancer based on natural language processing of mammogram reports, Proc. AMIA Annu. Fall Symp. (1997) 829–833.

[63] C. Friedman, A broad-coverage natural language processing system, Proc. AMIA Symp. (2000) 270–274.

[64] E.R. Burnside, D. Rubin, H. Strasberg, Automated indexing of mammography reports using linear least squares fit, in: CARS 2000: 14th Computer Assisted Radiology and Surgery, International Congress Series, Excerpta Medica/Elsevier Science, Amsterdam, 2000.

[65] B. Percha, H. Nassif, J. Lipson, E. Burnside, D. Rubin, Automatic classification of mammography reports by BI-RADS breast tissue composition class, J. Am. Med. Inform. Assoc. 19 (5) (2012) 913–916. doi:10.1136/amiajnl-2011-000607.

[66] H. Nassif, R. Woods, E. Burnside, M. Ayvaci, J. Shavlik, D. Page, Information extraction for clinical data mining: a mammography case study, in: Proc. IEEE Int. Conf. Data Min., 2009, pp. 37–42.

[67] H. Nassif, F. Cunha, I.C. Moreira, et al., Extracting BI-RADS features from Portuguese clinical texts, in: Proc. IEEE Int. Conf. Bioinformatics Biomed., 2012, pp. 1–4.

[68] D.A. Sippo, G.I. Warden, K.P. Andriole, et al., Automated extraction of BI-RADS final assessment categories from radiology reports with natural language processing, J. Digit. Imaging 26 (5) (2013) 989–994. doi:10.1007/s10278-013-9616-5.

[69] M. Sevenster, R. van Ommering, Y. Qian, Automatically correlating clinical findings and body locations in radiology reports using MedLEE, J. Digit. Imaging 25 (2) (2012) 240–249. doi:10.1007/s10278-011-9411-0.