BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and...
![Page 1: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/1.jpg)
BioNLP, Information Extraction from Radiology Reports
Emilia Apostolova
College of Computing and Digital Media
DePaul University
![Page 2: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/2.jpg)
BioNLP – conferences and shared tasks
• Pacific Symposium on Biocomputing
• Intelligent Systems for Molecular Biology
• Association for Computational Linguistics
• North American Chapter of the Association for Computational Linguistics
• BioNLP
• BioCreative
• TREC Genomics
• IClef
![Page 3: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/3.jpg)
Information Extraction (in BioMedicine)
The NLP Pipeline
• Lexical Analysis – tokenization, morphological analysis, linguistic lexicons.
• Syntactic Analysis – Part of Speech Tagging, Chunking, Parsing.
• Semantic Analysis – Lexical Semantic Interpretation, Semantic Interpretation of Utterances.
![Page 4: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/4.jpg)
NLP Pipeline Frameworks
• GATE - General Architecture for Text Engineering.
• Apache UIMA - Unstructured Information Management Architecture.
• Geneways - a system for automatically extracting, analyzing, visualizing and integrating molecular pathway data from the research literature.
• PASTA - Protein Structures and Information Extraction from Biological Texts.
![Page 5: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/5.jpg)
Lexical Analysis - Tokenization
Segmenting text into linguistic tokens – words and sentences.
• Abbreviations - The study was conducted within the U.S.
• Apostrophes - IL-10's cytokine synthesis inhibitory activity
• Hyphenation - co-operate, cooperate
• Multiple formats: 464,285.23 and 464285.23
• Sentence boundary detection - :, ;, -
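A minimal sketch of how a regular-expression tokenizer might handle the cases listed above. The pattern and the `tokenize` name are illustrative assumptions, not part of the original slides.

```python
import re

# Illustrative token pattern (an assumption, not from the slides).
# Alternatives are tried in order, so the more specific ones come first.
TOKEN_RE = re.compile(r"""
    (?:[A-Za-z]\.){2,}                  # dotted abbreviations such as U.S.
  | \d{1,3}(?:,\d{3})+(?:\.\d+)?        # numbers with thousands separators
  | \d+(?:\.\d+)?                       # plain integers and decimals
  | [A-Za-z0-9]+(?:-[A-Za-z0-9]+)*      # words, keeping internal hyphens
  | 's                                  # possessive clitic as its own token
  | \S                                  # any other non-space character
""", re.VERBOSE)

def tokenize(text):
    """Split text into tokens using the illustrative pattern above."""
    return TOKEN_RE.findall(text)

print(tokenize("IL-10's cytokine synthesis inhibitory activity"))
print(tokenize("464,285.23 and 464285.23"))
```

Note that the abbreviation rule absorbs the final period of "U.S." even when it also ends the sentence, which is exactly the ambiguity that makes sentence boundary detection hard.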
![Page 6: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/6.jpg)
Lexical Analysis – Morphological analysis
Link surface variants of a lexical element to its canonical base form. E.g. inflections (activat-es, activat-ed, activat-ing), derivations (activation).
Porter stemmer – lexicon-free approach. Finds the longest match of a word against a list of English derivational and inflectional suffixes.
Two-level morphology – a finite state based approach that applies a series of parallel transducers to input tokens. (fly -> flies)
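The lexicon-free idea can be illustrated with a toy longest-suffix stripper. This is only a sketch: the suffix list and the `strip_longest_suffix` name are assumptions, and the actual Porter stemmer applies several ordered rule steps with conditions on the remaining stem.

```python
# Toy suffix list (an assumption for illustration only).
SUFFIXES = ["ing", "ion", "es", "ed", "e", "s"]

def strip_longest_suffix(word, min_stem=3):
    """Remove the longest listed suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["activates", "activated", "activating", "activation"]:
    print(w, "->", strip_longest_suffix(w))
```

With this list, the four surface variants from the slide all reduce to the same stem, without consulting any lexicon.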
![Page 7: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/7.jpg)
Syntactic Level – Part of Speech Tagging
activation – POS noun, singular
activate – POS verb, non-3rd person singular present
active – POS adjective
report?
![Page 8: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/8.jpg)
Syntactic Level - Parsing
A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb.
The Stanford Dependency Parser - a Java implementation of probabilistic natural language parsers, trained on the Penn Treebank.
![Page 9: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/9.jpg)
Semantic Level – Lexical Interpretation
• Selectional Restrictions:
transitive verbs: inhibit [something], transcribe [something]
semantic restrictions: inhibit [Process], transcribe [Nucleic Acid]
Syntactically admissible, but semantically invalid:
to inhibit amino acids
to transcribe cell growth
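A selectional restriction check can be sketched as a lookup of the semantic type a verb requires of its object. Both dictionaries below are hypothetical mini-lexicons built from the examples above; a real system would likely draw the types from a resource such as a biomedical ontology.

```python
# Hypothetical mini-lexicon (assumption): the semantic type each verb
# requires of its direct object, following the examples above.
VERB_OBJECT_TYPE = {"inhibit": "Process", "transcribe": "Nucleic Acid"}

# Hypothetical semantic types for a few noun phrases (assumption).
SEMANTIC_TYPE = {
    "cell growth": "Process",
    "amino acids": "Amino Acid",
    "DNA": "Nucleic Acid",
}

def satisfies_restriction(verb, obj):
    """True if the object's semantic type is the one the verb selects for."""
    required = VERB_OBJECT_TYPE.get(verb)
    return required is not None and SEMANTIC_TYPE.get(obj) == required

print(satisfies_restriction("inhibit", "cell growth"))
print(satisfies_restriction("transcribe", "cell growth"))
```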
![Page 10: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/10.jpg)
Discourse Level - Pragmatics
• Discourse referents: what entities does a given message refer to?
• What background knowledge is needed to understand a given message?
• How do the beliefs of speaker and hearer interact in the interpretation of a message?
• What is a relevant answer to a given question?
• Summarization, Translation, Dialog Systems, Natural Language Generation.
![Page 11: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/11.jpg)
Lexical resources for (Bio)NLP
• Princeton WordNet
• NLM UMLS Lexicon and Metathesaurus.
• The Open Biomedical Ontologies
![Page 12: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/12.jpg)
Text and Image Integration
![Page 13: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/13.jpg)
Automatic Image Annotation
![Page 14: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/14.jpg)
Automatic Image Annotation
Where? Woman (Population Group), Right breast (Body Part, Organ, or Organ Component)
How? Mammography (Diagnostic Procedure)
What? Calcification (Pathologic Function), Lesion (Finding), Carcinoma, Papillary (Neoplastic Process)
![Page 15: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/15.jpg)
IE from Clinical Texts – Radiology and Pathology Reports
Northwestern University Medical School
Department of Radiology
Imaging Informatics
![Page 16: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/16.jpg)
Radiology Reports
![Page 17: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/17.jpg)
Sample Radiology Report
Patient Name: XXXXXXX, XXXXX
Medical Record Number: XXXXXXXXXX DOB: XXXX.XX.XX Sex: F
Accession Number: XXXXXXXX
Study Requested: DIG MAMMOGRAM SCREENING (3300000)
Scheduled Date and Time: XXXX.XX.XX 13:02:00.0000
Requesting Physician: XXXXXXX,
Reason for Exam: V76.12
----------------------------Radiological Report---------------------------------
Comparison is made to previous exams dated XX/XX/XX.
CLINICAL HISTORY: Seventy-two year old woman for screening exam. Patient has a family history of breast cancer, sister age sixty years old. Patient has a history of a previous left breast benign biopsy.
TECHNIQUE: Mammograms were obtained using digital technique.
FINDINGS: There is dense fibroglandular tissue bilaterally. No dominant masses or clustered microcalcifications suggestive of malignancy are seen.
1. NO SPECIFIC FEATURES OF MALIGNANCY SEEN EITHER BREAST.
2. NO SIGNIFICANT CHANGE WHEN COMPARED WITH PRIOR STUDIES.
3. ANNUAL SCREENING MAMMOGRAM IS RECOMMENDED.
CODE (1): NEGATIVE
Attending Radiologist: XXXXXXX, MD
Date Signed off: XXXXXX, Transc. by: NS
![Page 18: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/18.jpg)
NLP for Clinical Texts
• Document retrieval – case finding.
• Subject recruitment – identify patients that can benefit from a study.
• Surveillance – monitoring disease outbreaks.
• Discovery of disease-drug associations.
• Discovery of disease-finding associations.
![Page 19: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/19.jpg)
IE from Radiology Reports
Automatic Section Segmentation
Demographics
History
Comparison
Technique
Findings
Impression
Recommendation
Sign off
![Page 20: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/20.jpg)
Dataset
215,000 free-text radiology reports selected randomly from 3 million reports over a period of 9 years and representing 24 different types of diagnostic procedures.
![Page 21: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/21.jpg)
Method – Training Set
• Hand-crafted rules for automatic extraction of a training set. Common boundary patterns: e.g., for the Findings section, the text between a known Findings header and the next known section header:
^ (finding | observation | discussion)s?:
^ (\W*)(finding | observation | discussion)s?(\W*)$
• 3,000 automatically segmented “high-confidence” radiology reports, containing all 8 sections of interest.
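The header patterns above can be tried out with Python's `re` module. This is a sketch under assumptions: the display spacing inside the alternation is removed, matching is taken to be case-insensitive and per line, and `find_findings_header` is a hypothetical helper name.

```python
import re

# The two header patterns from the slide, with the display spacing removed.
# Case-insensitive, per-line matching is an assumption.
FLAGS = re.IGNORECASE | re.MULTILINE
HEADER_WITH_COLON = re.compile(r"^(finding|observation|discussion)s?:", FLAGS)
HEADER_ON_OWN_LINE = re.compile(
    r"^(\W*)(finding|observation|discussion)s?(\W*)$", FLAGS
)

def find_findings_header(report_text):
    """Return the (start, end) span of the first Findings-style header, or None."""
    for pattern in (HEADER_WITH_COLON, HEADER_ON_OWN_LINE):
        match = pattern.search(report_text)
        if match:
            return match.span()
    return None

report = "TECHNIQUE: Digital mammograms.\nFINDINGS: Dense fibroglandular tissue."
start, end = find_findings_header(report)
print(report[start:end])
```

A report would count as "high-confidence" only if headers for all sections of interest are found this way; the section text is then whatever lies between one matched header and the next.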
![Page 22: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/22.jpg)
Method
• Classification task - each sentence from a radiology report is assigned to one of 8 pre-defined report sections.
![Page 23: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/23.jpg)
• Sentence features used for training a classifier:
Sentence Orthography – Possible orthographic types are All Capitals, Mixed Case, or presence of a Header pattern, such as a phrase at the beginning of a line followed by a colon.
Previous Sentence Boundary – Formatting boundary separating the current and previous text sentences. Possible values are white space containing new lines, white space without new lines, non-alphabetic characters, or the beginning of the file.
Following Sentence Boundary – Formatting boundary separating the current and next text sentences. Possible values are white space containing new lines, white space without new lines, non-alphabetic characters, or the end of the file.
Cosine Vector Distance – Distance from the current sentence to each of the eight sections' word vectors.
Exact Header Match – This feature specifies if the sentence contains a header identified as belonging to one of the sections in the training data.
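Two of these features, sentence orthography and cosine vector distance, can be sketched in a few lines. The header regular expression, the bag-of-words representation, and all function names are assumptions made for illustration; the slides do not specify how the word vectors were built.

```python
import math
import re
from collections import Counter

def orthography(sentence):
    """Classify a sentence as 'Header', 'All Capitals', or 'Mixed Case'."""
    # Header pattern (assumption): a short phrase at line start, then a colon.
    if re.match(r"^[A-Za-z][A-Za-z ]{0,40}:", sentence):
        return "Header"
    letters = [c for c in sentence if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return "All Capitals"
    return "Mixed Case"

def word_vector(text):
    """Bag-of-words term counts (a simple stand-in for a section word vector)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return 1.0 if norm == 0 else 1.0 - dot / norm

# Tiny made-up section vectors, just to show the shape of the feature.
sections = {
    "Technique": word_vector("mammograms were obtained using digital technique"),
    "Findings": word_vector("dense fibroglandular tissue no dominant masses seen"),
}
sentence = word_vector("There is dense fibroglandular tissue bilaterally.")
print({name: round(cosine_distance(sentence, vec), 3) for name, vec in sections.items()})
```

In the setting described above there would be one such distance per section, giving eight numeric features per sentence alongside the categorical ones.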
![Page 24: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/24.jpg)
Work in Progress
• Identify named entities within sections using a controlled vocabulary – findings, diseases, observations, anatomical organs, imaging modalities.
• Negation Discovery.
• Identify relationships between named entities of interest, for example what observations are associated with a diagnosis.
• Use radiology report text to support automatic annotation of medical images.
![Page 25: BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University.](https://reader030.fdocuments.net/reader030/viewer/2022033105/56649e165503460f94b01636/html5/thumbnails/25.jpg)
Q/A