An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain
Wendy W. Chapman, PhD
Division of Biomedical Informatics
University of California, San Diego
Integrating Data for Analysis, Anonymization, and Sharing
Overview
• The promise of natural language processing (NLP)
• Challenges of developing NLP in the clinical domain
• Challenges in applying NLP in the clinical domain
• iDASH
• Opportunities for sharing and collaboration in NLP
NLP Success
“Fresh off its butt-kicking performance on Jeopardy!, IBM’s supercomputer "Watson" has enrolled in medical school at Columbia University.” New York Daily News, February 18, 2011
“IBM's computer could very well herald a whole new era in medicine.” ComputerWorld, February 17, 2011
Dr. Watson??
Clinical NLP Since the 1960s
Why has clinical NLP had little impact on clinical care?
Barriers to Development
• Sharing clinical data is difficult
– Have not had shared datasets for development and evaluation
– Modules trained on general English are not sufficient
• Insufficient common conventions and standards for annotations
– Data sets are unique to a lab
– Not easily interchangeable
• Limited collaboration
– Clinical NLP applications are silos and black boxes
– Have not had open source applications
• Reproducibility is formidable
– Open source release not always sufficient
– Software engineering quality not always great
– Mechanisms for reproducing results are sparse
Overview
• The promise of natural language processing (NLP)
• Challenges of developing NLP in the clinical domain
• Challenges in applying NLP in the clinical domain
• Developing an NLP ecosystem on iDASH
Security & Privacy Concerns
• Clinical texts have many patient identifiers
– 18 HIPAA identifiers
• Names
• Addresses
• Items not regulated by HIPAA
– tight end for the Steelers
• Unique cases
– Woman in her 50s who is pregnant
• Sensitive information
– HIV status
Institutions are reluctant to share data
Lack of user-centered development and scalability
– Perceived cost of applying NLP outweighs the perceived benefit (Len D’Avolio)
iDASH
• Integrating Data
• Analysis
• Anonymization
• Sharing
Data
Computational Resources
Software/Tools
Disincentives to Share
• ‘Scooping’ by faster analysts
• Exposure of potential errors in data
• Resources for preparing data submissions
• Maintaining data
• Interacting with potential users takes time
• Threat of privacy breach when human subjects are involved
– Do not have policies in place
– Fallible de-identification, anonymization algorithms
iDASH aims to minimize these disincentives
nlp-ecosystem.ucsd.edu
Privacy preserving
• Access control
• De-identification
• Query counts
• Artificial data generators
Digital informed consent
HIPAA &/or FISMA Compliant Cloud
Customizable DUAs
Informed Consent Registry
2011 summer internship program funded by NIH U54HL108460
NLP Ecosystem
[Diagram of ecosystem components. Labels: Data, MT Samples, UCSD Clinical Data, Tools & Services, Collaborative Development Tools, Virtual Machines, Evaluation Workbench, De-Identification, TextVect, Annotation Admin & eHOST, Registry, Education, Bibliography, Tutorials, Research Resources, Guidelines, Schemas]
Collaborative Effort to Build Ecosystem
[Diagram. Labels: Tools & Services, Registry, De-Identification, TextVect, Collaborative Knowledge Authoring, Virtual Machines, Evaluation Workbench, Annotation Environment. Goals: Increase access to NLP; Decrease burden of developing NLP]
orbit
Increase ability to find NLP tools
Registry: orbit.nlm.nih.gov
Len D’Avolio, Dina Demner-Fushman
De-identification service
Increase access to clinical text
De-identification
• Several available de-identification modules
• Need to adapt to local text
– Efficient
– Secure
• Customizable ensemble de-identification system
– Build a de-identified corpus
– Incorporate existing de-id modules
– Launch as virtual machine
– Iterative training, evaluation, and modification by user
• Correct mistakes
• Add regular expressions
Brett South, Stephane Meystre, Oscar Fernandez, Danielle Mowery
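The "Add regular expressions" step could look roughly like the sketch below. It is not the iDASH de-identification system: the pattern names, the patterns themselves, and the `deidentify` helper are assumptions made for illustration, and real clinical text needs far broader coverage than three patterns.

```python
import re

# Hypothetical user-supplied patterns: placeholder tag -> regular expression.
USER_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "MRN": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE),
}

def deidentify(text, patterns=USER_PATTERNS):
    """Replace every match of each pattern with a bracketed tag."""
    for tag, pattern in patterns.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

note = "Seen on 3/14/2011, MRN: 00123456. Call 619-555-0100."
print(deidentify(note))
```

In an iterative workflow, a user who spots a missed identifier would add one more entry to the pattern table and re-run the evaluation.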
TextVect
Increase access to textual features
TextVect
NLM: Abhishek Kumar
Collaborative Knowledge Authoring Support Service (cKASS)
Decrease the Burden of Customizing an NLP Application
Customizing an IE App
User’s Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy
IE Output
Map
Customizing an IE App
User’s Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy
IE Output: Dry cough, Productive cough, Cough, Hacking cough, Bloody cough
Which concepts?
Customizing an IE App
User’s Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy
IE Output: Temp 38.0C, Low-grade temperature
What is a fever?
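One way to make the "What is a fever?" decision explicit is to encode it as a rule the user can inspect and change. The sketch below is illustrative only: the 38.0 C cutoff, the phrase list, and the `maps_to_fever` helper are assumptions, since the slide deliberately leaves the definition open.

```python
import re

FEVER_THRESHOLD_C = 38.0  # assumed cutoff; each user may choose differently
FEVER_PHRASES = {"fever", "febrile", "low-grade temperature"}  # assumed synonyms

# Matches strings such as "Temp 38.0C" or "temperature 37.2 C".
TEMP_PATTERN = re.compile(r"temp(?:erature)?\s*(\d+(?:\.\d+)?)\s*c\b", re.IGNORECASE)

def maps_to_fever(ie_output):
    """Return True if an extracted string should count as the user's Fever concept."""
    text = ie_output.strip().lower()
    if text in FEVER_PHRASES:
        return True
    match = TEMP_PATTERN.search(text)
    if match:
        return float(match.group(1)) >= FEVER_THRESHOLD_C
    return False

for item in ["Temp 38.0C", "Low-grade temperature", "Temp 37.2C"]:
    print(item, maps_to_fever(item))
```

Whether "Low-grade temperature" belongs in the phrase list is exactly the kind of choice a user would need to make for their own application.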
Customizing an IE App
User’s Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy
IE Output
NECK: no adenopathy
Disorder: adenopathy
Negation: negated
Section mapping
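The "NECK: no adenopathy" example combines three steps: reading the section header, finding the disorder, and checking for negation. The sketch below shows one simple way those steps could fit together. The trigger list, the three-token look-back window, the `SECTION_CONCEPTS` table, and the `extract` function are all assumptions for illustration; the IE application on the slide may work quite differently.

```python
import re

NEGATION_TRIGGERS = ("no", "denies", "without")  # assumed, far from complete
# Assumed mapping from (section, disorder) to one of the user's concepts.
SECTION_CONCEPTS = {("NECK", "adenopathy"): "Cervical Lymphadenopathy"}
SECTION_PATTERN = re.compile(r"^\s*([A-Z][A-Z ]*):\s*(.*)$")

def extract(line, disorders=("adenopathy",)):
    """Return one record per disorder mention found in a 'SECTION: text' line."""
    match = SECTION_PATTERN.match(line)
    section, body = (match.group(1).strip(), match.group(2)) if match else (None, line)
    tokens = re.findall(r"[a-z]+", body.lower())
    records = []
    for i, token in enumerate(tokens):
        if token in disorders:
            # Look back up to three tokens for a negation trigger.
            window = tokens[max(0, i - 3):i]
            negated = any(t in NEGATION_TRIGGERS for t in window)
            records.append({
                "section": section,
                "disorder": token,
                "negated": negated,
                "concept": SECTION_CONCEPTS.get((section, token)),
            })
    return records

print(extract("NECK: no adenopathy"))
```

The section is what turns a generic "adenopathy" into the user's "Cervical Lymphadenopathy"; without it the mention could not be mapped.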
KOS-IE: Knowledge Organization Systems for Information Extraction
Compile information helpful for IE
[Diagram. Users: Physician, Radiologist, Nurse, Clinical Researcher, Knowledge Engineer. Components: User KB, Shared KB, External KB, NLP Tools, Decision Support System]
Collaborative Knowledge Base Development: cKASS
LQ Wang, M Conway, F Fana, M Tharp, D Hillert
Knowledge Authoring
Augment user KB with lexical variants, synonyms, and related concepts
• User-driven authoring
– Top-down: Provide access to external knowledge sources
• UMLS, Specialist Lexicon, Bioportal
– Bottom-up: Annotate to derive synonyms
• Recommendation-based authoring
– Generate lexical variants
– Mine external knowledge sources
– Mine patient records
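As a rough illustration of "Generate lexical variants", the sketch below applies a few naive surface rules (case folding, hyphen/space alternation, simple pluralisation). The `lexical_variants` function and its rules are assumptions, not the cKASS method; a production system would more likely draw on a lexical resource than on hand-written rules.

```python
def lexical_variants(term):
    """Generate simple surface variants of a term (naive rules only)."""
    base = term.strip().lower()
    variants = {base}
    # Hyphen/space alternation, e.g. "low-grade" <-> "low grade".
    variants.add(base.replace("-", " "))
    variants.add(base.replace(" ", "-"))
    # Naive pluralisation of the final word of each variant.
    for v in list(variants):
        if v.endswith(("s", "x", "ch", "sh")):
            variants.add(v + "es")
        elif v.endswith("y") and len(v) > 1 and v[-2] not in "aeiou":
            variants.add(v[:-1] + "ies")
        else:
            variants.add(v + "s")
    return sorted(variants)

print(lexical_variants("Dry cough"))
```

Generated variants would be shown to the user as recommendations to accept or reject, not added to the knowledge base automatically.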
Evaluation workbench
Decrease the Burden of Evaluation & Error Analysis
Evaluation Workbench
• Compare the output of two NLP annotators on clinical text
• NLP system vs human annotation
• View annotations
• Calculate outcome measures
• Drill down to all levels of annotation
• Document-level
• Perform error analysis
• Future versions will support formal error analysis
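Comparing an NLP system against human annotation typically comes down to counting matches and computing outcome measures. The sketch below computes precision, recall, and F1 under two assumptions that are not taken from the slides: annotations are `(start, end, label)` tuples, and only exact matches count. The `outcome_measures` function is hypothetical and is not the Evaluation Workbench's API.

```python
def outcome_measures(reference, candidate):
    """Precision, recall, and F1 of candidate annotations against a reference."""
    ref, cand = set(reference), set(candidate)
    true_pos = len(ref & cand)  # annotations both sides agree on exactly
    precision = true_pos / len(cand) if cand else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up example annotations: (start offset, end offset, label).
human = [(0, 5, "Cough"), (10, 15, "Fever"), (20, 30, "Dyspnea")]
system = [(0, 5, "Cough"), (10, 15, "Fever"), (40, 48, "Wheezing"), (50, 55, "Fever")]
print(outcome_measures(human, system))
```

The annotations outside the intersection (false positives and false negatives) are the natural starting point for error analysis.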
Levels of Annotation
• Document – Report classified as Shigellosis
• Group – Section classified as Past Medical History Section
• Utterance – Group of text classified as Sentence
• Snippet – “chest pain” classified as CUI 058273
• Word – “pain” classified as noun
• Token – “.” classified as EOS marker
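The six levels can be represented with one record type that carries its level, character span, and classification. The sketch below is one possible data structure; the `Level` and `Annotation` names, the offset convention, and the example sentence are assumptions for illustration and are not the Evaluation Workbench's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    DOCUMENT = "document"
    GROUP = "group"
    UTTERANCE = "utterance"
    SNIPPET = "snippet"
    WORD = "word"
    TOKEN = "token"

@dataclass(frozen=True)
class Annotation:
    level: Level
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    classification: str

def at_level(annotations, level):
    """Return the annotations recorded at one level, for drill-down views."""
    return [a for a in annotations if a.level is level]

text = "Patient reports chest pain."
annotations = [
    Annotation(Level.UTTERANCE, 0, 27, "Sentence"),
    Annotation(Level.SNIPPET, 16, 26, "CUI 058273"),
    Annotation(Level.WORD, 22, 26, "noun"),
    Annotation(Level.TOKEN, 26, 27, "EOS marker"),
]
for a in at_level(annotations, Level.SNIPPET):
    print(text[a.start:a.end], "->", a.classification)
```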
[Screenshot of the Evaluation Workbench. Panels: Report List; Document & annotations; Select Classifications to View; Outcome Measures for Selected Annotations; Attributes for Selected Annotation; Relationships for Selected Annotation]
VA and ONC SHARP: Christensen, Murphy, Frabetti, Rodriguez, Savova
Annotation Environment
Decrease the Burden of Annotation
Challenges to Annotating
• Time consuming
– Recruiting & training annotators for high agreement
• Expensive
– Domain experts especially expensive
– Need for annotation by multiple people
• Challenging to design annotation task
– How many annotators?
– How should I quantify quality of annotations?
• Logistically challenging
– Managing files and batches of reports
– Setting up annotation tool
• Reinventing the wheel
– Hasn’t someone created a schema for this before?
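One common answer to "How should I quantify quality of annotations?" is chance-corrected agreement between two annotators. The sketch below computes Cohen's kappa for two annotators who labeled the same items; the slides do not name a specific measure, so the choice of kappa, the `cohens_kappa` function, and the example labels are assumptions for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for ten reports.
annotator_1 = ["present"] * 5 + ["absent"] * 5
annotator_2 = ["present"] * 4 + ["absent"] * 5 + ["present"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Here the annotators agree on 8 of 10 items and chance agreement is 0.5, so kappa is (0.8 - 0.5) / (1 - 0.5) = 0.6.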
How can we reduce the burden of annotation?
iDASH Annotation Environment
Annotation Admin: Web application, iDASH cloud
eHOST: Client app on your computer
VA, SHARP, and NIGMS : S Duvall, B South, G Savova, N Elhadad, H Hochheiser
Goal: provide an environment to decrease the burden of annotation for research and application
Annotator Registry
• Enlist for annotation
• Certify for annotation tasks
– Personal health information
– Part-of-speech tagging
– UMLS mapping
• Set pay rate
• Searchable
• Available for inclusion in new annotation task
http://idash.ucsd.edu/nlp-annotator-registry
Annotation Admin: Intended Users & Uses
Users
• NLP researchers
• Annotation administrators
Uses
• Manage annotation projects – who annotates what
– Currently done with hundreds of files on hard drive
• Integrate with annotation tool (eHOST)
– Download batches of raw reports to annotators
– Upload and store annotated reports
• Manage simple annotation projects
• Facilitate distributed annotation
Annotation Admin
1. Assign annotators to a task
2. Create a Schema
3. Assign users and set time expectations
4. Keep track of progress
Collaborative Effort to Build Resources
[Diagram. Labels: Tools & Services, Registry, De-Identification, TextVect, Collaborative Knowledge Authoring, Virtual Machines, Evaluation Workbench, Annotation Environment. Goals: Increase access to NLP; Decrease burden of developing NLP]
Conclusion
• More demand for EHR data
– NLP has potential to extend value of narrative clinical reports
• There have been many barriers
– To development
– To deployment
• Recent developments facilitate collaboration & sharing
– Common annotation conventions
– Privacy algorithms
– Shared datasets
– Hosted environments
• iDASH hopes to facilitate
– Development of NLP
– Application of NLP
Questions | Discussion
iDASH/ShARe Workshop on Annotation
September 29, 2012
La Jolla, CA