Natural Language Processing of the
clinical narrative in French to
support Public Health
Aurélie Névéol, CR1 CNRS
Natural Language Processing
is needed to support
Epidemiology and Public Health
…how?
Benefit of CT venography in the diagnosis of
pulmonary embolism and thromboembolic disease?
Prevalence of Incidental Findings?
NLP can produce supporting evidence
to address public health issues
NLP can create biomedical knowledge
from multiple and heterogeneous documents
Document sources: Electronic Health Records, Biological Data Repositories, Publications, Protocols, Social Media
Example statements: “Large thrombus burden in the proximal LAD artery”; “Obstruction totale des artères segmentaires” (total obstruction of the segmental arteries)
CABeRneT: Automatic Understanding of
Biomedical Text for Translational Research
http://cabernet.limsi.fr
Sources: publications and patient records
Example text: “Thrombose de la veine ovarienne droite sur 1.5cm de haut” (thrombosis of the right ovarian vein, 1.5 cm in height)
NLP analysis produces a structured representation: C1267486 (Entire Right Ovarian Vein) LOCATION OF C0040053 (Thrombosis)
Outputs: links, retrospective analysis, new therapeutic insight
In practice…
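Je vois ce jour en consultation Monsieur Marc DURAND, né le 18/05/1954 avec les résultats d’un angioscanner thoracique et phleboscanner que j’avais demandés le 24 novembre 2004 pour suspicion d’embolie pulmonaire. L’examen n’objective pas d’EP, ni de TVP des membres inférieurs. On observe cependant un processus ganglio-tumoral hilaire gauche avec épanchement pleural bilatéral et colapsus pulmonaire associés.
(English: Today I am seeing Mr. Marc DURAND, born 18/05/1954, in consultation, with the results of a thoracic CT angiography and CT venography that I had ordered on 24 November 2004 for suspected pulmonary embolism. The examination shows no PE and no DVT of the lower limbs. However, there is a left hilar nodal-tumoral process with associated bilateral pleural effusion and pulmonary collapse.)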
Personal Health Identifiers
De-identification is needed to process health data.
Deidentification method [Grouin and Névéol 2014; Grouin et al. 2014]
Study of reidentification risks [Grouin et al. 2015]
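A minimal sketch of what masking personal health identifiers can look like on the sample letter. It assumes a purely rule-based approach with hypothetical regular expressions (`NUMERIC_DATE`, `TEXT_DATE`, `PERSON`) and is not the method of the cited papers; real systems cover many more identifier types and surface forms.

```python
import re

# Hypothetical patterns, for illustration only.
FRENCH_MONTHS = (
    "janvier|février|mars|avril|mai|juin|juillet|août|"
    "septembre|octobre|novembre|décembre"
)
# Dates such as 18/05/1954
NUMERIC_DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
# Dates such as "24 novembre 2004" or "1er mars 2010"
TEXT_DATE = re.compile(
    rf"\b\d{{1,2}}(?:er)? (?:{FRENCH_MONTHS}) \d{{4}}\b", re.IGNORECASE
)
# A title followed by a capitalised given name and an upper-case surname
PERSON = re.compile(
    r"\b(Monsieur|Madame|M\.|Mme)\s+[A-ZÀ-Ý][\w-]+\s+[A-ZÀ-Ý-]{2,}\b"
)


def deidentify(text: str) -> str:
    """Replace dates and titled person names with placeholder tags."""
    text = NUMERIC_DATE.sub("<DATE>", text)
    text = TEXT_DATE.sub("<DATE>", text)
    # Keep the title, mask the name that follows it.
    text = PERSON.sub(lambda m: f"{m.group(1)} <NAME>", text)
    return text


SAMPLE = (
    "Je vois ce jour en consultation Monsieur Marc DURAND, né le 18/05/1954 "
    "avec les résultats d’un angioscanner demandés le 24 novembre 2004."
)
MASKED = deidentify(SAMPLE)
```

Rules like these are easy to audit but brittle; a name written without a title, for instance, would pass through unmasked, which is one reason reidentification risk has to be studied separately.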
Entities and Relations
Representation Schema
18 entity types
37 relations
[Deléger et al. 2014; Deléger, Campillos et al. 2017]
Assertions, hedges
4 modalities
8 aspects
Abbreviations
…Annotated Corpus [Campillos, Deléger et al. 2016]
Normalization
Entity Linking with the UMLS
3 million concepts
170,000 with French terms
CUIs shown on the slide: C0023216, C0149871, C0034065, C0582103
A multidisciplinary endeavour
Computer Science:
• Knowledge Representation
• Natural Language Processing
• Methods
• Annotated corpus, tools
Medicine, Public Health, Epidemiology:
• Retrospective analysis of patient records
• Biomedical Information retrieval
Challenges of clinical NLP
• Data processing and analysis
– Data confidentiality: deidentification
– Legal and technical access to data
• Accommodating the complexity of biomedical language
– Great language variety
– Several « sublanguages » according to Zellig Harris’ definition
• Make use of vast knowledge resources
– UMLS ~3 million concepts
– Terms associated to concepts are primarily in English
CABeRneT at a glance
• Task 1: Preparation of annotated corpora
• Task 2: Entity, concept extraction
• Task 3: Relation extraction
• Task 4: Retrospective analysis
• Task 5: Linking EHRs
• Task 6: Evaluation
Addressing an open public health question
What is the prevalence of Incidental Findings in patients with suspected Pulmonary Embolism or suspected thromboembolic disease? [Pham et al. 2014]
• Corpus comprising 615 deidentified radiology reports
• Annotations for entities, relations, modalities, sections
• Binary classification for Incidental Findings
Clinical insight
• Overall prevalence: 15%
• Classification to be based on follow-up
Features               P     R     F
Words                  0.43  0.32  0.37
Words + annotations    0.67  0.50  0.57
+ Sections             0.76  0.81  0.80
NLP insight
• Complex analysis useful
Linking from the electronic health record
• Information retrieval based on patient record [D’hondt et al. 2014]
– Participation in TREC 2014: retrieving articles from the literature based on clinical case description
– Methodological work on English
• Redundancy in electronic health records [D’hondt et al. 2015, 2016]
– Was shown to have impact on language models [Cohen et al. 2013]
– Links between documents within the EHR: identification of (near)-identical documents (duplicates) and of subsequent document versions
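Identifying (near)-identical documents can be illustrated with a simple overlap measure. The sketch below assumes word trigram shingles compared with Jaccard similarity and a hypothetical threshold of 0.8; it is a generic technique chosen for illustration, not the approach of the cited D’hondt et al. papers.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a document."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity between the shingle sets of two documents."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def near_duplicates(docs, threshold=0.8):
    """List (id_a, id_b, similarity) for document pairs above the threshold.

    `docs` maps a document identifier to its text. The comparison is
    quadratic in the number of documents, which is fine for a sketch.
    """
    ids = sorted(docs)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            score = jaccard(docs[first], docs[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


# Hypothetical mini-record with one copied note.
RECORD = {
    "note_1": "no pulmonary embolism seen on CT",
    "note_2": "no pulmonary embolism seen on CT",
    "note_3": "bilateral pleural effusion with left hilar mass",
}
DUPLICATES = near_duplicates(RECORD)
```

A lower threshold would also catch subsequent versions of a document, where most of the text is carried over and only a few sentences change.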
Preparation of annotated corpora
• Legal aspects
– CNIL, IRB, other…
– Implications for distribution, e.g. through shared task
• Choice of a representation scheme
– Existing schemes... reviewed in [Deléger, Campillos et al. 2017]
– Links to a knowledge source
– Which source, or sources?
– Strict guidance from knowledge source?
• Choice of annotation tool and method
– Inline vs. Standoff annotations
– Use of pre-annotations?
– Human input [Grouin et al. 2014]
QUAERO French Medical Corpus
https://quaerofrenchmed.limsi.fr [Névéol et al. 2014]
• Legal aspects
– Open source text: MEDLINE titles and OPUS EMEA subset
– Used in CLEF eHealth 2015, 2016
• Choice of a representation scheme
– Links to a knowledge source: UMLS
– Which source: all sources in UMLS
– Strict guidance: concepts, not terms
• Choice of annotation tool and method
– Inline vs. Standoff annotations: a little of both…
– Use of pre-annotations: yes [Névéol et al. 2010]
– Human input: two annotators, one reviser
QUAERO French Medical Corpus: contents
• Two Corpora
– MEDLINE titles: scientific literature, short, low redundancy
– EMEA documents: drug handouts, long, high redundancy
• Ten entities of clinical interest are annotated
– Defined according to UMLS Semantic Groups [Bodenreider & McCray, 2003]
– Anatomy, Chemicals & Drugs, Devices, Disorders, Geographic Areas, Living Beings, Objects, Phenomena, Physiology, Procedures
• Embedded and discontinuous entities
QUAERO French Medical Corpus: annotation methodology
• Corpora were pre-annotated automatically
• Two annotators initially contributed using
– Detailed annotation guide
– Reference UMLS tools: EHTOP (in French) http://www.hetop.eu/hetop/ and UTS metathesaurus browser (in English) https://uts.nlm.nih.gov/metathesaurus.html
• One expert annotator later contributed towards
– Annotation harmonization
– Annotation revision
Annotation decisions
• Concept without a French term
– NCI parathyroide intrathyroïdale C3272635
– GO pupaison C1326578
• Term not associated with concept
– MeSH « Système ABO » -> « système ABO de groupes sanguins » C0000778
– MeSH « français » -> « France » C0016674
• Concept not in the UMLS
– Daronrix (influenza vaccine)
– IONSYS (analgesic device)
QUAERO French Medical Corpus: corpus excerpt
MEDLINE: Contraception by intrauterine devices
EMEA: What is Tysabri used for? Tysabri is used to treat adults with highly active multiple sclerosis (MS).
CUIs annotated on the excerpt (figure): C1529600, C1529600, C0087111, C0001675, C0026769, C0026769, C0700589, C0344221, C0021900, C0042149
QUAERO French Medical Corpus: corpus release
• Data Format
– Stand-off (BRAT) and BioC
• Evaluation Tool
– Brateval
• Dataset statistics
– ~20% of CUIs assigned do not have a French term associated

                   EMEA                      MEDLINE
                   Train.  Dev.    Test      Train.  Dev.    Test
Tokens             14,944  13,271  12,042    10,552  10,503  10,871
Entities            2,695   2,260   2,204     2,994   2,977   3,103
Unique Entities       923     756     658     2,296   2,288   2,390
Unique CUIs           648     523     474     1,860   1,848   1,909
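To make the stand-off data format concrete, here is a minimal sketch of a reader for BRAT `.ann` content. It assumes that entities are text-bound `T` lines (with `;`-separated offsets for discontinuous entities) and that CUIs are carried in `AnnotatorNotes` lines attached to them; the sample annotations, the `DISO`/`DEVI` labels and the `Entity` class are illustrative assumptions rather than an excerpt of the released files.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    ann_id: str
    label: str
    spans: list  # list of (start, end) character offsets
    text: str
    cuis: list = field(default_factory=list)


def parse_brat(ann: str) -> dict:
    """Parse text-bound annotations and their notes from BRAT stand-off text."""
    entities = {}
    notes = []
    for line in ann.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and annotation kinds not handled here
        if line.startswith("T"):
            ann_id, type_and_offsets, text = parts
            label, offsets = type_and_offsets.split(" ", 1)
            # Discontinuous entities list several fragments separated by ';'
            spans = [
                tuple(int(x) for x in fragment.split())
                for fragment in offsets.split(";")
            ]
            entities[ann_id] = Entity(ann_id, label, spans, text)
        elif line.startswith("#"):
            # e.g. "#1<TAB>AnnotatorNotes T1<TAB>C0011860"
            target = parts[1].split()[-1]
            notes.append((target, parts[2]))
    # Attach notes after the loop so that line order does not matter.
    for target, note in notes:
        if target in entities:
            entities[target].cuis.extend(note.split())
    return entities


# Hypothetical annotation content for illustration.
SAMPLE_ANN = (
    "T1\tDISO 0 17\tdiabète de type 2\n"
    "#1\tAnnotatorNotes T1\tC0011860\n"
    "T2\tDEVI 25 36;43 50\tdispositifs utérins\n"
)
ENTITIES = parse_brat(SAMPLE_ANN)
```

Keeping offsets as a list of fragments means continuous and discontinuous entities share one representation, which simplifies exact-match comparison later on.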
CLEF eHealth 2015-2016: information extraction [Névéol et al., 2015, 2016]
• Task: Automatically identify clinically relevant entities in medical text in French
• Objective: Establish the state-of-the-art for a language other than English on core biomedical NLP tasks:
• Named Entity Recognition (with embedded entities)
– Mention level: “diabète de type 2”, “DNID”, “diabète non insulino dépendant”
• Entity Normalization
– Concept level: for instance the three mentions above can be normalized to the same UMLS concept: C0011860
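The difference between mention level and concept level can be shown with a tiny lookup. This is a sketch under the assumption that normalization is done by case- and accent-insensitive string matching against a term list; the helper names (`normalize_key`, `TERM_TO_CUI`, `normalize`) are hypothetical, and only the three mentions and the CUI come from the slide.

```python
import unicodedata


def normalize_key(mention: str) -> str:
    """Lower-case, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFD", mention.lower())
    no_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return " ".join(no_accents.split())


# The three mentions from the slide, all pointing at the same UMLS concept.
TERM_TO_CUI = {
    normalize_key(term): "C0011860"
    for term in (
        "diabète de type 2",
        "DNID",
        "diabète non insulino dépendant",
    )
}


def normalize(mention: str):
    """Return the CUI for a mention, or None when the term is unknown."""
    return TERM_TO_CUI.get(normalize_key(mention))
```

For example, `normalize("Diabete de type 2")` and `normalize("dnid")` both resolve to `C0011860`, while an unlisted term returns `None`.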
Plain Entity Recognition
– Given: plain text
MEDLINE: La contraception par les dispositifs intra utérins
EMEA: Dans quel cas Tysabri est-il utilisé? Tysabri est utilisé dans le traitement des adultes atteints de sclérose en plaques ( SEP ).
– Expected: plain entity annotations
Normalized Entity Recognition
– Given: plain text
MEDLINE: La contraception par les dispositifs intra utérins
EMEA: Dans quel cas Tysabri est-il utilisé? Tysabri est utilisé dans le traitement des adultes atteints de sclérose en plaques ( SEP ).
– Expected: normalized entity annotations
Entity Normalization
– Given: plain entity annotations
– Expected: normalized entity annotations
Methods used
• For entity recognition (7 teams in 2015, 5 teams in 2016)
– Machine learning, e.g. (cascading) CRF
– Lexical matching (sources: dictionary built from training data or existing terminologies; matching method: n-gram, bag-of-words)
– Machine translation + MetaMap
• For entity normalization (3 teams in 2015, 2 teams in 2016)
– Lexical matching
– Machine translation + MetaMap
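As an illustration of lexical matching with a dictionary built from training data, the sketch below does greedy longest-match lookup of token n-grams. The training pairs, the entity labels and the function names are hypothetical, and the sketch does not reflect any particular participant's system; in particular it returns non-overlapping matches only, so it cannot produce embedded entities.

```python
def build_dictionary(training_annotations):
    """Map lower-cased token tuples to the label seen in training data."""
    return {
        tuple(mention.lower().split()): label
        for mention, label in training_annotations
    }


def match_entities(tokens, dictionary, max_n=5):
    """Greedy longest-match lookup; returns (start, end, label) token spans."""
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate n-gram first, then shorter ones.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + n])
            if key in dictionary:
                found.append((i, i + n, dictionary[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found


# Hypothetical training annotations (mention, label).
TRAINING = [
    ("sclérose en plaques", "DISO"),
    ("SEP", "DISO"),
    ("adultes", "LIVB"),
]
DICTIONARY = build_dictionary(TRAINING)
TOKENS = (
    "Tysabri est utilisé dans le traitement des adultes atteints "
    "de sclérose en plaques ( SEP )."
).split()
MATCHES = match_entities(TOKENS, DICTIONARY)
```

On the EMEA sentence this finds "adultes", "sclérose en plaques" and "SEP"; anything absent from the dictionary, such as "Tysabri" here, is missed, which is the usual precision-over-recall trade-off of dictionary methods.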
Evaluation metrics
• Precision, recall and F-measure
– $P = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$
– $R = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$
– $F_\beta = \frac{(1 + \beta^2) \times P \times R}{\beta^2 \times P + R}$
• Primary metric was exact match micro-averaged F1 over all entity types
– Inexact match results were generally higher but exhibited the same trend as exact match
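A short sketch of how exact-match, micro-averaged scores can be computed, using the standard F-measure with weight β (β = 1 gives F1). It assumes each entity is represented as a hypothetical `(document, start, end, type)` tuple, so that pooling all tuples across documents and types yields the micro average; this is an illustration, not the Brateval implementation.

```python
def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """F-measure from precision and recall; beta=1 gives F1."""
    b2 = beta ** 2
    denominator = b2 * p + r
    return (1 + b2) * p * r / denominator if denominator else 0.0


def precision_recall_f(gold, predicted, beta: float = 1.0):
    """Exact-match scores over pooled entity tuples (micro average)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # predicted entities that match gold exactly
    fp = len(predicted - gold)   # predicted entities with no exact gold match
    fn = len(gold - predicted)   # gold entities that were missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, f_beta(p, r, beta)


# Hypothetical gold and system annotations for one document.
GOLD = {
    ("doc1", 0, 7, "CHEM"),
    ("doc1", 43, 50, "LIVB"),
    ("doc1", 63, 82, "DISO"),
}
PREDICTED = {
    ("doc1", 0, 7, "CHEM"),
    ("doc1", 63, 82, "DISO"),
    ("doc1", 63, 71, "DISO"),  # wrong boundaries: counts as a false positive
}
P, R, F1 = precision_recall_f(GOLD, PREDICTED)
```

Because matching is exact, a prediction with the right type but wrong boundaries is penalised twice, once as a false positive and once as a missed gold entity, which is why inexact-match scores tend to be higher.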
Results: plain entity recognition, EMEA (2016)
        P      R      F
Zipf    0.734  0.434  0.546
Results: plain entity recognition, MEDLINE (2016)
        P      R      F
Zipf    0.726  0.300  0.425
CépiDC corpus excerpt [Lavergne et al. 2016]
Raw death certificate lines:
2013;2;85;4;1;DENUTRITION DESHYDRATATION; E46 E86
2013;2;85;4;2. DEMENCE MIXTE EVOLUEE (stade sévère); F03
2013;2;85;4;5. Maladie de Parkinson idiopathique Angioedème des membres sup récent non exploré par TDM (à priori pas de cause médicamenteuse); G200 R600
English gloss of the text fields:
• Malnutrition dehydration
• Advanced mixed dementia (late stage)
• Idiopathic Parkinson Disease. Recent angioedema in upper extremities, no CT (unlikely to be drug induced).
Field legend:
• 2013: coded in 2013
• 2;85: deceased was an 85 y. o. female
• 4: death occurred in a hospice or retirement home
• Following field: death certificate line number
• Final field: ICD10 codes
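The layout of a raw line can be sketched as a small parser. The field order (year coded, gender code, age, location code, line number, text, ICD10 codes) is inferred from the slide's legend and is an assumption, as are the class and function names; the sketch targets the first line of the excerpt, since the other two lines do not follow the same separator pattern as transcribed.

```python
from dataclasses import dataclass


@dataclass
class CertificateLine:
    year_coded: int
    gender_code: int    # 2 corresponds to "female" in the excerpt
    age: int
    location_code: int  # 4 corresponds to "hospice or retirement home"
    line_number: int
    text: str
    icd10_codes: list


def parse_line(raw: str) -> CertificateLine:
    """Split one semicolon-separated line into typed fields."""
    # The first five fields are numeric; the remainder holds text and codes.
    year, gender, age, location, line_no, rest = raw.split(";", 5)
    # Codes follow the last semicolon and are separated by spaces.
    text, _, codes = rest.rpartition(";")
    return CertificateLine(
        year_coded=int(year),
        gender_code=int(gender),
        age=int(age),
        location_code=int(location),
        line_number=int(line_no),
        text=text.strip(),
        icd10_codes=codes.split(),
    )


PARSED = parse_line("2013;2;85;4;1;DENUTRITION DESHYDRATATION; E46 E86")
```

Splitting on the last semicolon rather than the sixth keeps any semicolons inside the free text from being mistaken for the code separator.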
Results: ICD10 coding, CépiDC
        P      R      F
Zipf    0.531  0.245  0.336

# systems   5       4       3       2       1       0
# codes     29,100  25,215  20,743  15,933  10,685  7,714
Lessons learned
• Text pre-processing is important
– (Formatting yielded technical problems in 2015)
• Using terminology information in more than one language was successful
– Terms translated from English helped
– Even extensive French resources were limited
Additional remarks
• 2 out of 7 teams were from non-French speaking countries
• This continues to be the only biomedical international challenge addressing a language other than English
– Tasks were challenging
– ICD10 coding yielded higher performance
• Both knowledge-based and machine learning methods showed potential for ICD10 coding
Acknowledgments
LIMSI
L. Campillos
L. Deléger
E. D’hondt
C. Grouin
T. Hamon
T. Lavergne
A-L. Ligozat
F. Morlane-Hondère
C. Rabary
X. Tannier
MD. Tapi-Nzali
P. Zweigenbaum
HEGP
A. Burgun
J-B Escudié
A-S Jannot
A-D. Pham
B. Rance
Harvard Children’s Hospital
G. Savova
P. Chen
CHU de Rouen
S-J Darmoni
N. Griffon
J. Grosjean