Natural Language Processing: A Healthcare-Oriented Tutorial Philip Resnik, Ph.D. March 22, 2011.
Natural Language Processing: A Healthcare-Oriented Tutorial
Philip Resnik, Ph.D.
March 22, 2011
What is natural language processing?
• Natural language = naturally occurring human language
– Excludes Java, C++, Klingon, …
• Natural language processing
– The branch of information science that deals with natural language information (syn: NLP, human language technology) [WordNet 1.7]
– a range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of particular tasks or applications [E. Liddy, ASIS bulletin, April 1998]
NLP is everywhere…
Overview
• Personal background
• Historical perspective and high level overview
• Elements of NLP for computer assisted coding
• Evaluating how well a system is doing
• Confidence estimation and “auto-coding”
• Understanding rule-based and statistical methods
• The importance of natural language input
• Where NLP is headed
• Some key take-aways
Personal background
Hiya, Hal. Can you believe Tennessee beat the Giants on Sunday?!
• SHRDLU (Winograd)
– Person: PICK UP A BIG RED BLOCK.
– Computer: OK. (does it)
…
– Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE HOLDING AND PUT IT INTO THE BOX.
– Computer: BY “IT”, I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
– Computer: OK. (does it)
– Person: WHAT DOES THE BOX CONTAIN?
– Computer: THE BLUE PYRAMID AND THE BLUE BLOCK.
– Person: WHAT IS THE PYRAMID SUPPORTED BY?
– Computer: THE BOX.
block n.
1. a. A solid piece of a hard substance, … having one or more flat sides.
b. Such a piece used as a construction member or as a support.
c. Such a piece upon which chopping or cutting is done: a butcher's block.
d. Such a piece upon which persons are beheaded.
e. One of a set of small wooden or plastic pieces, such as a cube, bar, or cylinder, used as a building toy.
f. Printing A large amount of text.
g. Sports A starting block.
2. A stand from which articles are displayed and sold at an auction: Many priceless antiques went on the block.
3. A mold or form on which an item is shaped or displayed: a hat block.
4. A substance, such as wood or stone, that has been prepared for engraving.
5. a. A pulley or a system of pulleys set in a casing.
b. An engine block.
6. A bloc.
7. A set of like items, such as shares of stock, sold or handled as a unit.
8. A group of four or more unseparated postage stamps forming a rectangle.
9. Canadian A group of townships in an unsurveyed area.
10. a. A usually rectangular section of a city or town bounded on each side by consecutive streets.
b. A segment of a street bounded by consecutive cross streets and including its buildings and inhabitants.
11. A large building divided into separate units, such as apartments.
12. A length of railroad track controlled by signals.
13. The act of obstructing.
14. Something that obstructs; an obstacle.
15. a. Sports An act of bodily obstruction, as of a player or ball.
b. Football Legal interference with an opposing player to clear the path of the ball carrier.
16. Medicine Interruption, especially obstruction, of a normal physiological function: nerve block.
17. Psychology A sudden cessation of speech or a thought process without an immediate observable cause, sometimes considered a consequence of repression. Also called mental block.
18. Slang The human head: threatened to knock my block off.
19. A blockhead.
Artificial examples
MIT
Stanford
Edinburgh
LUNAR (Woods, 1973)
What is the average concentration of iron in ilmenite?
Give me references on sector zoning.
What is the average weight of all your samples?
Why didn’t it work?
• Ambiguity and the lack of world knowledge
– Iraqi Head Seeks Arms
• Lack of scalability / need for hand-crafting
– How many rules does it take to “get it right”, and how do they get written and kept up to date?
• Lack of context awareness
– The rule works in one context, but what about all the other contexts you haven’t considered?
• Brittleness in the face of natural variability
– What happens when you get unexpected input?
• Lack of confidence assessment
– How does the system “know” if it’s wrong?
… k uh k q ah dh ow z t ow m ey dx z d ux d …
… you should cook those tomatoes, dude …
Observed unstructured input
Correct structure
Speech recognition – a different approach
Embarrassingly concocted example and unrelated waveform.
Automatically learned pattern matching (This network recognizes a variety of valid pronunciations for “tomato”)
Effective methods
• Despite a lack of world knowledge
• Without labor-intensive hand crafting
• Accepting of a wide variety of human variation
NLP researchers
Speech researchers
¡Viva la revolución!
• 1990s
– DARPA forces speech recognition researchers and NLP researchers to get together.
– “Statistical revolution” in NLP ensues.
P. K. Agarwal and M. Sharir. Algorithmic techniques for geometric optimization. In Computer Science Today: Recent Trends and Developments, volume 1000 of Lecture Notes Comput. Sci., pages 234--253. Springer-Verlag, 1995.
author = "Pankaj K. Agarwal and Micha Sharir", title = "Algorithmic Techniques for Geometric Optimization", booktitle = "Computer Science Today", pages = "234-253", year = "1995
Observed unstructured text
Correct structure
Example: learning how to find the structure in unstructured text
Source: Geng (2002)
Automatically learned model of patterns using examples of correct structure (This network recognizes a variety of valid formats for bibliography entries)
It’s not all statistics, is it?
• Rule-based “preprocessing”
– Dividing the text into basic units (“tokens”)
• Categories
– Booktitle, Journal, Volume, Year…
• Hard constraints and rules
– Years must have 4 digits and start with ‘19’ or ‘20’
• Structure of the statistical model
– Bibliography entries are “beads on a string”
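The rule-based pieces above can be sketched in a few lines. This is only an illustration: the tokenizer pattern and the function names are assumptions made for this example, and only the year constraint (4 digits, starting with 19 or 20) comes from the slide.

```python
import re

# Illustrative tokenizer (an assumption): runs of letters/digits,
# or a single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")

# Hard constraint from the slide: exactly 4 digits, starting with 19 or 20.
YEAR_RE = re.compile(r"(?:19|20)\d{2}")


def tokenize(text):
    """Divide the text into basic units ("tokens")."""
    return TOKEN_RE.findall(text)


def is_valid_year(token):
    """Apply the hard year constraint to a single token."""
    return YEAR_RE.fullmatch(token) is not None


tokens = tokenize("Springer-Verlag, 1995.")
years = [t for t in tokens if is_valid_year(t)]
print(tokens)
print(years)
```

A statistical model would then decide which category each token belongs to; the hard rule only filters out tokens that can never be a Year.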
Why do statistical methods work well?
• Discovering the patterns in the data
– Especially from lots of correctly answered cases
• Creating “soft” constraints
– (e.g. Year probably signals the end of a bibliography entry, but it doesn’t have to)
• Graceful handling of variability, ill-formedness
– System recognizes unforeseen input as less likely, rather than treating it as unprocessable.
Sources: graph adapted from Church, K. (2003) “Speech and Language Processing: Where have we been and where are we going,” Eurospeech, Geneva, Switzerland. Green circle data have been added from figures in Cardie and Mooney (1999).
The statistical revolution in NLP
[Figure: % “statistical” papers at the Annual Meeting of the Association for Computational Linguistics, by year (x-axis 1985 to 2005, y-axis 0% to 100%)]
Driving progress: evaluation
• Up to the mid-1980s, an NLP system was evaluated by watching a demonstration.
• Over the last 20 years, NLP systems have been subjected to more rigorous evaluation and measurement.
• This has been a primary driver of progress in the field.
Typical evaluation in NLP

Method      Precision   Recall   F-measure
Baseline    34.5        42.5     38.1
Method 2    59.4        59.4     59.4 (+55.9%)
Method 3    68.0        66.6     67.3 (+76.6%)

• Baseline: obvious, simple technique or previous state of the art
• Method 2, Method 3: alternative methods or systems being evaluated
• Precision, Recall: multiple relevant measures, often capturing a tradeoff
• F-measure: single “figure of merit” for cross-system comparison
Source: Church, K. (2003) “Speech and Language Processing: Where have we been and where are we going,” Eurospeech, Geneva, Switzerland.
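As a quick check on the numbers in the table, the F-measure can be recomputed from precision and recall, and the percentages in parentheses read as relative improvement in F over the baseline. The function names below are assumptions for this sketch; the figures are the ones shown on the slide.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# (precision, recall) pairs as shown in the table.
results = {
    "Baseline": (34.5, 42.5),
    "Method 2": (59.4, 59.4),
    "Method 3": (68.0, 66.6),
}

baseline_f = round(f_measure(*results["Baseline"]), 1)
for name, (p, r) in results.items():
    f = round(f_measure(p, r), 1)
    print(name, f, f"{relative_gain(f, baseline_f):+.1f}%")
```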
What is NLP today?
• Applications and tasks, not “understanding”
• Finding the structure in unstructured text
• Learning to make good predictions, often from lots of examples that include the correct answer
• Combining knowledge sources with data-driven techniques
CAC Landscape
[Diagram labels: Traditional Coding; NLP Engine; Routing; Coder Review; Billing]
ORDER_EXAM: MRI Hd wo&w ORDER_IND: head - h/o plasmacytoma^ MR head, without and with IV gadolinium. Comparison is made with previous outside…
Relevant steps from an NLP perspective
• Context: demographics, codeset, payer specifics…
• Identifying document regions
• Identifying information units
• Combining information units
• Creating an internal representation
• Mapping to/prediction of codes
• Coding logic
Note: I am not describing any specific system! Examples are constructed for this presentation.
Mrs. Zoe is a 57-year-old female who has been having chest pains which she describes as a sharp pain, located substernally occurring at night when she tries to lie on her right side. She has not had any exertional type chest discomfort and the discomfort in her chest will last as long as she is lying in that position. …
Mrs. Zoe exercises daily, walking one and a half to three miles and also uses some weights. Her mother is age 81 and has a history of angina and congestive heart failure along with atrial fibrillation.
Weight is 150 pounds, stable. No history of thyroid dysfunction. No renal dysfunction. No gastrointestinal symptoms. No asthma, wheezing, or lung problem. Is having menopausal symptoms. No claudication. Neurologic is negative.
Identifying document regions
Mrs. Zoe is a 57-year-old female who has been having chest pains which she describes as a sharp pain, located substernally occurring at night when she tries to lie on her right side. She has not had any exertional type chest discomfort and the discomfort in her chest will last as long as she is lying in that position. …
Mrs. Zoe exercises daily, walking one and a half to three miles and also uses some weights. Her mother is age 81 and has a history of angina and congestive heart failure along with atrial fibrillation.
Weight is 150 pounds, stable. No history of thyroid dysfunction. No renal dysfunction. No gastrointestinal symptoms. No asthma, wheezing, or lung problem. Is having menopausal symptoms. No claudication. Neurologic is negative.
History of present illness
Past medical history
Family history
Review of systems
Some other kinds of regions
• Negated
– No evidence of pneumonia
• Equivocal or Modal
– … could represent atelectasis…
– … likely fracture…
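One simple way to flag such regions is to look for cue phrases. The sketch below is an illustration only, not a description of any specific system: the cue lists are tiny and hypothetical, and a real system would need many more cues plus some notion of how far each cue's scope extends.

```python
import re

# Hypothetical, deliberately tiny cue lists (assumptions for this example).
NEGATION_CUES = [r"\bno evidence of\b", r"\bno\b", r"\bnegative\b"]
EQUIVOCAL_CUES = [r"\bcould represent\b", r"\blikely\b", r"\bprobably\b"]


def classify_assertion(sentence):
    """Label a sentence 'negated', 'equivocal', or 'asserted' by cue matching."""
    s = sentence.lower()
    if any(re.search(cue, s) for cue in NEGATION_CUES):
        return "negated"
    if any(re.search(cue, s) for cue in EQUIVOCAL_CUES):
        return "equivocal"
    return "asserted"


for text in ["No evidence of pneumonia",
             "This could represent atelectasis",
             "Is having menopausal symptoms."]:
    print(text, "->", classify_assertion(text))
```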
Sentence breaking
Mrs. Zoe is a 57-year-old female who has been having chest pains which she describes as a sharp pain, located substernally occurring at night when she tries to lie on her right side. She has not had any exertional type chest discomfort and the discomfort in her chest will last as long as she is lying in that position. …
Mrs. Zoe exercises daily, walking one and a half to three miles and also uses some weights. Her mother is age 81 and has a history of angina and congestive heart failure along with atrial fibrillation. Weight is 150 pounds, stable. No history of thyroid dysfunction. No renal dysfunction. No gastrointestinal symptoms. No asthma, wheezing, or lung problem. Is having menopausal symptoms. No claudication. Neurologic is negative.
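Sentence breaking looks trivial until an abbreviation such as “Mrs.” shows up. A minimal sketch follows, assuming a small hand-made abbreviation list (hypothetical) and the simple heuristic that a sentence ends at '.', '?' or '!' followed by whitespace and a capital letter.

```python
import re

# Hypothetical abbreviation list: a period after these does not end a sentence.
ABBREVIATIONS = {"mrs", "mr", "dr", "ms"}


def split_sentences(text):
    """Split at '.', '?' or '!' followed by whitespace and a capital letter,
    unless the word before the period is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]\s+(?=[A-Z])", text):
        preceding = text[start:match.start()].split()
        last_word = preceding[-1].lower() if preceding else ""
        if last_word in ABBREVIATIONS:
            continue  # e.g. the period in "Mrs. Zoe"
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = ("Mrs. Zoe exercises daily, walking one and a half to three miles "
          "and also uses some weights. Weight is 150 pounds, stable. "
          "No claudication. Neurologic is negative.")
for sentence in split_sentences(sample):
    print(sentence)
```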
Morphological analysis
Mrs. Zoe is a 57-year-old female who has been having chest pains which she describes as a sharp pain, located substernally occurring at night when she tries to lie on her right side.
pains = pain + PLURAL
In this context, pains is the same as pain.
Sometimes singular vs. plural matters, e.g. cyst is different from cysts.
English morphology happens to be pretty simple. It’s not as simple for other languages.
substernally = sub (below) + sternal (sternum) + ly (related to)
Approaches to identifying/combining information units
Creating internal representations of evidence
Symptom: pain
Degree: sharp
Loc: chest
LocMod: substernal
Source: HPI
…
Chest pains which she describes as a sharp pain...
Sharp pain in her chest…
Sharp chest pain…
Chest pain which feels sharp…
An aside: words, terms, and meanings
• Multi-word expressions are logical units
– Myocardial infarction
• Synonymy: many expressions → one meaning
– Myocardial infarction, MI, heart attack
• Ambiguity: one expression → many meanings
– Neck, head, depression
• These issues can be addressed using knowledge-based methods (e.g. terminologies) and/or statistical methods.
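A knowledge-based treatment of multi-word expressions and synonymy can be sketched as a longest-match lookup in a terminology. The mini-terminology and the concept label below are hypothetical, invented for this example; the sketch also does nothing about ambiguity (here “MI” is always read as myocardial infarction), which is where context or statistical methods come in.

```python
# Hypothetical mini-terminology: surface expression -> concept label.
TERMINOLOGY = {
    "myocardial infarction": "MYOCARDIAL_INFARCTION",
    "mi": "MYOCARDIAL_INFARCTION",
    "heart attack": "MYOCARDIAL_INFARCTION",
}
MAX_TERM_LENGTH = max(len(term.split()) for term in TERMINOLOGY)


def find_concepts(text):
    """Greedy longest-match lookup of (possibly multi-word) terms."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    concepts, i = [], 0
    while i < len(words):
        # Try the longest candidate first so multi-word terms win.
        for n in range(min(MAX_TERM_LENGTH, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in TERMINOLOGY:
                concepts.append(TERMINOLOGY[candidate])
                i += n
                break
        else:
            i += 1  # no term starts at this word
    return concepts


print(find_concepts("Prior MI, no recent heart attack"))
```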
Copyright © 2007 by the American Health Information Management
Association. All rights reserved.
Another aside: ontologies
• Ontologies are (typically hierarchical) specifications of concepts and relationships between concepts, which encode knowledge about a domain.
• “Coding” can also map from language to concepts in an ontology.
• Ontologies support limited forms of reasoning and inference.
• This is not “understanding” in any usual sense of the term.
[Diagram labels: plasmacytoma; cancer; disease; prednisone; steroid; drug; treats]
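The “limited forms of reasoning” an ontology supports can be illustrated with a toy example. The links below are an assumption: they are one plausible reading of the diagram labels (plasmacytoma, cancer, disease, prednisone, steroid, drug, treats), not a reconstruction of the original figure.

```python
# Hypothetical IS-A links and a single "treats" relation (assumed for illustration).
IS_A = {
    "plasmacytoma": "cancer",
    "cancer": "disease",
    "prednisone": "steroid",
    "steroid": "drug",
}
TREATS = {("prednisone", "plasmacytoma")}


def ancestors(concept):
    """All concepts reachable by following IS-A links upward."""
    found = []
    while concept in IS_A:
        concept = IS_A[concept]
        found.append(concept)
    return found


def is_a(concept, category):
    """Simple inference: does `concept` fall under `category`?"""
    return category in ancestors(concept)


def drug_treats_disease(a, b):
    """True if a is a drug, b is a disease, and a 'treats' link connects them."""
    return (a, b) in TREATS and is_a(a, "drug") and is_a(b, "disease")


print(ancestors("plasmacytoma"))
print(drug_treats_disease("prednisone", "plasmacytoma"))
```

Following links like this is useful, but it is lookup and chaining, not “understanding”.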
Back from our asides: Mapping to/predicting codes
• Rule-based matching
– Match representation → assign code
• Statistical prediction
– Statistical prediction of code based on aggregated data
Symptom:pain
Degree: sharp
Loc: chest
LocMod: substernal
Source: HPI
…
One expert’s judgment.
719.07
Will another expert agree?
824.8
Lots of evidence → reliable conclusions
824.9
824.8 824.8 824.8 824.8 824.8 824.8 824.8
719.07 719.07
Statistical prediction
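The idea of “lots of evidence” can be sketched as counting which code has been assigned most often to similar evidence. The counts below loosely follow the codes shown on the slide and are illustrative only; real statistical prediction would condition on far richer evidence than a single list.

```python
from collections import Counter

# Illustrative counts of codes previously assigned to similar evidence.
observed_codes = ["824.8"] * 7 + ["719.07"] * 2 + ["824.9"]


def predict_code(codes):
    """Return the most frequent code and its relative frequency."""
    counts = Counter(codes)
    code, count = counts.most_common(1)[0]
    return code, count / len(codes)


code, share = predict_code(observed_codes)
print(code, share)
```

The relative frequency doubles as a rough indication of how much the aggregated evidence agrees.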
Machine learning: an example
Tom Mitchell (1997), Machine Learning, McGraw-Hill
Cf. C. Sims et al., Predicting cesarean delivery with decision tree models, Annual Meeting of the Society for Maternal-Fetal Medicine No11 (31/01/2000) 2000, vol. 183, no 5, pp. 1049-1231 (14 ref.), pp. 1198-1206
If
normal fetal presentation and no previous C-section and first pregnancy and no fetal distress and birth weight > 3349g
Then
Predict C-section with likelihood of 22%
Machine learning: an example
If
normal fetal presentation and previous C-section
Then
Predict C-section with likelihood of 39%
(regardless of first pregnancy, fetal distress, birth weight)
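Written as code, the two learned rules above look just like hand-written rules; the difference is that a learning algorithm derived them, and their likelihoods, from data. The function and parameter names are assumptions for this sketch; the conditions and the 22% and 39% figures are the ones on the slides.

```python
def predict_c_section(normal_presentation, previous_c_section,
                      first_pregnancy, fetal_distress, birth_weight_g):
    """Return the C-section likelihood from the two rules on the slides,
    or None if neither rule applies."""
    if normal_presentation and previous_c_section:
        # Applies regardless of first pregnancy, fetal distress, birth weight.
        return 0.39
    if (normal_presentation and not previous_c_section and first_pregnancy
            and not fetal_distress and birth_weight_g > 3349):
        return 0.22
    return None


print(predict_c_section(True, False, True, False, 3500))
print(predict_c_section(True, True, False, True, 2900))
```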
Machine learning: another example
??
Evidence 1
Evidence 2
(Now do this in 5,000 dimensions…)
Coding logic
• General logic
– Pertinent vs. incidental findings
– Choice of primary code
– Code combination
• Client or payer-specific logic
Codes (along with the evidence that produced them)
Recall: did you assign all the codes you should have?
Precision: did you assign any extra codes you shouldn’t have?
Hand-crafted rules
Matching terms in terminologies
Machine learning
How well are we doing?
How NLP people evaluate NLP systems
• Evaluation by demonstration
• Evaluation by inspection of examples
• Evaluation by unscripted demonstration
• Evaluation on data using a figure of merit
• Evaluation on test data using an automatic metric
• Evaluation on common test data
• Evaluation on common, unseen test data
Analysis and insights driving improvement
CAC Evaluation Landscape
[Diagram labels: Transcription; Traditional Coding; NLP Engine; Routing; Coder Review; QA; Billing; Audit; Traditional evaluation methods]
CAC Evaluation Landscape
• What auditing (post hoc evaluation) can’t provide:
– Replicability: ability for you to implement my system and verify that we get exactly the same result
– Comparability: ability to evaluate two different systems fairly against each other
– Tracking: ability to evaluate one system against itself at two points in time
– Automation: ability to perform rapid “devtest” evaluations
– Fidelity: avoiding the “benefit of the doubt” effect, which inflates inter-coder agreement estimates
[Diagram labels: System, Test set; System, Test set]
The NLP community has converged on a standard approach to these problems…
Gold Standards in NLP Technology Evaluation
“Gold standard”: annotated test set
• Create a representative set of test items
– Make sure it’s not used for development!
• Have multiple annotators independently provide their correct answers
• Adjudicate inter-annotator disagreements
– Consensus by discussion, voting, …
• Define the “upper bound” on performance as pre-adjudication inter-annotator agreement.
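The procedure above can be sketched with two small functions: one measures pre-adjudication agreement, the other adjudicates by voting. Plain pairwise agreement is used here as an assumption; other agreement measures exist. The annotator names and labels are hypothetical.

```python
from collections import Counter


def pairwise_agreement(annotations):
    """Average fraction of items on which each pair of annotators agrees."""
    annotators = list(annotations)
    scores = []
    for i in range(len(annotators)):
        for j in range(i + 1, len(annotators)):
            a, b = annotations[annotators[i]], annotations[annotators[j]]
            scores.append(sum(x == y for x, y in zip(a, b)) / len(a))
    return sum(scores) / len(scores)


def adjudicate_by_vote(annotations):
    """Majority vote per item; ties need discussion, so they come back as None."""
    gold = []
    for labels in zip(*annotations.values()):
        (top, n), *rest = Counter(labels).most_common()
        gold.append(None if rest and rest[0][1] == n else top)
    return gold


# Hypothetical labels from three annotators on four test items.
annotations = {
    "A": ["824.8", "824.8", "719.07", "824.9"],
    "B": ["824.8", "824.8", "824.8", "824.9"],
    "C": ["824.8", "824.9", "824.8", "824.9"],
}
print(pairwise_agreement(annotations))  # the human "upper bound"
print(adjudicate_by_vote(annotations))  # the adjudicated gold standard
```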
Gold ≠ Perfect
• NLP gold standards accommodate legitimate gray areas
阿富汗地震灾民开始重建家园
– earthquake victims in afghanistan start to rebuild homeland
– earthquake victims start reconstruction in afghanistan
– afghans begin restoring home after quake
– afghan earth quake victims begin to rebuild their homes
Sanchez went to the bank for a loan
– location
– organization
• NLP gold standard evaluations define human upper bounds:
[We] estimate an upper bound on performance by estimating the ability for human judges to agree with one another (Gale, Church, and Yarowsky, 1992)
Intrinsic and Extrinsic NLP Evaluations
[Diagram labels: Transcription; Traditional Coding; NLP Engine; Routing; Coder Review; QA; Billing; Audit; Gold standard; Intrinsic evaluation; Extrinsic evaluation; Traditional evaluation methods; Language technology evaluation standards; Upper bounds; Inter-annotator agreement; Intra-annotator agreement]
Intrinsic and Extrinsic NLP Evaluations: Example
…trauma …
…trauma…
How accurately do we resolve ambiguous terms?
How well do we facilitate searches for clinical information?
C0043251: Injuries and Wounds:Wounds and Injuries: trauma: traumatic disorders: Traumatic injury:
C0597316: Shock; psychological shock
Measures of Effectiveness
• Good measures of effectiveness should
– Capture some aspect of what the user wants
  • Pertinent, valid, meaningful
– Have predictive value for other situations
  • Different test data, different coders
– Be easily replicated by others
– Be expressed as a single number
  • Allows two systems to be easily compared
Some Principles for Good Language Technology Evaluation
• Pertinent evaluation metrics
• Replicable evaluation metrics
• Reporting all relevant experimental parameters
• Establishing upper bounds on performance
• Establishing lower bounds on performance
• Testing statistical significance
• Never allowing developers to see the test data
Evaluation using recall and precision
• Evaluation metrics
– Precision
– Recall
– F = 2PR/(P+R)
Evaluee’s output: Precision P = 3/5, Recall R = 3/6
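The same calculation can be sketched on sets. The item labels are hypothetical; the sizes are chosen so that the system outputs 5 items, the gold standard has 6, and 3 are shared, which gives P = 3/5 and R = 3/6.

```python
def precision_recall_f(system_output, gold):
    """Precision, recall, and F for a set of system items against a gold set."""
    system_output, gold = set(system_output), set(gold)
    correct = len(system_output & gold)
    p = correct / len(system_output) if system_output else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical item labels: 6 gold items, 5 system items, 3 in common.
gold = {"a", "b", "c", "d", "e", "f"}
system_output = {"a", "b", "c", "x", "y"}

p, r, f = precision_recall_f(system_output, gold)
print(p, r, round(f, 3))
```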
Recall/Precision tradeoffs
[Figure: precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1) for system1, system2, and system3]
Auto-coding/Accuracy tradeoffs
• It is meaningless to report recall without reporting precision (or vice versa).
– Any system can easily get high recall by sacrificing precision.
• For the same reason, it is meaningless to report direct-to-bill volumes without also reporting how accurate the codes are.
Another metric: coder change rates
• To measure NLP engine accuracy
– Compare the NLP engine output to the final codes assigned during human QA/reviewing.
– Assume any changed code was incorrect.
– Assume any unchanged code was correct.
– Accuracy is simply #correct / (#correct + #incorrect)
– “Change rate” = 1 − accuracy
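A minimal sketch of the change-rate calculation, treating an engine code as unchanged if it still appears among the final codes. The function name and the example codes are assumptions for illustration. Note that, following the definition above, codes a reviewer adds do not enter the calculation.

```python
def change_rate_accuracy(engine_codes, final_codes):
    """Accuracy = #correct / (#correct + #incorrect), where an engine code
    counts as correct if review left it unchanged.
    Returns (accuracy, change_rate), with change_rate = 1 - accuracy."""
    final = set(final_codes)
    correct = sum(1 for code in engine_codes if code in final)
    incorrect = len(engine_codes) - correct
    if correct + incorrect == 0:
        raise ValueError("no engine codes to evaluate")
    accuracy = correct / (correct + incorrect)
    return accuracy, 1 - accuracy


# Hypothetical example: the reviewer replaced one of four engine codes.
engine = ["70553", "784.2", "719.07", "824.8"]
final = ["70553", "784.2", "824.8", "824.9"]
print(change_rate_accuracy(engine, final))
```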
Limitations of coder change rates
• One limitation we would expect
– “Benefit of the doubt” effects lead to inflated estimates of coding accuracy. (Morris et al. 2000, Nossal et al. 2006)
– This is a problem with all post hoc methods, including formal auditing.
• Another limitation we would not expect
– It turns out that coders sometimes change correct auto-generated codes to incorrect codes.
– This is true for both CPT and ICD codes.
– This is true even for good coders.
– (See Stoner et al. 2006.)
Confidence: A Paradox
The more reliable a system is, the more we trust it.
But the more we trust it, the more important that the system itself alert us when we shouldn’t trust it!
An automatic translation system’s output (from Arabic):
U.S. is a terrorist state and says Syria.
Seems pretty good, right?
The real translation: U.S. says Syria is a terrorist state. !!!
Principled confidence measures
• The more accurate the automated technology gets, the more people trust it.
• Therefore, the more important it is for the system to assess accurately for itself whether or not its decisions require human review.
• Computer assisted coding (CAC) systems need a principled basis for evaluating their own correctness at run time, in order to avoid representing sub-par coding results as trustworthy.
Some Possibilities
• Auditor = Coder: using internally-driven engine confidence
• Rules of thumb: “When the engine assigns these codes, they’re generally right…”
• Table driven: allow only valid CPT/ICD combinations
• Confidence assessment (uses CPT/ICD only): Pr(Correct | CPT, ICD)
• Situated confidence assessment (uses context): Pr(Correct | CPT, ICD, evidence, steps)
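The simpler of the two probabilistic options, Pr(Correct | CPT, ICD), can be sketched as a relative frequency estimated from past review outcomes, with a threshold deciding what goes straight to billing. The history, the threshold value, and the function names are all assumptions for this example; a situated assessment would also condition on the evidence and processing steps.

```python
from collections import defaultdict


def estimate_confidence(history):
    """Estimate Pr(Correct | CPT, ICD) as the fraction of past engine outputs
    with that code pair that reviewers left unchanged."""
    totals, correct = defaultdict(int), defaultdict(int)
    for cpt, icd, was_correct in history:
        totals[(cpt, icd)] += 1
        correct[(cpt, icd)] += int(was_correct)
    return {pair: correct[pair] / totals[pair] for pair in totals}


def route(cpt, icd, confidence, threshold=0.9):
    """Send to billing only when estimated confidence clears the threshold;
    unseen code pairs always go to a coder."""
    if confidence.get((cpt, icd), 0.0) >= threshold:
        return "billing"
    return "coder review"


# Hypothetical review history: (CPT, ICD, engine output was correct?)
history = ([("70553", "784.2", True)] * 19 + [("70553", "784.2", False)]
           + [("70553", "719.07", True)] * 3 + [("70553", "719.07", False)] * 2)

confidence = estimate_confidence(history)
print(route("70553", "784.2", confidence))
print(route("70553", "719.07", confidence))
```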
Billing
NLP Engine Routing
Engine Coding versus Confidence Assessment
Metadata evidence
Language evidence
Looking at all the evidence in the chart, which codes are the best choice?
Looking at chosen codes and how the evidence led to them, how confident are we that those codes are correct?
Sufficiently confident
“Coder” “Auditor”
Subject: B!G OPP@RTUN1TY
To: [email protected], [email protected], alice@foo,com, [email protected], ……
W A R N I N G !IF YOU RECEIVE AN E-MAIL WITH SUBJECT"A VIRTUAL CARD FOR YOU"DO NOT OPEN !!!IT CONTAINS A VERY, VERY DANGEROUS VIRUS.IT WAS CLASSIFIED YESTERDAY BY MICROSOFT AND MCAFEE AS THE MOST DESTRUCTIVE VIRUS OF ALL TIMES. VIRUS DESTROY HARD DISK. WITHOUT POSSIBILITY TO REPAIR. PLEASE SEND THIS MESSAGE TO EVERYBODY YOU KNOW !
Mail CategoryLabeling Routing
An Analogy: Mail Routing
E-mail header
E-mail body
Looking at all the evidence in the e-mail message, should we code this mail as “good” or as “spam”?
Looking at the choice that was made (“good” or “spam”) and at the evidence leading to that choice, how confident are we that the choice is correct?
evidence
If your mail’s going straight here, your routing had better be trustworthy!
Sufficiently confident
Understanding rule-based and statistical methods
• The terms “rule based” and “statistical” show up frequently when people are trying to assess NLP-based CAC solutions.
• The characterization of these language technology approaches seems to be confusing to a lot of people.
• This piece: help clarify what these terms mean, so that potential users of the technology understand how they relate to each other, and have an idea what questions to ask.
ORDER_EXAM: MRI Hd wo&w ORDER_IND: head - h/o plasmacytoma^ MR head, without and with IV gadolinium. Comparison is made with previous outside MR head examinations 5/3/04 and 11/16/04. On the earliest outside examination, there was a mass in the right central skull base, extending infratemporal fossa, sphenoidsinus, and foramen ovale. This subsequently was demonstrated to represent a plasmacytoma. This mass is markedly reduced in size on the subsequent outside MR. Our examination continues to show abnormal signal and peripheral enhancement, in the right central skull base, and involving right clivus, right sphenoid, and base of right pterygoid. This is probably stable when compared with 11/16/04, but is considerably smaller than 5/3/04. The infratemporal soft tissue component of the lesion has resolved. No new or progressing bone lesion. Incidental note is made of a small amount of hemosiderin deposition within the cortex of the left parietal operculum without abnormal enhancement. This could represent cryptic vascular malformation, or chronic lacunar infarct. Mild cerebral leukoaraiosis.
history - head plasmacytoma
0 - skull mass 0 - head {abnormality} 0 - head plasmocytoma
CPT: 70553
ICD: 784.2
type modifier bodypart diag/problem
delimit identify normalize extract predict apply_logic assess
Let’s start with an exercise
• We’re building a system to recognize spoken medical terms.
• Let’s handle two words:– infarction– infection
• What does the system need to “know”?
if the system detects
(in or en) FARK (shun or shin)
then it should recognize the term ‘infarction’
Did you remember FACK?
If not, there’s a doctor in Hahvid Yahd who’s upset with you.
FACK???
RULE
PATTERN (antecedent): if the system detects (in or en) (FARK or FACK) (shun or shin)
ACTION/CONCLUSION (consequent): then it should recognize the term ‘infarction’
if the system detects
(in or en) (FECK) (shun or shin)
then it should recognize the term ‘infection’
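The two rules can be written down directly as pattern/conclusion pairs. This is a sketch: the syllable spellings follow the slides, while the space-separated input format and the names are assumptions made for the example.

```python
import re

# Each rule pairs a PATTERN (antecedent) with a CONCLUSION (consequent).
RULES = [
    (re.compile(r"(in|en) (FARK|FACK) (shun|shin)"), "infarction"),
    (re.compile(r"(in|en) FECK (shun|shin)"), "infection"),
]


def recognize(syllables):
    """Return the term whose pattern matches the whole input, or None."""
    for pattern, term in RULES:
        if pattern.fullmatch(syllables):
            return term
    return None


print(recognize("en FACK shin"))
print(recognize("in FECK shun"))
print(recognize("in FOCK shun"))
```

Anything the rule writer did not anticipate simply fails to match, which is the brittleness discussed earlier.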
in (FECK or FARK) shun
What do we do now?
Use context
… a severe ___________
… a previous myocardial ___________
Did you remember related to?
___________ related to…
Rule-based methods
• Encode expert knowledge• Are generally human-readable
• Historically have had trouble with the variety and variability of real-world language use
• Make all-or-nothing decisions, rather than encoding gradations or confidence (in a principled way)
• Are challenging to support
Rule-based NLP
Eight documents??!!(Eight documents?!)
Hirschman and Sager (1976)
[Timeline]
1970–1983: “natural language understanding”
1983–1993: “the return of empiricism… probabilistic models throughout speech and language processing”
1994–1999: “the field comes together”
2000–2008: “the rise of machine learning”
Jurafsky and Martin (2009), Speech and Language Processing
… a severe ___________
… a previous myocardial ___________
___________ related to…
Machine learning: a very simple example
infection infection infection
infarction infection infection
infection infection infection
infection infection infection
.90 infection
.10 infarction
infarction infarction infarction
infarction infarction infarction
infarction infarction infarction
infarction infarction infarction
.0001 infection
.9999 infarction
infection infection infection
infarction infection infection
infection infection infection
infection infection infection
.05 infection
.95 infarction
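The idea behind this example can be sketched as counting which term fills the blank in each context, then turning the counts into probabilities. The training examples below are hypothetical, and the smoothing scheme is an assumption; the resulting numbers are not meant to reproduce the figures on the slide.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count how often each term fills the blank in each context."""
    counts = defaultdict(Counter)
    for context, term in examples:
        counts[context][term] += 1
    return counts


def probabilities(counts, context, vocabulary, smoothing=1.0):
    """Relative frequencies with additive smoothing, so a term never seen in
    a context still gets a small nonzero probability."""
    total = sum(counts[context].values()) + smoothing * len(vocabulary)
    return {t: (counts[context][t] + smoothing) / total for t in vocabulary}


# Hypothetical training data: (context, term observed in that context).
examples = ([("a severe", "infection")] * 11 + [("a severe", "infarction")]
            + [("a previous myocardial", "infarction")] * 12)

vocabulary = ["infection", "infarction"]
counts = train(examples)
print(probabilities(counts, "a severe", vocabulary))
print(probabilities(counts, "a previous myocardial", vocabulary))
```

Unlike the all-or-nothing rules, the output is graded: the system can say how strongly the context favors one term over the other.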
Machine learning: another example
http://www.nytimes.com/2010/03/09/technology/09translate.html (page A-1, March 9, 2010)
Statistical methods
• Learn automatically from observations of data
• Can have the same kinds of structure as manually written rules
• Require representative data to learn from
• Provide confidence measures
• Can be tuned to balance recall/precision
A view of the state of the art in NLP
Rule-based methods
Rule-based methods
Statistical NLP
Machine learning
Rule-based methods informed by large scale
data analysis
Data
A very recent example: Watson
• “A massively parallel probabilistic evidence-based architecture”
– Exploits manually constructed knowledge
  • Liquid IS-A Fluid (1, WordNet)
– Learns from large volumes of naturally occurring text.
  • Fluid IS-A Liquid (0.7)
• Employs confidence estimation pervasively
– Relates its internal confidence to cost of incorrect answers
• Data driven and evaluation driven: evolved via continuous objective evaluation and an agile, omnivorous approach to development.
A very recent example: Watson
“As our results dramatically improved, we observed that system-level advances allowing rapid integration and evaluation of new ideas and new components against end-to-end metrics were essential to our progress.”(Ferrucci et al., 2010)
Watson’s overarching principles
Massive parallelism: Exploit massive parallelism in the consideration of multiple interpretations and hypotheses.
Many experts: Facilitate the integration, application, and contextual evaluation of a wide range of loosely coupled probabilistic question and content analytics.
Pervasive confidence estimation: No component commits to an answer; all components produce features and associated confidences, scoring different question and content interpretations. An underlying confidence-processing substrate learns how to stack and combine the scores.
Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.
(Ferrucci et al., 2010)
Questions to think about when looking at statistical NLP systems
• What is the quality of the expert knowledge that has gone into the system? (Who are the experts?)
• Where in the system are machine learning techniques employed? (And again, who are the experts?)
• How much data has the system learned from?
• How much variety was there in the data?
• Where and how does the system employ confidence estimation?
Beyond coding for billing…
• Electronic capture and presentation of patient, demographic and clinical information
• Outcomes analysis
• Clinical decision support
• Pharmacovigilance
• Biosurveillance
• Knowledge discovery
Transcription
Payers
NLP Engine RoutingCoder review
Clinical Information Landscape
Clinicians
Researchers
Policy makers
Patients
Traditional coding
Unrestricted physician language
Codes
Data, Information, Knowledge, Wisdom
ORDER_EXAM: MRI Hd wo&w ORDER_IND: head - h/o plasmacytoma^ MR head, without and with IV gadolinium. Comparison is made with previous outside MR head examinations 5/3/04 and 11/16/04. On the earliest outside examination, there was a mass in the right central skull base, extending infratemporal fossa, sphenoidsinus, and foramen ovale. This subsequently was demonstrated to represent a plasmacytoma. This mass is markedly reduced in size on the subsequent outside MR. Our examination continues to show abnormal signal and peripheral enhancement, in the right central skull base, and involving right clivus, right sphenoid, and base of right pterygoid. This is probably stable when compared with 11/16/04, but is considerably smaller than 5/3/04. The infratemporal soft tissue component of the lesion has resolved. No new or progressing bone lesion. Incidental note is made of a small amount of hemosiderin deposition within the cortex of the left parietal operculum without abnormal enhancement. This could represent cryptic vascular malformation, or chronic lacunar infarct. Mild cerebral leukoaraiosis. …
Original Unrestricted Unprocessed
Clinical history:
History of plasmacytoma (head)
Mass in right central skull base
Mass subsequently reduced in size
Findings:
Abnormality in right central skull base…
Mass smaller in size
Mild cerebral leukoaraiosis
Hypotheses:
Cryptic vascular malformation
Chronic lacunar infarct
Adds structure and categories to create units
Data, Information, Knowledge, Wisdom
Clinical history:
History of plasmacytoma (head)
Mass in right central skull base
Mass subsequently reduced in size
Findings:
Abnormality in right central skull base…
Mass smaller in size
Mild cerebral leukoaraiosis
Current medications:
NAME        DOSE     FRQ
melphalan   9 mg     QD
prednisone  100 mg   QD
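As a rough illustration of how "adding structure and categories" might turn a medication mention into name, dose, and frequency fields, here is a minimal sketch. The regular expression, the `Medication` type, and the set of frequency abbreviations are assumptions made for this example; they are not taken from any particular NLP engine, and real clinical text needs far more robust handling.

```python
import re
from typing import NamedTuple, Optional


class Medication(NamedTuple):
    name: str
    dose: str
    frequency: str


# Hypothetical pattern: "<name> <number> mg[/m2] <frequency>".
# Only a few frequency spellings are covered, purely for illustration.
MED_PATTERN = re.compile(
    r"(?P<name>[A-Za-z]+)\s+"
    r"(?P<dose>\d+(?:\.\d+)?\s*mg(?:/m2)?)\s+"
    r"(?P<freq>QD|BID|TID|per day)",
    re.IGNORECASE,
)


def parse_medication(text: str) -> Optional[Medication]:
    """Return the first medication mention found in text, or None."""
    match = MED_PATTERN.search(text)
    if match is None:
        return None
    return Medication(
        name=match.group("name").lower(),
        dose=match.group("dose"),
        frequency=match.group("freq"),
    )


print(parse_medication("Prednisone 100 mg QD"))
print(parse_medication("Melphalan 9 mg/m2 per day"))
```

A pattern this narrow silently misses any phrasing it was not written for, which is one reason the choice between rule-based and statistical methods matters.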
Identifies relationships between units
Clinical history:
History of plasmacytoma (head)
Mass in right central skull base
Mass subsequently reduced in size
Findings:
Abnormality in right central skull base…
Mass smaller in size
Mild cerebral leukoaraiosis
Current medications:
Melphalan 9 mg/m2 per day
Prednisone 100 mg/day
Hypotheses:
Cryptic vascular malformation
Chronic lacunar infarct
plasmacytoma
steroid
cancer
drug
prednisone
disease
treats
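One common way to represent relationships between units is as subject-relation-object triples. The sketch below is only an illustration: the slide shows the concept labels and a "treats" link, so the specific edges listed here (including every `is_a` edge) are assumptions made for the example, and the helper names `ancestors` and `treated_by` are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Edges assumed for illustration; the slide gives only the node labels
# and the relation name "treats".
TRIPLES: List[Triple] = [
    ("plasmacytoma", "is_a", "cancer"),
    ("cancer", "is_a", "disease"),
    ("prednisone", "is_a", "steroid"),
    ("steroid", "is_a", "drug"),
    ("prednisone", "treats", "plasmacytoma"),
]


def ancestors(concept: str) -> Set[str]:
    """Follow is_a edges upward and collect every broader category."""
    found: Set[str] = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subj, rel, obj in TRIPLES:
            if subj == current and rel == "is_a" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def treated_by(disease: str) -> Set[str]:
    """Return every subject linked to the disease by a treats edge."""
    return {s for s, rel, o in TRIPLES if rel == "treats" and o == disease}


print(sorted(ancestors("prednisone")))
print(treated_by("plasmacytoma"))
```

With relationships stored this way, a question such as "which drugs treat this patient's disease?" becomes a lookup over extracted information rather than a rereading of the report.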
Data, Information, Knowledge, Wisdom
Ability to make good choices
Clinical history:
History of plasmacytoma (head)
Mass in right central skull base
Mass subsequently reduced in size
Findings:
Abnormality in right central skull base…
Mass smaller in size
Mild cerebral leukoaraiosis
Current medications:
Melphalan 9 mg/m2 per day
Prednisone 100 mg/day
Hypotheses:
Cryptic vascular malformation
Chronic lacunar infarct
plasmacytoma
steroid
cancer
drug
prednisone
disease
treats
Individual clinical expertise
Best external evidence
Patient values and expectations
Data, Information, Knowledge, Wisdom
Clinical history:
History of plasmacytoma (head)
Mass in right central skull base
Mass subsequently reduced in size
Findings:
Abnormality in right central skull base…
Mass smaller in size
Mild cerebral leukoaraiosis
Current medications:
Melphalan 9 mg/m2 per day
Prednisone 100 mg/day
plasmacytoma
steroid
cancer
drug
prednisone
disease
treats
Hypotheses:
Cryptic vascular malformation
Chronic lacunar infarct
Information
• The process of knowledge discovery is a natural cycle
• At every iteration, information emerges from data by structuring and categorizing the data according to what we know now
• As we improve our knowledge, those structures and categories change
Data
Knowledge
Data
Information
Transcription
Clinicians
Researchers
Policy makers
Patients
Unrestricted physician language
Data
Clinical history:
History of plasmacytoma (head)
Mass in right central skull base
Mass subsequently reduced in size
Findings:
Abnormality in right central skull base…
Mass smaller in size
Mild cerebral leukoaraiosis
Current medications:
Melphalan 9 mg/m2 per day
Prednisone 100 mg/day
plasmacytoma
steroid
cancer
drug
prednisone
disease
treats
Hypotheses:
Cryptic vascular malformation
Chronic lacunar infarct
Information
Knowledge
Data
What happens if physicians enter structured information directly, instead of the original data?
How we transform data into information depends on our current state of knowledge.
Transcription
Clinicians
Researchers
Policy makers
Patients
Clinical history:
History of plasmacytoma (head)
Mass in right central skull base
Mass subsequently reduced in size
Findings:
Abnormality in right central skull base…
Mass smaller in size
Mild cerebral leukoaraiosis
Current medications:
Melphalan 9 mg/m2 per day
Prednisone 100 mg/day
plasmacytoma
steroid
cancer
drug
prednisone
disease
treats
Hypotheses:
Cryptic vascular malformation
Chronic lacunar infarct
Information
Knowledge
The full clinical narrative never comes into existence.
Potentially relevant information is lost forever.
The knowledge discovery cycle is broken.
Clinical history:
History of plasmacytoma (head)
Mass in right central skull base
Mass subsequently reduced in size
Findings:
Abnormality in right central skull base…
Mass smaller in size
Mild cerebral leukoaraiosis
Current medications:
NAME        DOSE     FRQ
melphalan   9 mg     QD
prednisone  100 mg   QD
Mr. John Doe was seen in our office today in follow up of his paroxysmal atrial fibrillation. . . . He recently called our office in February stating he was back in atrial fibrillation which was documented on electrocardiogram. I elected to increase his Betapace to 160 mg twice a day and he did convert back to normal sinus rhythm. We had recommended Coumadin to him at that time but he did not start any Coumadin. He has done well since with no recurrence of arrhythmia and he is acutely aware of when he goes into the fibrillation. . . . He seems to be doing well on the increased dose of Betapace 160 mg twice a day. I told him he should take a daily baby aspirin and also that if he has recurrent episodes of fibrillation, he needs to let us know because I think he would need to be on Coumadin anticoagulation and may need an adjustment in his antiarrhythmic regimen.
If the full clinical narrative never comes into existence…
There is clear evidence that this patient’s self-reports are trustworthy and relevant. In your thinking on his clinical care, you should make sure to pay attention to them.
Here’s the reasoning connected to my recommendation of Coumadin, the status of that recommendation, and the circumstances under which I think the recommendation should be revisited.
…the knowledge discovery cycle is broken.
NEW YORK (Reuters Health), Jun 19 - Unlike adults with intracranial aneurysms, most intracranial arterial aneurysms in pediatric patients are idiopathic and are associated with no known risk factors for vascular disease, investigators reported at the American Society of Neuroradiology's annual meeting in Chicago.
"Our study suggests that -- unlike the adult disease -- childhood aneurysms may be driven by unique predisposing factors that we have not yet identified. It could have much less to do with underlying conditions commonly thought to contribute to their development," presenter Dr. Todd Abruzzo told Reuters Health. …
Dr. Abruzzo, an interventional neuroradiologist at the University of Cincinnati in Ohio, and associates conducted a review of records from three tertiary referral hospitals between 1993 and 2006. …
If the full clinical narrative never comes into existence…
• What if the true risk factors we are looking for are not part of any EHR’s structured nomenclature?
• What if the relevant factors are expressed in current clinical narratives using non-clinical terminology?
• If physicians enter nomenclature directly, instead of the full narrative, how will we ever know what information we have lost?
• Without the original data, we can never reanalyze physicians’ observations in the light of new knowledge and new categories.
The knowledge discovery cycle is broken
• There are good reasons to standardize the representations in health information systems (interoperability, data mining, etc.)
but
• The data-information-knowledge-wisdom analysis forces us to ask what will be lost, if clinicians’ input language is standardized.
A dilemma
• Recognize that standardized representations are different from standardized input.
• Allow clinicians to express the clinical narrative in all its richness and nuance, through natural dictation.
• Transform their natural language into representations that permit standardization and interoperability.
NLP as a solution
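The point about keeping the original narrative can be sketched in a few lines: if the raw text is retained, it can be re-processed whenever the categories change. This is a toy illustration only. The `categorize` function and both lexicons are hypothetical, and simple substring matching stands in for whatever a real NLP engine would do.

```python
from typing import Dict, List


def categorize(narrative: str, lexicon: Dict[str, str]) -> Dict[str, List[str]]:
    """Group the lexicon terms that occur in the narrative by category."""
    text = narrative.lower()
    found: Dict[str, List[str]] = {}
    for term, category in lexicon.items():
        if term.lower() in text:
            found.setdefault(category, []).append(term)
    return found


# The retained original data (an excerpt of the report shown earlier).
narrative = (
    "This could represent cryptic vascular malformation, "
    "or chronic lacunar infarct. Mild cerebral leukoaraiosis."
)

# Hypothetical lexicons standing in for two successive states of knowledge.
lexicon_v1 = {"lacunar infarct": "hypothesis"}
lexicon_v2 = {
    **lexicon_v1,
    "leukoaraiosis": "finding",
    "vascular malformation": "hypothesis",
}

first_pass = categorize(narrative, lexicon_v1)
second_pass = categorize(narrative, lexicon_v2)
print(first_pass)
print(second_pass)
```

The second pass recovers information the first pass could not express. Had only the first pass's structured output been stored, the mention of leukoaraiosis would not be available to reanalyze.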
NLP Engine
Clinicians
Researchers
Policy makers
Patients
plasmacytoma
steroid
cancer
drug
prednisone
disease
treats
Transcription
Clinical history:
History of plasmacytoma (head)
Mass in right central skull base
Mass subsequently reduced in size
Findings:
Abnormality in right central skull base…
Mass smaller in size
Mild cerebral leukoaraiosis
Current medications:
Melphalan 9 mg/m2 per day
Prednisone 100 mg/day
Physicians focus on the care of the patient and communicate unimpeded, full, narrative clinical data.
Informed by the best current knowledge and data, language technology transforms clinical language into standardized, interoperable, available information.
Both health information technology and medical communities of practice inform, and are informed by, evolving medical knowledge.
Overview
• Personal background
• Historical perspective and high level overview
• Elements of NLP for computer assisted coding
• Confidence estimation and “auto-coding”
• Evaluating how well a system is doing
• Understanding rule-based and statistical methods
• The importance of natural language input
• Where NLP is headed
• Some key take-aways
Using text to predict the real world
O’Connor, B.; Balasubramanyan, R.; Routledge, B. R.; Smith, N. A. 2010. From tweets to polls: linking text sentiment to public opinion time series. Proc. ICWSM pp. 122-129.
Topics and agendas
05_03_02.txt.0002 BEGALA Good evening. Welcome to CROSSFIRE, coming to you live from the George Washington University in beautiful downtown Washington, D.C. Tonight in the CROSSFIRE, the case of the Reverend Paul Shanley, the Roman Catholic priest facing child rape charges in Massachusetts. Should his superiors be held responsible? Also, Matt Drudge, founder of the Internet "Drudge Report." Is he a right-wing muckraker, an Internet gossip or a legitimate journalist? We'll ask Drudge himself when we get him in the CROSSFIRE. First, flying the not-so-friendly skies, would you feel safer if pilots were armed? One outspoken congressional critic is against having guns in the cockpit. We're going to introduce her now. Please welcome, Eleanor Holmes Norton, the Democratic delegate from the District of Columbia. Ms. Norton, thank you. Welcome back.
05_03_02.txt.0003 CARLSON Now, Ms. Norton, the majority, the vast majority of commercial airline pilots are strongly in favor of carrying guns in the cockpit on commercial airliners. You're against it. What do you as a delegate know about operating a commercial airliner that the majority of commercial airline pilots don't know
05_03_02.txt.0004 DELEGATE Well, I know what Transportation Secretary Norm Mineta tells me, and I know what Homeland Security Adviser Tom Ridge tells me, and they are against it. And I think the reason they are against it is you don't want the guy who's flying one of these big busters up there also with a gun in his hand trying to protect his plane. You want air marshals to do that. You want flight attendants to understand how to protect the cockpit. And you want the redundancies that we have built in, redundancy after redundancy, working for you. We are panicking the American people. They say, oh my God, I thought they had the hearings, I thought they did that. Here come the pilots saying, oh no, they haven't. We've got to have guns.
Can word frequencies tell you about the topics in a set of documents?
Topics and agendas
Looking at just word counts often gives you a mish-mash.
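To see why raw word counts tend to be a mish-mash, here is a minimal sketch that counts words in one sentence from the CROSSFIRE excerpt. The tokenizer (a simple regular expression) and the tiny stop list are assumptions made for illustration, not a standard resource or the method used later in this tutorial.

```python
import re
from collections import Counter

excerpt = (
    "Now, Ms. Norton, the majority, the vast majority of commercial airline "
    "pilots are strongly in favor of carrying guns in the cockpit on "
    "commercial airliners."
)


def word_counts(text: str) -> Counter:
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Tiny illustrative stop list (an assumption, chosen by hand for this sentence).
STOP_WORDS = {"the", "of", "in", "on", "are", "now", "ms"}

raw = word_counts(excerpt)
content = Counter({w: c for w, c in raw.items() if w not in STOP_WORDS})

print(raw.most_common(1))  # [('the', 3)]
print(content["majority"], content["commercial"], content["guns"])
```

Even in a single sentence, the most frequent word is a function word, and topical words such as "guns" and "cockpit" appear only once. Filtering helps a little, but counts alone still say nothing about which words belong together in a topic.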
Topics and agendas
Bayesian topic models* discover the distinct topics interwoven in documents.
*Wikipedia: Topic Model; Blei et al. 2003
[Figure: values shown on the slide: .03, .44, .00, .11]
Topics and agendas
Any part of the conversation can be viewed as a mixture of topics.
Well, I know what Transportation Secretary Norm Mineta tells me, and I know what Homeland Security Adviser Tom Ridge tells me, and they are against it. And I think the reason they are against it is you don't want the guy who's flying one of these big busters up there also with a gun in his hand trying to protect his plane. You want air marshals to do that. You want flight attendants to understand how to protect the cockpit. And you want the redundancies that we have built in, redundancy after redundancy, working for you. We are panicking the American people. They say, oh my God, I thought they had the hearings, I thought they did that. Here come the pilots saying, oh no, they haven't. We've got to have guns.
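The idea of a passage as a mixture of topics can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the three topics and their word probabilities in `TOPICS` are hand-written and hypothetical, and the mixture weights are fitted by plain maximum-likelihood EM over those fixed topics. A Bayesian topic model of the kind cited above would instead learn the topics themselves from a document collection.

```python
import re
from collections import Counter

# HYPOTHETICAL topics: hand-written word probabilities, for illustration only.
# A real topic model would learn these from a large collection of documents.
TOPICS = {
    "aviation": {"pilots": 0.3, "cockpit": 0.3, "plane": 0.2, "flight": 0.2},
    "security": {"gun": 0.3, "guns": 0.3, "marshals": 0.2, "protect": 0.2},
    "politics": {"secretary": 0.3, "hearings": 0.3, "american": 0.2, "people": 0.2},
}
SMOOTH = 1e-6  # small probability for words a topic does not list


def topic_mixture(text, topics=TOPICS, iterations=50):
    """Estimate mixture weights over fixed topics with EM (maximum likelihood)."""
    words = re.findall(r"[a-z']+", text.lower())
    # Keep only words that at least one topic knows about.
    vocab = set().union(*topics.values())
    counts = Counter(w for w in words if w in vocab)
    names = list(topics)
    theta = {k: 1.0 / len(names) for k in names}  # start from uniform weights
    total = sum(counts.values())
    if total == 0:
        return theta
    for _ in range(iterations):
        expected = dict.fromkeys(names, 0.0)
        for w, n in counts.items():
            # E-step: share each word's count among topics by responsibility.
            joint = {k: theta[k] * topics[k].get(w, SMOOTH) for k in names}
            z = sum(joint.values())
            for k in names:
                expected[k] += n * joint[k] / z
        # M-step: new weights are the normalized expected counts.
        theta = {k: expected[k] / total for k in names}
    return theta


# Sentences taken from the passage above (not the full turn).
passage = (
    "You don't want the guy who's flying one of these big busters up there "
    "also with a gun in his hand trying to protect his plane. You want air "
    "marshals to do that. You want flight attendants to understand how to "
    "protect the cockpit. We are panicking the American people."
)

mix = topic_mixture(passage)
for name, weight in mix.items():
    print(f"{name:10s} {weight:.2f}")
```

With these made-up topics, the "security" topic receives the largest weight for the excerpt, with "aviation" and "politics" sharing the rest. The weights depend entirely on the assumed word lists, so they should not be read as results from the tutorial.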
Thoughts on where NLP is headed
• Watson’s key lessons
– Bringing together data-driven methods with knowledge
– Pursuing paths in parallel and combining evidence
– Constant data-driven assessment/evaluation
– Pervasive confidence estimation
• Supervised learning methods
– Using human choices/behavior as basis for prediction
• Semi-supervised learning methods
– Taking additional advantage of raw data
• Unsupervised methods
– Discovering structure in text and what it reveals about the real world
Some references (general)
• Philip Resnik and Jimmy Lin, “Evaluation of NLP Systems,” in Alex Clark, Chris Fox, and Shalom Lappin (eds.), Handbook of Computational Linguistics and Natural Language Processing, Wiley Blackwell, June 2010.
• Philip Resnik, “Word Sense Ambiguity,” International Encyclopedia of Linguistics, 2nd edition, William J. Frawley (ed.), Oxford University Press: Oxford, England, 2003.
Some references (healthcare)
• Philip Resnik, Michael Niv, Michael Nossal, Andrew Kapit, and Richard Toren, “Communication of Clinically Relevant Information in Electronic Health Records: A Comparison between Structured Data and Unrestricted Physician Language,” Perspectives in Health Information Management: Computer Assisted Coding Conference Proceedings, AHIMA, Fall 2008.
• Philip Resnik, Michael Niv, Michael Nossal, Gregory Schnitzer, Jean Stoner, Andrew Kapit, and Richard Toren, “Using Intrinsic and Extrinsic Metrics to Evaluate Accuracy and Facilitation in Computer Assisted Coding,” Perspectives in Health Information Management: Computer Assisted Coding Conference Proceedings, AHIMA, Fall 2006.
• Yuankai Jiang, Michael Nossal, and Philip Resnik, “How Does the System Know It's Right? Automated Confidence Assessment for Compliant Coding,” Perspectives in Health Information Management: Computer Assisted Coding Conference Proceedings, AHIMA, Fall 2006.
• Michael Nossal, Philip Resnik, and Jean Stoner, “Assessing Coder Change Rates as an Evaluation Metric,” Perspectives in Health Information Management: Computer Assisted Coding Conference Proceedings, AHIMA, Fall 2006.