This webinar is being recorded
Natural Language Processing and Machine Learning: Beyond the Hype
A Pistoia Alliance Debates Webinar
Moderated by David Milward, Linguamatics
September 14, 2017
Poll Question 1: What role do you play in your company?
A. IT
B. Data scientist/bioinformatician
C. Clinical/bench scientist
D. Information professional
E. Other
© Pistoia Alliance
The Panel
David Milward, Ph.D., CTO, Linguamatics
David Milward is chief technology officer (CTO) at Linguamatics. He is a pioneer of interactive text mining and a founder of Linguamatics. He has over 20 years of experience in product development, consultancy and research in natural language processing (NLP). After receiving a PhD from the University of Cambridge, he was a researcher and lecturer at the University of Edinburgh. He has published in the areas of information extraction, spoken dialogue, parsing, syntax and semantics.
Chengyi Zheng, Ph.D., NLP Specialist, Kaiser Permanente
Chengyi Zheng, PhD, is an NLP specialist at Kaiser Permanente Southern California. He has worked on over 30 research projects using electronic health record (EHR) data from millions of patients. He is the principal investigator of a CDC-funded study involving 5 health care institutions on using NLP in vaccine safety studies. He was the winner of the Kaiser Permanente predictive modeling competition. He ranked first in the innovation competition (InnoCentive@Lilly) while serving as a biomedical informatics scientist at Eli Lilly. He was trained in computer science with a concentration in speech recognition. He will share some experiences of using NLP and machine learning on EHR data for outcomes prediction.
Eugene Myshkin, Ph.D., Senior Research Scientist, Clarivate
Eugene Myshkin, PhD, is a senior scientist in bioinformatics at Clarivate Analytics. He has over 15 years of experience in drug discovery, cheminformatics and bioinformatics. He has also been involved in a number of text mining projects, including mining of chemical reagents and antibodies from scientific literature.
Agenda
• AI, NLP and ML (David)
• Using NLP and ML in clinical research (Chengyi)
• Network and pathway driven machine learning approaches to biomarker discovery and patient stratification (Eugene)
NLP, AI and Machine Learning
David Milward, PhD
CTO, Linguamatics
2017
Overview
AI (Artificial Intelligence)
NLP (Natural Language Processing)
− and its applications in life sciences
ML (Machine Learning) and DL (Deep Learning)
NLP to feed ML-based DS (Decision Support)
ML in NLP
[Diagram labels: AI, ML, DL, NLP, DS]
© 2017 Linguamatics
Artificial Intelligence (AI)
Artificial intelligence is intelligence exhibited by machines
The central goals of AI research include reasoning, knowledge, planning, learning, natural language processing (communication), perception and the ability to move and manipulate objects
As machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, leading to the quip “AI is whatever hasn't been done yet”
Wikipedia
Natural Language Processing (NLP)
Processing of natural languages e.g. English, French, Chinese by computers
NLP is part of AI, but also key to other areas of AI e.g. providing decision support
− If 80% of knowledge is unstructured we need NLP to get the right information to provide good suggestions
− Currently many AI projects are limited: they can only address questions where there is structured data
− Worse, they often use inappropriate structured data such as ICD billing codes for non-billing tasks
Find information however it is expressed
Different word, same meaning
cyclosporine
ciclosporin
Neoral
Sandimmune
Different expression, same meaning
Non-smoker
Does not smoke
Does not drink or smoke
Denies tobacco use
Different grammar, same meaning
5mg/kg of cyclosporine daily
5mg/kg/d of cyclosporine
cyclosporine 5mg/kg/day
Same word, different context
Diagnosed with diabetes
Family history of diabetes
No family history of diabetes
NLP
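A minimal Python sketch of the "same word, different context" case above. The cue lists and the function name classify_mention are hypothetical, chosen only for illustration; they do not describe how I2E or any other product works, and real clinical text needs far wider coverage.

```python
import re

# Hypothetical cue lists, for illustration only.
NEGATION = re.compile(r"\b(no|denies|does not|without)\b", re.IGNORECASE)
FAMILY = re.compile(r"\bfamily history\b", re.IGNORECASE)


def classify_mention(sentence: str, concept: str) -> str:
    """Label a concept mention by the context that precedes it."""
    text = sentence.lower()
    target = concept.lower()
    if target not in text:
        return "not mentioned"
    # Only the words before the concept are inspected for context cues.
    prefix = text[: text.index(target)]
    negated = bool(NEGATION.search(prefix))
    family = bool(FAMILY.search(prefix))
    if family:
        return "no family history" if negated else "family history"
    return "negated" if negated else "patient"


for s in ("Diagnosed with diabetes",
          "Family history of diabetes",
          "No family history of diabetes"):
    print(s, "->", classify_mention(s, "diabetes"))
```

The same word "diabetes" ends up with three different labels, which is the distinction a keyword search cannot make.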
Represent it in a standardized form
Concept        Text                                      Normalized Value
Diseases       breast cancer; carcinoma of the breast    Breast Neoplasm
Genes          Raf-1; Raf I                              RAF1
Dates          27th Feb 2014; 2014/02/27                 20140227
Measurements   0.2g; Two hundred milligrams              200 mg
Mutations      Val 158 Met; Val by Met at codon 158      V158M
Relation example: "nimesulide, a selective COX2 inhibitor, …" gives nimesulide inhibits COX2 (Entrez Gene ID: 5743)
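The normalizations in the table can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the method used by any product: the function names are hypothetical, only the spellings shown on the slide are handled, and spelled-out numbers such as "Two hundred milligrams" are deliberately left out.

```python
import re
from datetime import datetime

# Standard three-letter to one-letter amino acid codes.
AA = {"Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
      "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
      "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
      "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V"}


def normalize_date(text):
    """Return YYYYMMDD for a few common date spellings, else None."""
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in ("%d %b %Y", "%d %B %Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return None


def normalize_mass_mg(text):
    """Convert '<number>g' or '<number> mg' to a string in milligrams."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(mg|g)\s*", text)
    if not m:
        return None
    value = float(m.group(1)) * (1000 if m.group(2) == "g" else 1)
    return f"{value:g} mg"


def normalize_mutation(text):
    """Convert 'Val 158 Met' or 'Val by Met at codon 158' to 'V158M'."""
    m = re.fullmatch(r"(\w{3}) (\d+) (\w{3})", text)
    if m:
        ref, pos, alt = m.groups()
    else:
        m = re.fullmatch(r"(\w{3}) by (\w{3}) at codon (\d+)", text)
        if not m:
            return None
        ref, alt, pos = m.groups()
    if ref not in AA or alt not in AA:
        return None
    return f"{AA[ref]}{pos}{AA[alt]}"
```

Each function returns None for input it does not recognize, so unhandled spellings surface for review instead of being silently mis-normalized.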
From Bench to Bedside: NLP Provides Insight
Pipeline: Idea, Basic research, Phase 1, Phase 2, Phase 3 clinical trials, Regulatory approval, Patient care
Stages: Discovery, Development, Delivery
Business critical questions
What targets are involved in bone cancer?
What companies are patenting a particular technology?
What are the safety risks of my drug?
Where can I site my Phase 1, Phase 3 clinical study?
What are the clinical risks for my patients?
Direct access to the Unstructured
Weight ≥ 80kg
Below 60 years old
Reports after 2010
With mutation C677T
Cancer patients
Machine Learning
Machine Learning is used for AI in general and as a technique within NLP
3 main flavours:
− Supervised
− uses annotated data mapping between inputs and outputs
− Semi-supervised
− uses machine analysis but incorporates a human in the loop
− Unsupervised
− uses unannotated data, usually at very large scale.
Recent successes with deep learning approaches based on neural networks for supervised and unsupervised ML e.g.
− Machine translation using parallel corpora
− Image classification in medicine
Using NLP to feed other AI
NLP provides access to the 80% of information in unstructured text
Provides a set of potential features to be used in e.g. ML models for Decision Support
Example: building risk models from RWD sets
− Predicting patients at risk of misusing opioid prescription drugs (AMIA November 2017)
− Features extracted by Linguamatics I2E from 8.9 million de-identified medical record full-text transcripts from RealHealthData
− SVM classifier trained on the features to flag patients at risk
Machine Learning in NLP
Supervised ML
− Requires large-scale, representative annotated documents
− Main paradigm for core NLP components
− For extraction patterns, used in academic systems but less commonly in commercial
Semi-supervised ML
− Useful for new tasks or data sets where no existing representative annotated data
− Useful where a task is initially ill-defined
− Puts a human in the loop judging suggestions from the machine learning
− Can provide good quality results quickly e.g. to test whether a feature extracted by NLP is useful for a ML model
Unsupervised ML
− Uses large-scale unannotated data
− Key example is learning the meaning of a word via the context it keeps (word embeddings)
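To make "the context it keeps" concrete, here is a small count-based sketch in Python. It is an assumption-laden stand-in for word embeddings: real embeddings are learned by neural models on very large corpora, whereas this toy version just counts neighbouring words and compares the counts with cosine similarity. The corpus and function names are invented for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy corpus (hypothetical sentences).
corpus = [
    "patient started cyclosporine 5mg daily",
    "patient started ciclosporin 5mg daily",
    "patient denies tobacco use",
]
vecs = context_vectors(corpus)
print(cosine(vecs["cyclosporine"], vecs["ciclosporin"]))
print(cosine(vecs["cyclosporine"], vecs["tobacco"]))
```

The two spellings of the drug never co-occur, yet they score as near-identical because they appear in the same contexts; no annotation was needed.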
Semi-Supervised ML Approaches
Similar distributions for words and syntactic constructions
Automatically discover what is in the data using an interactive, agile text mining platform such as Linguamatics I2E
A long tail of infrequent cases
− prioritize the more frequent constructions
− generalize to cover items in the tail
Zipf’s Law: the frequency of any word is inversely proportional to its rank in the frequency table
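A short Python sketch of the rank-frequency view behind Zipf's law and the "long tail". The token list is a made-up, idealized example in which rank times count is constant; real text only approximates this. The function names are hypothetical.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent first."""
    counts = Counter(tokens).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(counts, start=1)]


def tail_share(tokens, top_n):
    """Fraction of all occurrences NOT covered by the top_n most frequent words."""
    table = rank_frequency(tokens)
    total = sum(count for _, _, count in table)
    covered = sum(count for _, _, count in table[:top_n])
    return 1 - covered / total


# Idealized Zipfian toy data: counts 12, 6, 4, 3 for ranks 1 to 4.
tokens = ["pain"] * 12 + ["swelling"] * 6 + ["erythema"] * 4 + ["podagra"] * 3
for rank, word, count in rank_frequency(tokens):
    print(rank, word, count, rank * count)
print(tail_share(tokens, top_n=2))
```

Covering the two most frequent items handles most occurrences, and the remaining share is the tail that generalization has to reach.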
Semi-Supervised NLP using Linguamatics I2E
Summary
NLP is critical to success of many ML projects
− access to the unstructured text is key to using ML widely, not just where there is convenient structured data
Semi-supervised approaches to NLP provide an efficient way to capture features for ML projects
Poll Question 2: What is your company’s primary use for NLP?
A. Early Discovery/ Pre-clinical
B. Clinical
C. Real world data
D. Other
E. Don’t use NLP
Using NLP and ML in clinical research
Chengyi Zheng, PhD, MS
DEPARTMENT of Research & Evaluation
Two weeks of records for one patient in an EMR system (10/6/2012 to 10/19/2012)
10/6/2012: Imaging: DEXA Bone density
10/7/2012: Pt called
10/7/2012: Nurse Called Back
10/8/2012: Orthopedic office visit. Where: Medical Center, Department
10/8/2012: Progress Notes: Reason for visit: Knee Pain. Vital Sign/BMI/Pain level/History. PE/Findings/Impression/A&P. Dx: ICD-9 code. Nurse Exam Note: …
10/9/2012: Lab
10/10/2012: Pre-op dental exam (ext)
10/10/2012: Surgery Scheduled
10/11/2012: Office visit
10/11/2012: Rx Prescribed
10/11/2012: Office Visit: Sinus Congestion, Ankle itchy. Dx: 401.9 Essential Hypertension, 274.9 Gout, 461.9 Acute Sinusitis
10/12/2012: Picked up the Rx
10/13/2012: Pt missed appt.
10/13/2012: Telephone Consult: Healthy bones PN
10/14/2012: Pt emailed: Drug adverse event
10/14/2012: Pt called: cancerous area
10/15/2012: EKG. Dx: Screening
10/15/2012: Ear Wax Wash
10/16/2012: Procedure: Remove Skin
10/16/2012 - 10/19/2012: Hospitalization
10/18/2012: Pathology Report Out
5 Ws: What, Who, When, Where and Why
Membership length: 70% > 5 years, 50% > 10 years.
5 Ws: What, Who, When, Where and Why
What
– What is the reason for the visit?
– What happened? (pain after a fall, pain after drinking a beer?)
Who
– Who is the caregiver? (primary physician, rheumatologist?)
– What do we know about this patient? (age, race, past medical history, etc.)
Where
– Where did this visit occur?
When
– When did the problem start?
Why
– Why did this problem happen? Possible causes?
Visual representation of KPSC research databases
Case study: Identify acute gout flare
Published methods to identify gout flares using claims data
– Clinical coding is unreliable: under-coding, over-coding, too general
– Medication is unreliable:
Drugs for gout maintenance
Drugs also used for other diseases (which share similar symptoms)
NLP has been used to:
– Identify study population and patient information
– Identify and extract clinical variables (genetic, biopsy, radiology)
– Evaluate patient status (disease progression, medication status)
Solution and challenges (NLP)
Challenges:
– Gout is a chronic disease which can be controlled but not cured
Signs and symptoms could appear in follow-up visits
Differentiate between acute and chronic status
– Gout population is generally old, with comorbidities that share similar symptoms
100+ types of arthritis (> 50 million people)
Joint pain, erythema, and swelling
– Information documented varies between clinical notes
Standard solutions:
– Each search query captures one set of information
– Each search query has its own sensitivity/specificity etc.
– Logic operator combines search results (union, join, etc.)
Difficult to optimize on the overall sensitivity/specificity etc.
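A toy Python sketch of the "standard solution": each query returns a set of note IDs, and logic operators combine them. All IDs, query names and the helper function are hypothetical; the point is only to show why tuning per-query performance does not directly tune the combined result.

```python
def sensitivity_specificity(predicted, truth, universe):
    """Compute (sensitivity, specificity) for a set of predicted positives."""
    tp = len(predicted & truth)
    fn = len(truth - predicted)
    fp = len(predicted - truth)
    tn = len(universe - predicted - truth)
    return tp / (tp + fn), tn / (tn + fp)


# Ten hypothetical note IDs, four of which truly describe a gout flare.
notes = set(range(1, 11))
true_flares = {1, 2, 3, 4}

# Two hypothetical search queries, each capturing one kind of evidence.
query_symptoms = {1, 2, 3, 5, 6}   # e.g. joint pain or swelling terms
query_explicit = {1, 4, 7}         # e.g. explicit flare mentions

union = query_symptoms | query_explicit
intersection = query_symptoms & query_explicit

for name, result in (("symptoms", query_symptoms), ("explicit", query_explicit),
                     ("union", union), ("intersection", intersection)):
    sens, spec = sensitivity_specificity(result, true_flares, notes)
    print(f"{name:12s} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy data the union finds every flare but lets in more false positives, while the intersection is clean but misses most flares; neither operator gives a handle for optimizing both measures at once.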
Mining vs. NLP & ML in clinical research
Steps:
1. Preliminary analysis, estimate feasibility
2. Develop plan, estimate cost
3. Seek permit (government vs. IRB)
4. Mine (mining equipment vs. NLP)
– Focus on completeness (high sensitivity)
– Shallow & deep mining (good specificity)
5. Refine (chemical process vs. ML)
– Improve purity (higher specificity)
6. Manual verification (optional)
7. Deliver to customer
“art and science combined”
“resource-heavy and time-consuming process”
Solution and challenges (NLP+ML)
Goal:
NLP focuses on sensitivity or information completeness
– Separate ores from rock
ML focuses on improving the specificity
– Improve purity without much loss of sensitivity
Solution:
NLP results as input features to the ML system
– Identify related signs and symptoms
– Identify temporal relationship (when and how long?)
– Identify disease association (related to any other disease?)
– Identify implicit and explicit mention of gout flare
– Identify treatment plan associated with disease onset
Overview of the system development steps
Study period: 1/1/2007 to 12/31/2010. Patients > 18 years, with a diagnosis of gout and on urate-lowering therapy. Within [-3,+12] months of index date, 599,317 clinical notes for 16,519 patients.
Overview of the NLP+ML system
Performance comparisons
Clinical note level gout flare identification (%)
                   Sensitivity  Specificity  PPV    NPV
Rheumatologist 1   81.1         95.4         88.3   92.2
Rheumatologist 2   90.9         97.3         93     96.5
NLP+ML             84.8         92.2         81.1   93.9

Identify patients with ≥ 1 gout flares (%)
                   Sensitivity  Specificity  PPV    NPV
Rheumatologist 1   98.5         92.9         97.1   96.3
Rheumatologist 2   97.1         92.9         97.1   92.9
NLP+ML             98.5         96.4         98.5   96.4
ICD-9              88.2         89.3         95.2   75.8

Identify patients with ≥ 3 gout flares (refractory gout) (%)
                   Sensitivity  Specificity  PPV    NPV
Rheumatologist 1   74.2         92.3         82.1   88.2
Rheumatologist 2   83.9         95.4         89.7   92.5
NLP+ML             93.5         84.6         74.4   96.5
ICD-9              41.9         95.4         81.3   77.5
Results
Note level (gout flare, n= 599,317):
– NLP: 49,415 positive cases => ML: 18,869 positive cases
Patient level (with ≥ 3 flares, n=16,519):
– Number of patients: 1,402 (NLP+ML) vs. 516 (Claim)
– Sensitivity: 93.5% (NLP+ML) vs. 41.9% (Claim)
Impact:
– Identify refractory disease patients
– Estimate market size (KPSC / US population = 4.5/325 million =1.4%)
– Better disease management, improve quality of life, and help reduce healthcare resource use.
1,402 patients is more manageable than 16,519 patients
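The ratios behind these results can be checked with a few lines of Python. Only figures quoted on the slide are used; the variable names are invented for illustration.

```python
# Figures quoted on the Results slide.
nlp_positive_notes = 49_415
ml_positive_notes = 18_869
patients_total = 16_519
patients_refractory = 1_402
kpsc_members_millions = 4.5
us_population_millions = 325

# Share of NLP-positive notes that the ML step kept as positive.
ml_kept = ml_positive_notes / nlp_positive_notes
# Share of the study population flagged with >= 3 flares by NLP+ML.
refractory_share = patients_refractory / patients_total
# KPSC membership as a share of the US population.
kpsc_share = kpsc_members_millions / us_population_millions

print(f"Notes kept by ML after NLP: {ml_kept:.1%}")
print(f"Patients with >= 3 flares:  {refractory_share:.1%}")
print(f"KPSC share of US population: {kpsc_share:.1%}")
```

The ML step keeps well under half of the NLP candidates, which is the "improve purity" stage of the mining analogy.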
ML in healthcare
Tremendous opportunities
Prediction: high utilizers, risk scores
Identification: cases, outcomes, social needs
Image recognition: pathology and radiology images
– Challenges (Data)
Data quality: dirty, missing data
Heterogeneous data: different systems
Structured, semi-structured and free text data
Image, scanned documents
Genetic and biobank data
– Challenges (People)
Who understands NLP, ML and healthcare
Who understands the complexity of healthcare data
Poll Question 3: How does your company primarily use machine learning in drug discovery?
A. Target prediction and repositioning
B. Biomarker discovery
C. Patient stratification
D. Other
E. We don’t use machine learning
Network and pathway driven machine learning approaches to biomarker discovery and patient stratification
Eugene Myshkin, PhD
September 2017
CLARIVATE ANALYTICS TEXT MINING
• Clarivate Analytics literature data feed
• Comprehensive coverage
– >20,000 journals
– Journal content mirrors: Current Contents; Web of Science; Biosis; International Pharmaceutical Abstracts; Derwent Drug File
– http://ip-science.thomsonreuters.com/cgi-bin/jrnlst/jlresults.cgi?PC=MASTER
• Latest information
– Updated with over 170,921 articles/month, or 2,051,051+ articles/year
• Full text, cover to cover searching of all journals
• Comprehensive synonym collections
• Controlled vocabulary management software to support mining
CLARIVATE ANALYTICS LIFE SCIENCES SOLUTIONS
Pharmacovigilance Literature Monitoring
Biological and Chemical Reagent Monitoring
Concepts in social media
Automated Curation of Clinical Data
Protein and Gene Variant Monitoring
USING NLP FOR MANUSCRIPT MATCHING
Analyze citation connections to place the publication in the right journal
PITFALLS OF NLP FEATURES FOR ML
• 1-10 million features
• Feature vectors are binary and sparse
• Feature redundancy
• Feature selection takes a long time
FOCUS OF DRUG DISCOVERY: DRUG, TARGET, DISEASE
These associations can be obtained with NLP, but precision is a problem: a flood of false positives, and the need to hire people just to sort the true alerts from the false ones.
METABASE MANUALLY ANNOTATED CONTENT
• Manual annotation from publications
• Team of PhDs, MDs
• Advanced editorial systems
• Controlled vocabularies
• Multiple levels of QC
• Invested more than 400 man years
PUBLICATIONS (209 for EGF-EGFR interaction)
MOLECULAR INTERACTION
NETWORK: ~ 1,500,000 molecular interactions
PATHWAY: ~ 3,000 pathways
INTEGRATED APPROACH
Pathway knowledge
Pathway-driven approaches
Statistical approaches
1. Target identification or repositioning
2. Biomarker discovery
3. Patient stratification
WHY DIFFERENT PATIENT RESPONSE?
Blockbuster strategy: “One drug for all patients”
Drug toxic but beneficial
Drug toxic but NOT beneficial
Drug NOT toxic and beneficial
Drug NOT toxic and NOT beneficial
New strategy is needed
Patient stratification: “The most efficient and safe drug for a cohort of patients”
HOW CAN PATIENTS BE STRATIFIED?
Mechanism 1: Biomarkers
Mechanism 2: Biomarkers
Biomarker: measurable molecular indicator of disease subtype/progress, drug efficacy, side effect/toxicity
• Identify subtypes resulting in multiple drug targets rather than one.
• A shift from the presumption of a disease to multiple diseases would reframe the drug development strategy
ORION BIONETWORKS
Orion Bionetworks (Cohen Veteran Biosciences) is an alliance of world leading organizations in patient care, computational modelling, translational research and patient advocacy that aims to develop open-source computational models for multiple sclerosis and improve upon existing analytical tools for model development.
~186 subjects with gene expression data and clinical parameters like time to relapse, etc
GOALS:
Understand the structure of the population based on the molecular data: identify cohorts of patients whose clinical course differs over time
Build stratification models
Identify new therapeutic targets
NETWORK/PATHWAY BASED METHODS FOR BIOMARKER DISCOVERY
1. PATHWAY IDENTIFICATION
— 56 pathways identified
• 136 genes
• 39/136 genes were present in multiple pathways
• 44/136 genes known MS biomarkers or drug targets (p = 5 x 10^-6)
• Individual expression values of each member gene were averaged into a combined z-score (the pathway activity score)
• The association of the activity score with time to relapse was calculated in a Cox proportional hazards model
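One plausible reading of the activity score, sketched in Python: z-score each gene across patients, then average the z-scores of the pathway's member genes for each patient. The slide does not spell out the exact normalization, so the per-gene z-scoring, the use of the population standard deviation, the gene names and the expression values are all assumptions made for illustration. The Cox model step is not shown.

```python
from statistics import mean, pstdev


def zscores(values):
    """Z-score one gene's expression values across patients."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def pathway_activity(expression, pathway_genes):
    """Average member-gene z-scores into one activity score per patient.

    expression: {gene: [value for patient 0, patient 1, ...]}
    """
    per_gene = [zscores(expression[g]) for g in pathway_genes if g in expression]
    return [mean(column) for column in zip(*per_gene)]


# Hypothetical expression values for three patients.
expression = {
    "G1": [1.0, 2.0, 3.0],
    "G2": [10.0, 20.0, 30.0],
    "G3": [5.0, 5.0, 5.0],
}
scores = pathway_activity(expression, ["G1", "G2"])
print(scores)
```

Because each gene is standardized before averaging, a gene measured on a larger scale (G2 here) does not dominate the pathway score.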
2. PATIENTS CLUSTERING BY PATHWAYS
Clusters are significantly associated with time to relapse in the presence of important clinical covariates
Patients were clustered into groups by k-means clustering of their pathway activity profiles; k=3 gave the best separation of patient profiles.
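A compact pure-Python sketch of k-means on pathway activity profiles. The profiles are invented two-pathway toy data, and the farthest-point initialization is an assumption chosen to keep the example deterministic; the slides do not say which implementation or initialization was used.

```python
def _sqdist(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(max(points, key=lambda p: min(_sqdist(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assignment step: each profile goes to its nearest centroid.
        labels = [min(range(k), key=lambda i: _sqdist(p, centroids[i])) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[i])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical activity profiles: six patients, two pathways each.
profiles = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (10.0, 0.0), (10.1, 0.1)]
labels, centroids = kmeans(profiles, k=3)
print(labels)
```

In practice the choice of k would be compared across several values, as the slide reports for k=3.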
3. CLASSIFICATION MODEL
• A K-Nearest Neighbor model was previously generated to predict risk groups 1-3 using all biomarkers
• Feature selection was performed by taking the variable importance calculated from the trained KNN model.
• Forward feature selection was then conducted using 10-fold CV, adding features to the model in order of their importance.
• Once this process was complete, the predictive performance was evaluated in terms of the ability of the model to separate the three risk groups
• Final feature set was applied to test data
Signature was reduced from 56 to 13 pathways, containing 65 genes
GENE ONLY MODEL WAS NOT ROBUST TO TEST DATA
PATHWAY BASED APPROACH GENE BASED APPROACH
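The forward selection loop described for the classification model can be sketched in pure Python. Several things here are assumptions rather than the presenters' method: the importance ranking is simply passed in as a list, a feature is kept only if it improves cross-validated accuracy, folds are formed by striding through the samples, and the data are synthetic.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, features, k=3):
    """Predict a label by majority vote among the k nearest training rows."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, features, folds=10, k=3):
    """Accuracy of KNN under simple strided k-fold cross-validation."""
    correct = 0
    for fold in range(folds):
        test_idx = set(range(fold, len(X), folds))
        train_X = [X[i] for i in range(len(X)) if i not in test_idx]
        train_y = [y[i] for i in range(len(X)) if i not in test_idx]
        for i in test_idx:
            correct += knn_predict(train_X, train_y, X[i], features, k) == y[i]
    return correct / len(X)


def forward_select(X, y, ranked_features, folds=10, k=3):
    """Add features in ranked order, keeping each only if CV accuracy improves."""
    selected, best = [], 0.0
    for f in ranked_features:
        score = cv_accuracy(X, y, selected + [f], folds, k)
        if score > best:
            selected, best = selected + [f], score
    return selected, best


# Synthetic data: feature 0 separates three risk groups, feature 1 does not.
X, y = [], []
for group in range(3):
    for j in range(10):
        X.append([group * 10 + j * 0.1, (j * 7) % 10])
        y.append(group)

selected, accuracy = forward_select(X, y, ranked_features=[0, 1])
print(selected, accuracy)
```

On this toy data the uninformative feature is dropped because it adds nothing to cross-validated accuracy, which mirrors the idea of shrinking a signature while keeping its predictive power.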
CONCLUSIONS
• Signature differentiating between patient cohorts was reduced from 56 to 13 pathways
• This new signature contains 65 genes
• 13 biomarkers could stratify subjects into risk groups with statistically significant differences in time to relapse
• This was validated in test subjects, with results consistent with those observed in the training cohort
• Pathway activities were more robust than gene expression
Poll Question 4: What is the greatest barrier to application of NLP/ML at your company?
A. Technical expertise
B. Access to data
C. Data quality
D. Management support/understanding
E. Other
Poll Question 5: Do you expect an increase in ML within Life Science in the next 2 years?
A. Yes
B. No
C. Don’t Know
Audience Q&A
Please use the Question function in GoToWebinar
The next Pistoia Alliance Debates Webinar:
Where will AI/Deep learning have an impact in Life Science & Health?
Moderator: Nick Lynch, with Sean Ekins, CEO, Collaborations Pharmaceuticals Inc; David Pearah, CEO, HDF Group; and Peter Henstock, Pfizer Research
Date: September 27, 2017
Check http://www.pistoiaalliance.org/pistoia-alliance-debates-webinar-series/ for the latest information
[email protected] @pistoiaalliance www.pistoiaalliance.org