NLP & ML Webinar

This webinar is being recorded

Natural Language Processing and Machine Learning: Beyond the Hype

A Pistoia Alliance Debates Webinar

Moderated by David Milward, Linguamatics

September 14, 2017

Poll Question 1: What role do you play in your company?

A. IT

B. Data scientist/bioinformatician

C. Clinical/bench scientist

D. Information professional

E. Other

© Pistoia Alliance

The Panel

David Milward, Ph.D., CTO, Linguamatics

David Milward is chief technology officer (CTO) at Linguamatics. He is a pioneer of interactive text mining and a founder of Linguamatics. He has over 20 years of experience in product development, consultancy and research in natural language processing (NLP). After receiving a PhD from the University of Cambridge, he was a researcher and lecturer at the University of Edinburgh. He has published in the areas of information extraction, spoken dialogue, parsing, syntax and semantics.

Chengyi Zheng, Ph.D., NLP Specialist, Kaiser Permanente

Chengyi Zheng, PhD, is an NLP specialist at Kaiser Permanente Southern California. He has worked on over 30 research projects using electronic health record (EHR) data from millions of patients. He is the principal investigator of a CDC-funded study involving 5 health care institutions on using NLP in vaccine safety studies. He won the Kaiser Permanente predictive modeling competition, and he took first place in the innovation competition (InnoCentive@Lilly) while serving as a biomedical informatics scientist at Eli Lilly. He was trained in computer science with a concentration in speech recognition. He will share some experiences of using NLP and machine learning on EHR data for outcomes prediction.

Eugene Myshkin, Ph.D., Senior Research Scientist, Clarivate

Eugene Myshkin, PhD, is a senior scientist in bioinformatics at Clarivate Analytics. He has over 15 years of experience in drug discovery, cheminformatics and bioinformatics. He has also been involved in a number of text mining projects, including mining of chemical reagents and antibodies from scientific literature.

September 14, 2017 NLP and ML

Agenda

• AI, NLP and ML (David)

• Using NLP and ML in clinical research (Chengyi)

• Network and pathway driven machine learning approaches to biomarker discovery and patient stratification (Eugene)

NLP, AI and Machine Learning

David Milward, PhD

CTO, Linguamatics

2017

Overview

AI (Artificial Intelligence)

NLP (Natural Language Processing)

− and its applications in life sciences

ML (Machine Learning) and DL (Deep Learning)

NLP to feed ML-based DS (Decision Support)

ML in NLP

[Diagram labels: AI, DL, ML, NLP, DS]

© 2017 Linguamatics

Artificial Intelligence (AI)

Artificial intelligence is intelligence exhibited by machines

The central goals of AI research include reasoning, knowledge, planning, learning, natural language processing (communication), perception and the ability to move and manipulate objects

As machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, leading to the quip “AI is whatever hasn't been done yet”

Wikipedia


Natural Language Processing (NLP)

Processing of natural languages e.g. English, French, Chinese by computers

NLP is part of AI, but also key to other areas of AI e.g. providing decision support

− If 80% of knowledge is unstructured, we need NLP to get the right information to provide good suggestions

− Currently many AI projects are limited: they can only address questions where there is structured data

− Worse, they often use inappropriate structured data such as ICD billing codes for non-billing tasks


Find information however it is expressed


Different word, same meaning

cyclosporine

ciclosporin

Neoral

Sandimmune

Different expression, same meaning

Non-smoker

Does not smoke

Does not drink or smoke

Denies tobacco use

Different grammar, same meaning

5mg/kg of cyclosporine daily

5mg/kg/d of cyclosporine

cyclosporine 5mg/kg/day

Same word, different context

Diagnosed with diabetes

Family history of diabetes

No family history of diabetes
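
The four cases above can be illustrated with a minimal sketch. It assumes a tiny hand-written synonym table and three toy context rules; the names `DRUG_SYNONYMS`, `find_drugs` and `diabetes_status` are hypothetical, and a production NLP system would rely on curated terminologies and linguistic analysis rather than substring checks.

```python
import re

# Hypothetical synonym table: every surface form maps to one concept name.
DRUG_SYNONYMS = {
    "cyclosporine": "cyclosporine",
    "ciclosporin": "cyclosporine",
    "neoral": "cyclosporine",
    "sandimmune": "cyclosporine",
}

def find_drugs(text):
    """Return the set of normalized drug concepts mentioned in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {DRUG_SYNONYMS[w] for w in words if w in DRUG_SYNONYMS}

def diabetes_status(text):
    """Classify a diabetes mention by its context (toy rules only)."""
    t = text.lower()
    if "diabetes" not in t:
        return "not mentioned"
    # Check the negated form first, because it contains the plain form.
    if "no family history" in t:
        return "no family history"
    if "family history" in t:
        return "family history"
    return "patient"

for sentence in ["Diagnosed with diabetes",
                 "Family history of diabetes",
                 "No family history of diabetes"]:
    print(sentence, "->", diabetes_status(sentence))
print(find_drugs("Switched from Sandimmune to Neoral"))
```

The order of the context checks matters: the same word ("diabetes") yields three different facts depending on the words around it.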

NLP

Represent it in a standardized form


Concept        Text                       Normalized Value
Diseases       breast cancer              Breast Neoplasm
               carcinoma of the breast
Genes          Raf-1                      RAF1
               Raf I
Dates          27th Feb 2014              20140227
               2014/02/27
Measurements   0.2g                       200 mg
               Two hundred milligrams
Mutations      Val 158 Met                V158M
               Val by Met at codon 158

Relationships: “nimesulide, a selective COX2 inhibitor, …” is captured as nimesulide inhibits COX2 (Entrez Gene ID: 5743)
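
As a rough illustration of normalization, the sketch below maps some of the surface forms in the table to their standardized values. The function names and the two-entry amino acid table are assumptions made for this example; spelled-out forms such as "Two hundred milligrams" or "Val by Met at codon 158" are deliberately not handled.

```python
import re
from datetime import datetime

def normalize_date(text):
    """Map a few date spellings to YYYYMMDD; return None if unrecognized."""
    # Drop ordinal suffixes such as "27th" -> "27".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in ("%d %b %Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return None

UNIT_TO_MG = {"g": 1000.0, "mg": 1.0}

def normalize_mass(text):
    """Convert a numeric mass such as "0.2g" to a string in milligrams."""
    m = re.fullmatch(r"\s*([\d.]+)\s*(mg|g)\s*", text.lower())
    if not m:
        return None
    mg = float(m.group(1)) * UNIT_TO_MG[m.group(2)]
    return f"{mg:g} mg"

# Only the two residues needed for the example; a real table has 20 entries.
AA3_TO_1 = {"val": "V", "met": "M"}

def normalize_mutation(text):
    """Turn "Val 158 Met" into the short form "V158M"."""
    m = re.fullmatch(r"\s*([A-Za-z]{3})\s*(\d+)\s*([A-Za-z]{3})\s*", text)
    if not m:
        return None
    ref, pos, alt = m.group(1).lower(), m.group(2), m.group(3).lower()
    if ref not in AA3_TO_1 or alt not in AA3_TO_1:
        return None
    return f"{AA3_TO_1[ref]}{pos}{AA3_TO_1[alt]}"

print(normalize_date("27th Feb 2014"), normalize_date("2014/02/27"))
print(normalize_mass("0.2g"))
print(normalize_mutation("Val 158 Met"))
```

Once every variant maps to one value, downstream queries and ML features can treat the variants as the same fact.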

From Bench to Bedside: NLP Provides Insight


Stages: Discovery, Development, Delivery

Pipeline: Idea, Basic research, Clinical trials (Phase 1, Phase 2, Phase 3), Regulatory approval, Patient care

Business critical questions

What targets are involved in bone cancer?

What companies are patenting a particular technology?

What are the safety risks of my drug?

Where can I site my Phase 1, Phase 3 clinical study?

What are the clinical risks for my patients?

Direct access to the Unstructured

Weight ≥ 80kg

Below 60 years old

Reports after 2010

With mutation C677T

Cancer patients

Machine Learning

Machine Learning is used for AI in general and as a technique within NLP

3 main flavours:

− Supervised: uses annotated data mapping between inputs and outputs

− Semi-supervised: uses machine analysis but incorporates a human in the loop

− Unsupervised: uses unannotated data, usually at very large scale


Recent successes with deep learning approaches based on neural networks for supervised and unsupervised ML, e.g.

− Machine translation using parallel corpora

− Image classification in medicine

Using NLP to feed other AI

NLP provides access to the 80% of information in unstructured text

Provides a set of potential features to be used in e.g. ML models for Decision Support

Example: building risk models from real-world data (RWD) sets

− Predicting patients at risk of misusing opioid prescription drugs (AMIA November 2017)

− Features extracted by Linguamatics I2E from 8.9 million de-identified medical record full-text transcripts from RealHealthData

− SVM classifier trained on the features to flag patients at risk
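
The pattern of "text in, features out, classifier on top" can be sketched as follows. This is not the system described on the slide: the phrases, notes and labels are invented for illustration, and a perceptron (a simpler linear classifier) stands in for the SVM so that the example needs no third-party library.

```python
# Hypothetical feature phrases; a real system would take them from NLP output.
FEATURES = ["early refill request", "lost prescription", "physical therapy"]

def to_vector(note):
    """Binary bag-of-phrases vector: 1.0 if the phrase occurs in the note."""
    text = note.lower()
    return [1.0 if f in text else 0.0 for f in FEATURES]

def train_perceptron(rows, labels, epochs=20):
    """Train a linear classifier; labels are +1 (flag) or -1 (do not flag)."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: move the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Invented toy notes with invented labels.
notes = [
    ("Early refill request; reports lost prescription.", 1),
    ("Lost prescription again this month.", 1),
    ("Referred to physical therapy.", -1),
    ("Routine visit, physical therapy going well.", -1),
]
X = [to_vector(n) for n, _ in notes]
y = [label for _, label in notes]
w, b = train_perceptron(X, y)
print(predict(w, b, to_vector("Patient reports lost prescription")))
```

The classifier only ever sees the feature vectors, which is why the quality of the NLP step bounds the quality of the model.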


Machine Learning in NLP

Supervised ML

− Requires large-scale, representative annotated documents

− Main paradigm for core NLP components

− For extraction patterns, used in academic systems but less commonly in commercial ones

Semi-supervised ML

− Useful for new tasks or data sets where there is no existing representative annotated data

− Useful where a task is initially ill-defined

− Puts a human in the loop judging suggestions from the machine learning

− Can provide good quality results quickly, e.g. to test whether a feature extracted by NLP is useful for an ML model

Unsupervised ML

− Uses large-scale unannotated data

− Key example is learning the meaning of a word via the context it keeps (word embeddings)
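
The idea of learning a word's meaning from its context can be shown with plain co-occurrence counts, a much simpler relative of trained word embeddings. The three-sentence corpus and the function names below are invented for the sketch.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented, unannotated toy corpus.
corpus = [
    "patient takes cyclosporine daily",
    "patient takes ciclosporin daily",
    "patient denies tobacco use",
]
vecs = context_vectors(corpus)
same_drug = cosine(vecs["cyclosporine"], vecs["ciclosporin"])
unrelated = cosine(vecs["cyclosporine"], vecs["tobacco"])
print(same_drug > unrelated)
```

No annotation says that the two spellings name the same drug; the similarity falls out of the shared contexts alone.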


Semi-Supervised ML Approaches

Similar distributions for words and syntactic constructions

Automatically discover what is in the data using an interactive, agile text mining platform such as Linguamatics I2E

A long tail of infrequent cases

− prioritize the more frequent constructions

− generalize to cover items in the tail


Zipf’s Law: the frequency of any word is inversely proportional to its rank in the frequency table
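
A small sketch of what the law predicts, assuming whitespace tokenization and a made-up text: rank the words by count and compare each count with the idealized value f(1) / r.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]

def zipf_expected(top_count, rank):
    """Frequency predicted by the idealized law f(r) = f(1) / r."""
    return top_count / rank

# Invented text whose counts are 6, 3, 2 and 1.
table = rank_frequency("a a a a a a b b b c c d")
top = table[0][2]
for rank, word, count in table:
    print(rank, word, count, zipf_expected(top, rank))
```

In the tail the observed counts drop to 1 while the list of distinct items keeps growing, which is why the slide suggests prioritizing frequent constructions and then generalizing to cover the rest.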

https://www.google.co.uk/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0ahUKEwjx1_HThrrSAhVCPxoKHf3uCr0QjRwIBw&url=https://www.linkedin.com/pulse/zipfs-law-dental-scientific-literature-milo%C5%A1-radulovi%C4%87&psig=AFQjCNHJWSK1sha20HXII7VCQsXKPgAAhQ&ust=1488620721061189

Semi-Supervised NLP using Linguamatics I2E


Summary

NLP is critical to success of many ML projects

− access to the unstructured text is key to using ML widely, not just where there is convenient structured data

Semi-supervised approaches to NLP provide an efficient way to capture features for ML projects


Poll Question 2: What is your company’s primary use for NLP?

A. Early Discovery/ Pre-clinical

B. Clinical

C. Real world data

D. Other

E. Don’t use NLP

Using NLP and ML in clinical research

Chengyi Zheng, PhD, MS

DEPARTMENT of Research & Evaluation

Two weeks of records for a patient in an EMR system (10/6/2012 to 10/19/2012)

10/6/2012    Imaging: DEXA bone density
10/7/2012    Pt called
10/7/2012    Nurse called back
10/8/2012    Orthopedic office visit. Where: Medical Center, Department
10/8/2012    Progress notes: Reason for visit: knee pain; Vital sign/BMI/Pain level/History; PE/Findings/Impression/A&P; Dx: ICD-9 code; Nurse exam note: …
10/9/2012    Lab
10/10/2012   Pre-op dental exam (ext)
10/10/2012   Surgery scheduled
10/11/2012   Office visit
10/11/2012   Rx prescribed
10/11/2012   Office visit: sinus congestion, ankle itchy. Dx: 401.9 Essential Hypertension, 274.9 Gout, 461.9 Acute Sinusitis
10/12/2012   Picked up the Rx
10/13/2012   Pt missed appt.
10/13/2012   Telephone consult: Healthy bones PN
10/14/2012   Pt emailed: drug adverse event
10/14/2012   Pt called: cancerous area
10/15/2012   EKG. Dx: Screening
10/15/2012   Ear wax wash
10/16/2012   Procedure: remove skin
10/16/2012 - 10/19/2012   Hospitalization
10/18/2012   Pathology report out

5 Ws: What, Who, When, Where and Why

Membership length: 70% > 5 years, 50% >10 years.

5 Ws: What, Who, When, Where and Why

What

– What is the reason for the visit?

– What happened? (pain after a fall, pain after drinking a beer?)

Who

– Who is the caregiver? (primary physician, rheumatologist?)

– What do we know about this patient? (age, race, past medical history, etc.)

Where

– Where did this visit occur?

When

– When did the problem start?

Why

– Why did this problem happen? Possible causes?


Visual representation of KPSC research databases


Case study: Identify acute gout flare

Published methods to identify gout flares using claims data

– Clinical coding is unreliable: under-coding, over-coding, too general

– Medication is unreliable:

  Drugs for gout maintenance

  Drugs also used for other diseases (which share similar symptoms)

NLP has been used to:

– Identify study population and patient information

– Identify and extract clinical variables (genetic, biopsy, radiology)

– Evaluate patient status (disease progression, medication status)


Solution and challenges (NLP)

Challenges:

– Gout is a chronic disease which can be controlled but not cured

  Signs and symptoms could appear in follow-up visits

  Need to differentiate between acute and chronic status

– Gout population is generally old, with comorbidities sharing similar symptoms

  100+ types of arthritis (> 50 million people)

  Pain, erythema, and joint swelling

– Information documented varies by clinical note

Standard solutions:

– Each search query captures one set of information

– Each search query has its own sensitivity/specificity etc.

– Logic operator combines search results (union, join, etc.)

Difficult to optimize on the overall sensitivity/specificity etc.
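
The trade-off described above can be made concrete with a small sketch. The note IDs, the two query result sets and the gold standard are invented; the point is only that a union raises sensitivity at the cost of specificity, while an intersection does the opposite.

```python
def confusion(predicted, gold, all_ids):
    """Return (TP, FP, FN, TN) for a set of predicted-positive IDs."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(all_ids - predicted - gold)
    return tp, fp, fn, tn

def metrics(predicted, gold, all_ids):
    """Sensitivity, specificity, PPV and NPV (no guard against empty classes)."""
    tp, fp, fn, tn = confusion(predicted, gold, all_ids)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Invented example: ten notes, four of which truly describe a flare.
all_notes = set(range(1, 11))
gold = {1, 2, 3, 4}
query_a = {1, 2, 5}   # e.g. hits of one search query
query_b = {2, 3, 6}   # e.g. hits of another search query

for name, hits in [("A", query_a), ("B", query_b),
                   ("A union B", query_a | query_b),
                   ("A intersect B", query_a & query_b)]:
    print(name, metrics(hits, gold, all_notes))
```

Each query has its own operating point, and the logic operators only move along a fixed menu of combinations, which is one way to read why tuning the overall sensitivity and specificity is hard with queries alone.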


Mining vs. NLP & ML in clinical research

Steps:

1. Preliminary analysis, estimate feasibility

2. Develop plan, estimate cost

3. Seek permit (government vs. IRB)

4. Mine (mining equipment vs. NLP)

– Focus on completeness (high sensitivity)

– Shallow & deep mining (good specificity)

5. Refine (chemical process vs. ML)

– Improve purity (higher specificity)

6. Manual verification (optional)

7. Deliver to customer

“art and science combined”

“resource-heavy and time-consuming process”

Solution and challenges (NLP+ML)

Goal:

NLP focuses on sensitivity, or information completeness

– Separate ores from rock

ML focuses on improving the specificity

– Improve purity without much loss of sensitivity

Solution:

NLP results as input features to the ML system

– Identify related signs and symptoms

– Identify temporal relationship (when and how long?)

– Identify disease association (related to any other disease?)

– Identify implicit and explicit mention of gout flare

– Identify treatment plan associated with disease onset


Overview of the system development steps

Study period: 1/1/2007 to 12/31/2010. Patients > 18 years, with a diagnosis of gout and on urate-lowering therapy. Within [-3,+12] months of index date, 599,317 clinical notes for 16,519 patients.


Overview of the NLP+ML system


Performance comparisons (values in %)

Clinical note level gout flare identification

                    Sensitivity   Specificity   PPV     NPV
Rheumatologist 1    81.1          95.4          88.3    92.2
Rheumatologist 2    90.9          97.3          93      96.5
NLP+ML              84.8          92.2          81.1    93.9

Identify patients with ≥ 1 gout flares

                    Sensitivity   Specificity   PPV     NPV
Rheumatologist 1    98.5          92.9          97.1    96.3
Rheumatologist 2    97.1          92.9          97.1    92.9
NLP+ML              98.5          96.4          98.5    96.4
ICD-9               88.2          89.3          95.2    75.8

Identify patients with ≥ 3 gout flares (refractory gout)

                    Sensitivity   Specificity   PPV     NPV
Rheumatologist 1    74.2          92.3          82.1    88.2
Rheumatologist 2    83.9          95.4          89.7    92.5
NLP+ML              93.5          84.6          74.4    96.5
ICD-9               41.9          95.4          81.3    77.5


Results

Note level (gout flare, n= 599,317):

– NLP: 49,415 positive cases => ML: 18,869 positive cases

Patient level (with ≥ 3 flares, n=16,519):

– Number of patients: 1,402 (NLP+ML) vs. 516 (Claim)

– Sensitivity: 93.5% (NLP+ML) vs. 41.9% (Claim)

Impact:

– Identify refractory disease patients

– Estimate market size (KPSC / US population = 4.5/325 million =1.4%)

– Better disease management, improve quality of life, and help reduce healthcare resource use.

1,402 patients is more manageable than 16,519 patients
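
The arithmetic behind these results can be checked in a few lines. All inputs are the figures quoted on the slide; the final extrapolation to the US is an added assumption (that KPSC members are representative of the US population) and is not a result from the study.

```python
# Figures quoted on the slide.
kpsc_members = 4.5e6
us_population = 325e6
nlp_positive_notes = 49_415
ml_positive_notes = 18_869
patients_nlp_ml = 1_402
patients_claims = 516

coverage = kpsc_members / us_population              # KPSC share of the US
kept_by_ml = ml_positive_notes / nlp_positive_notes  # NLP hits that ML keeps
ratio_vs_claims = patients_nlp_ml / patients_claims  # NLP+ML vs. claims

# Assumption, not from the slide: scale linearly to the whole US population.
us_estimate = patients_nlp_ml / coverage

print(f"KPSC share of US population: {coverage:.1%}")
print(f"Fraction of NLP-positive notes kept by ML: {kept_by_ml:.1%}")
print(f"Patients found, NLP+ML vs. claims: {ratio_vs_claims:.1f}x")
print(f"Linear US extrapolation: {us_estimate:,.0f} patients")
```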


ML in healthcare

Tremendous opportunities

– Prediction: high utilizers, risk scores

– Identification: cases, outcomes, social needs

– Image recognition: pathology and radiology images

Challenges (Data)

– Data quality: dirty, missing data

– Heterogeneous data: different systems

– Structured, semi-structured and free text data

– Image, scanned documents

– Genetic and biobank data

Challenges (People)

– Who understands NLP, ML and healthcare

– Who understands the complexity of healthcare data


Poll Question 3: How does your company primarily use machine learning in drug discovery?

A. Target prediction and repositioning

B. Biomarker discovery

C. Patient stratification

D. Other

E. We don’t use machine learning

Network and pathway driven machine learning approaches to biomarker discovery and patient stratification

Eugene Myshkin, PhD

September 2017

CLARIVATE ANALYTICS TEXT MINING

• Clarivate Analytics literature data feed

• Comprehensive coverage

– >20,000 journals

– Journal content mirrors: Current Contents; Web of Science; Biosis; International Pharmaceutical Abstracts; Derwent Drug File

– http://ip-science.thomsonreuters.com/cgi-bin/jrnlst/jlresults.cgi?PC=MASTER

• Latest information

– Updated with over 170,921 articles/month, or 2,051,051+ articles/year

• Full text, cover to cover searching of all journals

• Comprehensive synonym collections

• Controlled vocabulary management software to support mining

CLARIVATE ANALYTICS LIFE SCIENCES SOLUTIONS

• Pharmacovigilance Literature Monitoring

• Biological and Chemical Reagent Monitoring

• Concepts in social media

• Automated Curation of Clinical Data

• Protein and Gene Variant Monitoring

USING NLP FOR MANUSCRIPT MATCHING

Analyze citation connections to place the publication in the right journal

FOCUS OF DRUG DISCOVERY: DRUG, TARGET, DISEASE

These associations can be obtained with NLP, but precision is a problem: a flood of false positives, and the need to hire people just to sort the true alerts from the false ones.

PITFALLS OF NLP FEATURES FOR ML

• 1-10 million features

• Feature vectors are binary and sparse

• Feature redundancy

• Feature selection takes a long time

METABASE MANUALLY ANNOTATED CONTENT

PUBLICATIONS (209 for EGF-EGFR interaction)

• Manual annotation from publications

• Team of PhDs, MDs

• Advanced editorial systems

• Controlled vocabularies

• Multiple levels of QC

• More than 400 man-years invested

MOLECULAR INTERACTION NETWORK: ~ 1,500,000 molecular interactions

PATHWAY: ~ 3,000 pathways

INTEGRATED APPROACH

Pathway knowledge

Pathway-driven approaches

Statistical approaches

1. Target identification or repositioning

2. Biomarker discovery

3. Patient stratification

WHY DIFFERENT PATIENT RESPONSE?

Blockbuster strategy: “One drug for all patients”

• Drug toxic but beneficial

• Drug toxic but NOT beneficial

• Drug NOT toxic and beneficial

• Drug NOT toxic and NOT beneficial

New strategy is needed

Patient stratification: “The most efficient and safe drug for a cohort of patients”

HOW CAN PATIENTS BE STRATIFIED?

[Diagram labels: Mechanism 1 with its biomarkers, Mechanism 2 with its biomarkers]

Biomarker: measurable molecular indicator of:

• disease subtype/progress

• drug efficacy

• side effect/toxicity

• Identify subtypes resulting in multiple drug targets rather than one.

• A shift from the presumption of a disease to multiple diseases would reframe the drug development strategy

ORION BIONETWORKS

Orion Bionetworks (Cohen Veteran Biosciences) is an alliance of world leading organizations in patient care, computational modelling, translational research and patient advocacy that aims to develop open-source computational models for multiple sclerosis and improve upon existing analytical tools for model development.

~186 subjects with gene expression data and clinical parameters like time to relapse, etc

GOALS:

Understand the structure of the population based on the molecular data: identify cohorts of patients whose clinical course differs over time

Build stratification models

Identify new therapeutic targets

NETWORK/PATHWAY BASED METHODS FOR BIOMARKER DISCOVERY


1. PATHWAY IDENTIFICATION

• 56 pathways identified

• 136 genes

• 39/136 genes were present in multiple pathways

• 44/136 genes are known MS biomarkers or drug targets (p = 5 × 10^-6)

• Individual expression values of each member gene were averaged into a combined z-score

• The association of the activity score with time to relapse was calculated in a Cox proportional hazards model
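
One plausible reading of the combined z-score is sketched below: standardize each gene across subjects, then average the z-scores of a pathway's member genes for each subject. The gene names and values are invented, the use of the population standard deviation is an assumption, and the Cox proportional hazards step is not shown because it needs a survival analysis library.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize one gene's expression values across subjects."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def pathway_activity(expression, member_genes):
    """Average the per-gene z-scores of a pathway's members, per subject.

    expression: dict gene -> list of values, one per subject (same order).
    """
    standardized = [zscores(expression[g]) for g in member_genes]
    return [mean(per_subject) for per_subject in zip(*standardized)]

# Invented expression values for three subjects.
expression = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [10.0, 20.0, 30.0],
    "GENE_C": [5.0, 5.0, 5.0],
}
activity = pathway_activity(expression, ["GENE_A", "GENE_B"])
print(activity)
```

Standardizing first keeps a highly expressed gene such as the hypothetical GENE_B from dominating the pathway score.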


2. PATIENT CLUSTERING BY PATHWAYS

Patients were clustered into groups by k-means clustering of their pathway activity profiles; k=3 gave the best separation of patient profiles.

Clusters are significantly associated with time to relapse in the presence of important clinical covariates.
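
A bare-bones k-means with k=3 is sketched below for orientation. The two-dimensional "pathway activity profiles" are invented, and the centroids are naively initialized at the first k points; practical implementations use random restarts or smarter seeding.

```python
def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

# Invented profiles: (activity of pathway 1, activity of pathway 2).
profiles = [(-1.0, -1.2), (1.0, 0.9), (0.0, 0.1),
            (-0.9, -1.0), (1.1, 1.0), (0.1, 0.0)]
labels, centroids = kmeans(profiles, k=3)
print(labels)
```

The cluster labels would then be tested for association with time to relapse, which this sketch does not attempt.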


3. CLASSIFICATION MODEL

• A K-Nearest Neighbor (KNN) model was previously generated to predict risk groups 1-3 using all biomarkers

• Feature selection was performed by taking the variable importance calculated from the trained KNN model

• Forward feature selection was then conducted using 10-fold cross-validation (CV), adding features to the model in order of their importance

• Once this process was complete, the predictive performance was evaluated in terms of the ability of the model to separate the three risk groups

• Final feature set was applied to test data

Signature was reduced from 56 to 13 pathways, containing 65 genes

GENE ONLY MODEL WAS NOT ROBUST TO TEST DATA

[Figure panels: PATHWAY BASED APPROACH, GENE BASED APPROACH]
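
The selection procedure can be sketched in plain Python under stated assumptions. The slide does not say how variable importance was computed, so the sketch ranks each feature by its own cross-validated accuracy; the data are synthetic (feature 0 tracks the risk group, feature 1 is noise), and all function names are hypothetical.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, feats, k=3):
    """Majority vote among the k nearest training rows, using only `feats`."""
    def dist(row):
        return sum((row[f] - x[f]) ** 2 for f in feats)
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i]))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cv_accuracy(X, y, feats, folds=10, k=3):
    """Accuracy under simple interleaved k-fold cross-validation."""
    correct = 0
    for fold in range(folds):
        test_idx = [i for i in range(len(X)) if i % folds == fold]
        train_idx = [i for i in range(len(X)) if i % folds != fold]
        tX = [X[i] for i in train_idx]
        ty = [y[i] for i in train_idx]
        correct += sum(knn_predict(tX, ty, X[i], feats, k) == y[i]
                       for i in test_idx)
    return correct / len(X)

def forward_select(X, y, folds=10, k=3):
    """Add features in order of importance; keep the best-scoring prefix."""
    n_feats = len(X[0])
    # Assumed importance measure: CV accuracy of each feature on its own.
    order = sorted(range(n_feats),
                   key=lambda f: cv_accuracy(X, y, [f], folds, k),
                   reverse=True)
    best_feats, best_acc, current = [], 0.0, []
    for f in order:
        current = current + [f]
        acc = cv_accuracy(X, y, current, folds, k)
        if acc > best_acc:
            best_feats, best_acc = list(current), acc
    return best_feats, best_acc

# Synthetic data: 15 subjects in risk groups 1-3.
y = [i % 3 + 1 for i in range(15)]
X = [[(i % 3 + 1) + 0.01 * i, ((7 * i) % 5) * 3.0] for i in range(15)]
selected, accuracy = forward_select(X, y)
print(selected, accuracy)
```

The selected feature set would finally be applied once to held-out test data, which the sketch leaves out.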


CONCLUSIONS

• Signature differentiating between patient cohorts was reduced from 56 to 13 pathways

• This new signature contains 65 genes

• 13 biomarkers could stratify subjects into risk groups with statistically significant differences in time to relapse

• This was validated in test subjects, with results consistent with what was observed in the training cohort

• Pathway activities were more robust than gene expression

Poll Question 4: What is the greatest barrier to application of NLP/ML at your company?

A. Technical expertise

B. Access to data

C. Data quality

D. Management support/understanding

E. Other

Poll Question 5: Do you expect an increase in ML within Life Science in the next 2 years?

A. Yes

B. No

C. Don’t know

Audience Q&A

Please use the Question function in GoToWebinar

The next Pistoia Alliance Debates Webinar:

Where will AI/Deep learning have an impact in Life Science & Health?

Moderator: Nick Lynch, with Sean Ekins, CEO, Collaborations Pharmaceuticals Inc; David Pearah, CEO, HDF Group; and Peter Henstock, Pfizer Research

Date: September 27, 2017

Check http://www.pistoiaalliance.org/pistoia-alliance-debates-webinar-series/ for the latest information

@pistoiaalliance www.pistoiaalliance.org