1
Unstructured Machine Learning:
Providing the link between Genetic Data and Published Research
Dr Tony C Smith
Reel Two, Inc.
9 Hartley Street
Hamilton, New Zealand
+64 7 839 7808
www.reeltwo.com
2
What is Machine Learning?
creating computer programs that get better with experience
learn how to make expert judgments
discover previously hidden, potentially useful information (data mining)
How does it work?
user provides learning system with examples of concept to be learned
induction algorithm infers a characteristic model of the examples
model is used to predict whether or not future novel instances are also examples – and it does this very consistently, and very, very quickly!
3
Structured Learning
Mushroom Data
Weight   Damage   Dirt    Firmness   Quality
heavy    high     mild    hard       poor
heavy    high     mild    soft       poor
normal   high     mild    hard       good
light    medium   mild    hard       good
light    clear    clean   hard       good
normal   clear    clean   soft       poor
heavy    medium   mild    hard       poor
. . .
[Figure: decision tree induced from the mushroom data. Root node: weight (branches: heavy, light, normal). Internal nodes: dirt (branches: mild, clean) and firmness (branches: hard, soft). Leaves: good or poor.]
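As a rough illustration of how an induction algorithm can infer a tree from fixed-field examples, here is a minimal sketch of entropy-based tree induction (in the style of ID3) run on the seven mushroom rows listed on this slide. The function names are hypothetical, and the slide does not say which algorithm produced its tree. Because the slide's table is truncated, the tree learned here need not match the one in the figure.

```python
import math
from collections import Counter

ATTRS = ["weight", "damage", "dirt", "firmness"]

# The seven rows shown on the slide; the last field is the quality label.
ROWS = [
    ("heavy", "high", "mild", "hard", "poor"),
    ("heavy", "high", "mild", "soft", "poor"),
    ("normal", "high", "mild", "hard", "good"),
    ("light", "medium", "mild", "hard", "good"),
    ("light", "clear", "clean", "hard", "good"),
    ("normal", "clear", "clean", "soft", "poor"),
    ("heavy", "medium", "mild", "hard", "poor"),
]

def entropy(rows):
    """Shannon entropy of the quality labels in rows."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def split(rows, i):
    """Group rows by the value of attribute number i."""
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r)
    return groups

def build(rows, attr_idx):
    """Return a label (leaf) or a tuple (attribute name, {value: subtree})."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1:
        return labels.pop()
    if not attr_idx:
        return Counter(r[-1] for r in rows).most_common(1)[0][0]

    def remainder(i):
        # Weighted entropy left after splitting on attribute i.
        return sum(len(g) / len(rows) * entropy(g) for g in split(rows, i).values())

    best = min(attr_idx, key=remainder)
    rest = [i for i in attr_idx if i != best]
    return (ATTRS[best], {v: build(g, rest) for v, g in split(rows, best).items()})

def predict(tree, example, default="poor"):
    """Walk the tree; fall back to default for an unseen attribute value."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches.get(example[attr], default)
    return tree

tree = build(ROWS, list(range(len(ATTRS))))
print(tree)
print(predict(tree, {"weight": "light", "damage": "high", "dirt": "mild", "firmness": "soft"}))
```

On these rows the root split is on weight, as in the slide's figure: every heavy example is poor and every light example is good, so only the normal branch needs a further test.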
4
Unstructured Learning
data does not have fixed fields with specific values
examples: images, continuous signals, expression data, text
learning proceeds by correlating the presence or absence of any and all salient attributes
Document Classification
given examples of documents covering some topic, learn a semantic model that can recognize whether or not other documents are relevant
prioritize them: i.e. quantify “how relevant” documents are to the topic
not limited to keywords (nor is it misled by them)
adapt to the user’s needs (ephemeral or long-term)
5
How Text Mining Works
Users supply the system with training data
• Documents that are good examples of the desired category
The system builds ‘classifiers’
• Statistical models based on the training data
The system classifies novel data
• Identifies other documents about the desired category
Results are displayed or stored
• Files can be viewed, routed to end users or stored in databases
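The four steps above can be sketched with a small naive Bayes classifier over word presence or absence. This is one common statistical technique, chosen here as an assumption; the slides do not say which model the Classification System uses. The class name, the example documents and the decision to treat each document as a set of words are all made up for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lower-cased set of words: only presence or absence matters."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class PresenceNaiveBayes:
    """Two-class naive Bayes over word presence (hypothetical sketch)."""

    def __init__(self, relevant_docs, other_docs):
        self.n = {"rel": len(relevant_docs), "other": len(other_docs)}
        self.df = {"rel": Counter(), "other": Counter()}
        for label, docs in (("rel", relevant_docs), ("other", other_docs)):
            for d in docs:
                self.df[label].update(tokens(d))  # document frequency per word
        self.vocab = set(self.df["rel"]) | set(self.df["other"])

    def _log_likelihood(self, present, label):
        total = 0.0
        for w in self.vocab:
            # Laplace-smoothed probability that a document of this class contains w.
            p = (self.df[label][w] + 1) / (self.n[label] + 2)
            total += math.log(p if w in present else 1 - p)
        return total

    def score(self, text):
        """Log-odds that text is relevant; higher means more relevant."""
        present = tokens(text)
        prior = math.log(self.n["rel"] / self.n["other"])
        return (prior + self._log_likelihood(present, "rel")
                - self._log_likelihood(present, "other"))

    def rank(self, docs):
        """Order documents from most to least relevant."""
        return sorted(docs, key=self.score, reverse=True)

# Step 1: training data (invented sentences, not from any real corpus).
relevant = ["p38 MAP kinase activation in stressed cells",
            "kinase cascade phosphorylates target protein"]
other = ["quarterly sales report for retail stores",
         "patent filing deadline and legal fees"]

# Step 2: build the classifier. Steps 3 and 4: classify and display novel data.
model = PresenceNaiveBayes(relevant, other)
novel = ["retail stores sales figures", "protein kinase signalling cascade"]
for doc in model.rank(novel):
    print(round(model.score(doc), 2), doc)
```

The score is a log-odds value, so it gives a ranking ("how relevant") rather than just a yes or no answer, and words absent from a document contribute evidence as well as words that are present.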
6
Classification System
Client-specific categories
Familiar Windows-style interface
Drag-and-drop documents to create custom categories
Classified documents are ranked by relevance
View contents of individual documents – sentences are highlighted by their relevance to the category
7
The Gene Ontology – A Good First Step
The Initial Problem: Individual curators evaluate data differently
Protein Modification
MAPK-KK Cascade
Activation of p38 MAP Kinase
While scientists can agree to use the word "kinase," they must also agree to support this by stating how and why they use "kinase," and consistently apply it. Only in this way can they hope to compare gene products and find out if and how they are related.
The Initial Solution: The Gene Ontology (GO) – A controlled vocabulary with defined relationships between items.
GO consists of more than 13,000 nodes, or ‘GO Terms’, divided into three main trees: Biological Process, Cellular Component and Molecular Function
Of these, only about 3800 GO Terms are ‘active’ – that is, terms annotated with more than just one or two publications.
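A controlled vocabulary with defined relationships can be pictured as a directed graph of terms. The sketch below is a toy fragment written for illustration: the term names, the is_a edges and the document identifiers are assumptions, not copied from a GO release or from MEDLINE. It shows how a document annotated with a specific term can also be found under any broader term.

```python
# Toy is_a edges: child term -> list of parent terms (illustrative only).
IS_A = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}

# Hypothetical annotations: document id -> the term a curator assigned.
ANNOTATIONS = {
    "doc-1": "protein kinase activity",
    "doc-2": "transferase activity",
    "doc-3": "kinase activity",
}

def ancestors(term, is_a=IS_A):
    """All broader terms reachable from term by following is_a edges."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(is_a.get(t, []))
    return seen

def docs_for(term, annotations=ANNOTATIONS):
    """Documents annotated with term or with any more specific term."""
    return sorted(d for d, t in annotations.items()
                  if t == term or term in ancestors(t))

print(sorted(ancestors("protein kinase activity")))
print(docs_for("kinase activity"))
```

Because curators share one set of terms and relationships, a query for a broad term picks up documents annotated at any level of detail beneath it.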
8
The Gene Ontology Knowledge Discovery System (GO KDS)
GO is only a partial solution
• Enormous gap between GO-annotated documents (27,000) and the full MEDLINE database (12 million entries)
• Updates lag behind
• Scientists must understand and agree to use the GO
• Knowledge changes and alters definitions
Using GO “as is” takes too long and delivers too little
GO KDS – Filling the gaps in GO
• GO KDS bridges the gap by classifying all of MEDLINE
• New documents are classified as they’re added
• Scientists can now annotate gene targets quickly and reliably
• GO KDS is updated along with GO and MEDLINE
9
GO KDS Interface Tour
Current GO term(s) open
Location of listed term in GO
All sub-terms for the listed term: click on a term to further refine your search
Enter a keyword to search in this GO category
Opens abstract in separate window
Color of stars identifies the GO branch; number of stars indicates confidence of category placement
Original GO classifications (by domain expert)
KDS discovers novel classifications
10
GO KDS Key Benefits
Quickly sort documents into the categories most relevant to the user
Replace laborious annotation by domain experts with a trainable, automated system
Discover conceptual links between previously unrelated scientific domains
Identify key articles for pertinent research
Integrate public, private and proprietary documents
www.go-kds.com
11
How is document classification useful?
Drug Approval
• Collecting information
• Organizing/Collating documents
• Satisfying approval criteria
Life Science Research
• Finding relevant literature
• Prioritizing articles/reports
• Discovering hidden connections
• Distributing information
Patent preparation
• Searching patent databases
• Collecting relevant documents
• Synthesizing information
12
Intelligent Text Mining: Therapeutic Courses
One Reel Two client is using the Classification System to rapidly sort through large volumes of medical documentation in disparate therapeutic areas.
The Problem: The client must generate e-learning courses from hundreds of pages of reports, literature and product documentation supplied by client
Old Solution: Manually read through documents to find paragraphs related to ‘Diagnosis’, ‘Etiology’, ‘Epidemiology’, etc.
New Solution: Use the Reel Two Classification System to build a custom taxonomy, then automatically classify and extract relevant document sections into Therapeutic Area categories
13
Intelligent Text Mining – Patent Analysis
Identifying ‘Mechanism of Action’ in life science patents
Search patent filings for the ideas or concepts behind one’s analysis
– Explore state of prior art, competitive landscape or ‘innovation gaps’
– Overcome intentionally vague language in patent filings
Example Project
Patents are classified according to a taxonomy built by the client:
Alzheimer’s Patents
MoA: 5-HT Inhibitor
MoA: Acetylcholinesterase
MoA: Antioxidant
MoA: Antiviral…
Sample Output
In an in vitro assay, 2-chloro-5-(3-(R)-pyrrolidinylmethoxy)-3-pyridinecarbaldoxime (Ia) exhibited a Ki value for binding to neuronal nicotinic acetylcholine receptors of 0.012 nM.
ACTIVITY - Analgesic; neuroprotective; nootropic; antiparkinsonian; neuroleptic; tranquilizer; antiinflammatory; antidepressant; anabolic; anorectic; anticonvulsant; uropathic; gastrointestinal; antiaddictive; gynecological.
MECHANISM OF ACTION - Neurotransmitter release modulator.
The Mechanism of Action listed for this patent is "Neurotransmitter release modulator." However, the Classification System identified that this chemical modulator binds to the acetylcholine receptor, which is the true mechanism of action, and classified this patent in “MoA: Acetylcholinesterase”.
14
“Life Science Information Management will form the largest unmet need for IT companies in the 21st Century”
Caroline Kovak, General Manager, IBM Life Sciences
15
Appendix: GO KDS Interface
1. Search for a particular GO term by opening one of the main branches
17
Appendix
3. Select the desired GO term. ‘Open’ the category by clicking on ‘new search with this term.’
19
Appendix
5. Discover conceptual links to other GO categories. Click on the category to add the term to your search.