BeeSpace Informatics Research
ChengXiang (“Cheng”) Zhai
Department of Computer Science
Institute for Genomic Biology
Statistics
Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
BeeSpace Workshop, May 22, 2009
Goal of Informatics Research
• Develop general and scalable computational methods to enable
– Semantic integration of data and information
– Effective information access and exploration
– Knowledge discovery
– Hypothesis formulation and testing
• Reinforcement of research in biology and computer science
– CS research to automate manual tasks of biologists
– Biology research to raise new challenges for CS
Overview of BeeSpace Technology
[Architecture diagram. Input: Literature Text, Meta Data. Components: Search Engine; Natural Language Understanding (Words/Phrases, Entities, Relations); Text Miner; Relational Database; Function Annotator; Gene Summarizer; Space/Region Manager, Navigation Support. User-facing functions: Content Analysis; Information Access & Exploration; Question Answering; Knowledge Discovery & Hypothesis Testing.]
Informatics Research Accomplishments
Biomedical information retrieval [Jiang & Zhai 07], [Lu et al. 08]
Entity/Relation extraction [Jiang & Zhai 06], [Jiang & Zhai 07a], [Jiang & Zhai 07b]
Topic discovery and interpretation [Mei et al. 06a], [Mei et al. 07a], [Mei et al. 07b], [Chee & Schatz 08]
Entity/Gene Summarization [Ling et al. 06], [Ling et al. 07], [Ling et al. 08]
Automatic Function Annotation [He et al. 09/10]
Overview of BeeSpace Technology
Part 1. Information Extraction
Part 2. Navigation Support
Part 3. Entity Summarization
Part 4. Function Analysis
Natural Language Understanding
…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to …
[Figure: parse of the sentence above, with noun phrases (NP) and verb phrases (VP) bracketed and two spans tagged as Gene.]
Entity & Relation Extraction
Genetic Interaction
Gene X    Gene Y
Bcd       hb
…         …

Expression Location
Gene X    Anatomy Y
Bcd       embryo
Hb        egg
…         …

…
Lopes FJ et al., 2005 J. Theor. Biol.
General Approach: Machine Learning
• Computers learn from labeled examples to compute a function to predict labels of new examples
• Examples of predictions
– Given a phrase, predict whether it is a gene name
– Given a sentence with two gene names mentioned, predict whether there is a genetic interaction relation
• Many learning methods are available, but training data isn’t always available
Extraction Example 1: Gene Name Recognition
… expression of terminal gap genes is mediated by the local activation of the Torso receptor tyrosine kinase (Tor). At the anterior, terminal gap genes are also activated by the Tor pathway but Bcd contributes to their activation.
[Figure callouts: three candidate phrases in the passage above are marked “Gene?”.]
Features for Recognizing Genes
• Syntactic clues:
– Capitalization (especially acronyms)
– Numbers (gene families)
– Punctuation: -, /, :, etc.
• Contextual clues:
– Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.
– Global: same noun phrase occurs several times in the same article
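The clues listed above can be turned into binary features per candidate token. A minimal sketch, assuming whitespace-tokenized text; the function name `gene_features`, the feature names, and the cue-word set are illustrative and not taken from the BeeSpace system:

```python
import re
from collections import Counter

# Hypothetical cue words, copied from the examples on the slide.
CONTEXT_CUES = {"gene", "encoding", "regulation", "expressed"}

def gene_features(tokens, i, window=2):
    """Return binary features for the candidate token at position i."""
    tok = tokens[i]
    counts = Counter(t.lower() for t in tokens)
    feats = {
        # Syntactic clues
        "starts_with_capital": tok[:1].isupper(),
        "is_acronym": len(tok) > 1 and tok.isupper(),
        "contains_digit": any(c.isdigit() for c in tok),
        "contains_punct": bool(re.search(r"[-/:]", tok)),
        # Global contextual clue: the same token recurs in the text
        "repeated_in_article": counts[tok.lower()] > 1,
    }
    # Local contextual clues: cue words within `window` tokens of the candidate
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    neighbors = {tokens[j].lower().strip(".,()") for j in range(lo, hi) if j != i}
    for cue in CONTEXT_CUES:
        feats[f"near_{cue}"] = cue in neighbors
    return feats

tokens = "the CD38 gene is expressed by both neurons and glial cells".split()
features = gene_features(tokens, tokens.index("CD38"))
print(features)
```

Each feature is deliberately simple; a classifier then decides how much weight each clue deserves.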
Maximum Entropy Model for Gene Tagging
• Given an observation (a token or a noun phrase), together with its context, denoted as x
• Predict y ∈ {gene, non-gene}
• Maximum entropy model:
$P(y \mid x) = K \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, where K normalizes the probabilities over y
• Typical f:
– y = gene & candidate phrase starts with a capital letter
– y = gene & candidate phrase contains digits
• Estimate $\lambda_i$ with training data
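The model above can be evaluated directly once the weights are known. A minimal sketch, assuming binary features keyed by (feature name, label); the weight values are made up for illustration, whereas in practice they come from training data:

```python
import math

def maxent_prob(features, weights, labels=("gene", "non-gene")):
    """P(y|x) = K * exp(sum_i lambda_i * f_i(x, y)) for each label y.

    features: set of active observation-feature names for x
    weights:  dict mapping (feature_name, label) -> lambda_i
    """
    scores = {
        y: math.exp(sum(weights.get((f, y), 0.0) for f in features))
        for y in labels
    }
    k = 1.0 / sum(scores.values())  # K normalizes over the labels
    return {y: k * s for y, s in scores.items()}

# Hypothetical weights, for illustration only.
weights = {
    ("starts_with_capital", "gene"): 1.2,
    ("contains_digit", "gene"): 0.8,
    ("starts_with_capital", "non-gene"): 0.1,
}
probs = maxent_prob({"starts_with_capital", "contains_digit"}, weights)
print(probs)
```

With two labels this is equivalent to logistic regression over the feature vector.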
Gene Name Disambiguation
• Gene names can be common English words:
for (foraging), in (inturned), similar (sima), yellow (y), black (b)…
• Solution:
– Disambiguate by looking at the context of the candidate word
– Train a classifier
Sample Disambiguation Results
Candidate words “foraging”, “for” (score in brackets after each occurrence):
... affect complex behaviors such as locomotion and foraging [-1.468]. The foraging [+3.359] (for) [+5.497] gene encodes a pkg in drosophila melanogaster here we demonstrate a function for [-0.582] the for [+5.980] gene in sensory responsiveness and …
Candidate word “black”:
the cuticular melanization phenotype of black [-2.780] flies is rescued by beta-alanine but beta-alanine production by aspartate decarboxylation was reported to be normal in assays of black [+9.759] mutants and although …
Nov 27, 2007
Problem of Domain Overfitting
• Ideal setting: gene name recognizer 54.1%
• Realistic setting: gene name recognizer 28.1%
• Fly: wingless, daughterless, eyeless, apexless, …
Solution: Learn Generalizable Features
…decapentaplegic and wingless are expressed in analogous patterns in each primordium of…
…that CD38 is expressed by both neurons and glial cells…
…that PABPC5 is expressed in fetal brain and in a range of adult tissues.
Generalizable Feature: “w+2 = expressed”
Generalizability-Based Feature Ranking
[Figure: features such as “-less” and “expressed” are ranked (positions 1 to 8) separately within each set of training data, and the per-set ranks are combined into one generalizability-based ranking; example scores shown in the combined ranking: 0.125, 0.167.]
Effectiveness of Domain Adaptation
• Standard learning, Fly + Mouse → Yeast: gene name recognizer 63.3%
• Domain adaptive learning, Fly + Mouse → Yeast: gene name recognizer 75.9%
More Results on Domain Adaptation
Exp      Method     Precision   Recall    F1
F+M→Y    Baseline   0.557       0.466     0.508
         Domain     0.575       0.516     0.544
         % Imprv.   +3.2%       +10.7%    +7.1%
F+Y→M    Baseline   0.571       0.335     0.422
         Domain     0.582       0.381     0.461
         % Imprv.   +1.9%       +13.7%    +9.2%
M+Y→F    Baseline   0.583       0.097     0.166
         Domain     0.591       0.139     0.225
         % Imprv.   +1.4%       +43.3%    +35.5%
• Text data from BioCreAtIvE (Medline)
• 3 organisms: Fly (F), Mouse (M), Yeast (Y)
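The “% Imprv.” rows are consistent with relative improvement over the baseline, i.e. (domain − baseline) / baseline. A quick check on the F+M→Y rows, with the function name `relative_improvement` chosen here for illustration:

```python
def relative_improvement(baseline, adapted):
    """Percentage gain of `adapted` over `baseline`."""
    return 100.0 * (adapted - baseline) / baseline

# F+M→Y rows of the table: (metric, baseline, domain-adaptive)
rows = [("Precision", 0.557, 0.575), ("Recall", 0.466, 0.516), ("F1", 0.508, 0.544)]
for name, base, dom in rows:
    print(f"{name}: {relative_improvement(base, dom):+.1f}%")
# Precision: +3.2%
# Recall: +10.7%
# F1: +7.1%
```

The largest relative gains are in recall, most visibly in the M+Y→F setting where the baseline recall is only 0.097.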
Extraction Example 2: Genetic Interaction Relation
[Figure callouts: two gene mentions in the sentence below are marked “Gene”.]
Is there a genetic interaction relation here?
Bcd regulates the expression of the maternal and zygotic gene hunchback (hb) that shows a step-like-function expression pattern, in the anterior half of the egg.
Solution: Pseudo Training Data
Gene: Bcd +
These results uncovered an antagonism between hunchback and bicoid at the anterior pole, whereas the two genes are known to act in concert for most anterior segmented development.
Large-Scale Entity/Relation Extraction
• Entity annotation
Entity Type   Resource            Method
Gene          NCBI, FlyBase, …    Dictionary string search + machine learning
Anatomy       FlyBase             Dictionary string search
Chemical      MeSH, Biosis, …     Dictionary string search
Behavior      -                   “x x behavior” pattern search
• Relation extraction
Relation Type    Method
Regulatory       Pre-defined pattern + machine learning
Expressed In     Co-occurrence + relevant keywords
Gene Behavior    Co-occurrence
Gene Chemical    Co-occurrence
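Dictionary string search followed by sentence-level co-occurrence can be sketched in a few lines. This is an illustration under simplifying assumptions: the mini-dictionaries are hypothetical stand-ins for resources such as NCBI or FlyBase, matching is single-word and case-insensitive, and the function names are invented here:

```python
import re
from itertools import product

# Hypothetical mini-dictionaries standing in for the real resources.
DICTIONARIES = {
    "Gene": {"bcd", "hb", "hunchback"},
    "Anatomy": {"egg", "embryo"},
}

def annotate(sentence):
    """Dictionary string search: return {entity_type: set of matched names}."""
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return {etype: names & words for etype, names in DICTIONARIES.items()}

def cooccurrence_relations(sentences, type_x="Gene", type_y="Anatomy"):
    """Pair every type_x entity with every type_y entity in the same sentence."""
    pairs = set()
    for s in sentences:
        found = annotate(s)
        pairs.update(product(found[type_x], found[type_y]))
    return pairs

text = ["Bcd regulates the expression of hunchback (hb) in the anterior half of the egg."]
pairs = sorted(cooccurrence_relations(text))
print(pairs)
# [('bcd', 'egg'), ('hb', 'egg'), ('hunchback', 'egg')]
```

Co-occurrence alone over-generates, which is presumably why the “Expressed In” relation in the table adds relevant keywords as a filter.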
Space-Region Navigation
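[Diagram: Literature Spaces (Bee, Fly, Behavior, Bird, …) and Topic Regions (Bee Forager, Bird Singing, Fly Rover, …). MAP and EXTRACT operations connect spaces and regions, with SWITCHING between them. Spaces and regions can each be combined with Intersection, Union, … and saved under My Spaces and My Regions/Topics.]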
General Approach: Language Models
• Topic = word distribution
• Modeling text in a space with mixture models of multinomial distributions
• Text Mining = Parameter Estimation + Inferences
• Matching = Compute similarity between word distributions
• Users can “control” a model by specifying topic preferences
A Sample Topic & Corresponding Space
Word distribution (language model):
filaments 0.0410238, muscle 0.0327107, actin 0.0287701, z 0.0221623, filament 0.0169888, myosin 0.0153909, thick 0.00968766, thin 0.00926895, sections 0.00924286, er 0.00890264, band 0.00802833, muscles 0.00789018, antibodies 0.00736094, myofibrils 0.00688588, flight 0.00670859, images 0.00649626
Meaningful labels:
actin filaments, flight muscle, flight muscles
Example documents:
• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect fibrillar flight muscle
• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle
MAP: Topic/Region → Space
• MAP: Use the topic/region description as a query to search a given space
• Retrieval algorithm:
– Query word distribution: p(w|Q)
– Document word distribution: p(w|D)
– Score a document based on similarity of Q and D
• Leverage existing retrieval toolkits: Lemur/Indri
$$\mathrm{score}(Q, D) = -D(Q \,\|\, D) = -\sum_{w \in \text{Vocabulary}} p(w \mid Q) \log \frac{p(w \mid Q)}{p(w \mid D)}$$
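Scoring by negative KL divergence between the query and document word distributions can be sketched as follows. This is not the Lemur/Indri implementation: the smoothing of p(w|D) with a collection model (Jelinek-Mercer, weight `lam`) is an assumption added here so that query words absent from a document do not produce log(0), and the toy collection includes the query text for the same reason:

```python
import math
from collections import Counter

def unigram(text):
    """Maximum-likelihood word distribution p(w|text)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(p_q, p_d, p_coll, lam=0.5):
    """score(Q, D) = -KL(Q || D), with p(w|D) smoothed by the collection model."""
    total = 0.0
    for w, pq in p_q.items():  # words with p(w|Q) = 0 contribute nothing
        pd = (1 - lam) * p_d.get(w, 0.0) + lam * p_coll[w]
        total += pq * math.log(pq / pd)
    return -total

docs = {
    "d1": "actin filaments in honeybee flight muscle",
    "d2": "division of labor among foragers in the colony",
}
query = "flight muscle filaments"
# Toy collection model; the query is included so every query word has mass.
p_coll = unigram(" ".join(list(docs.values()) + [query]))
p_q = unigram(query)
ranked = sorted(docs, key=lambda d: score(p_q, unigram(docs[d]), p_coll), reverse=True)
print(ranked)
# ['d1', 'd2']
```

A topic region's word distribution can be used in place of `p_q`, which is what turns a region description into a query over a space.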
EXTRACT: Space → Topic/Region
• Assume k topics, each being represented by a word distribution
• Use a k-component mixture model to fit the documents in a given space (EM algorithm)
• The estimated k component word distributions are taken as k topic regions
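Likelihood:
$$\log p(C \mid \Lambda) = \sum_{D \in C} \sum_{i=1}^{|D|} \log\Big[\lambda\, p(D_i \mid \theta_B) + (1-\lambda) \sum_{j=1}^{k} \pi_j\, p(D_i \mid \theta_j)\Big]$$
Here C is the set of documents in the space, $D_i$ is the i-th word of document D, $\theta_j$ is the word distribution of topic j with mixing weight $\pi_j$, $\theta_B$ is an additional component B (presumably a background model) with weight $\lambda$, and $\Lambda$ denotes all parameters.
Maximum likelihood estimator: $\Lambda^* = \arg\max_{\Lambda} p(C \mid \Lambda)$
Bayesian estimator: $\Lambda^* = \arg\max_{\Lambda} p(\Lambda \mid C) = \arg\max_{\Lambda} p(C \mid \Lambda)\, p(\Lambda)$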
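A compact EM sketch for fitting such a mixture is given below. Several choices are assumptions of this illustration rather than details confirmed by the slides: the background model is the collection word distribution with a fixed weight `lam`, mixing weights are kept per document, and a user prior enters as `mu` pseudo-counts in the M-step (one common way to realize the Bayesian estimator). All names are hypothetical.

```python
import math
import random
from collections import Counter

def fit_topics(docs, k, lam=0.5, prior=None, mu=0.0, iters=50, seed=0):
    """EM for a k-topic word mixture with a fixed background model.

    prior: optional list of k dicts {word: prob}; mu * prob pseudo-counts
    are added in the M-step, giving a MAP instead of an ML update.
    """
    rng = random.Random(seed)
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    total = sum(sum(c.values()) for c in counts)
    bg = {w: sum(c[w] for c in counts) / total for w in vocab}

    def normalize(weights):
        z = sum(weights.values())
        return {w: v / z for w, v in weights.items()}

    theta = [normalize({w: rng.random() + 0.1 for w in vocab}) for _ in range(k)]
    pi = [[1.0 / k] * k for _ in docs]
    history = []  # log-likelihood before each update
    for _ in range(iters):
        expected = [{w: 0.0 for w in vocab} for _ in range(k)]
        loglik = 0.0
        for d, c in enumerate(counts):
            new_pi = [0.0] * k
            for w, n in c.items():
                topic_part = [pi[d][j] * theta[j][w] for j in range(k)]
                p_w = lam * bg[w] + (1 - lam) * sum(topic_part)
                loglik += n * math.log(p_w)
                for j in range(k):
                    # E-step: posterior that this word came from topic j
                    z = (1 - lam) * topic_part[j] / p_w
                    expected[j][w] += n * z
                    new_pi[j] += n * z
            s = sum(new_pi)
            if s > 0:
                pi[d] = [v / s for v in new_pi]
        # M-step: re-estimate topics, optionally pulled toward the prior
        for j in range(k):
            if prior is not None:
                for w, p in prior[j].items():
                    if w in expected[j]:
                        expected[j][w] += mu * p
            theta[j] = normalize(expected[j])
        history.append(loglik)
    return theta, pi, history

docs = ["foraging nectar food foraging", "division labor age division labor",
        "nectar food flower foraging", "age labor task division"]
theta, pi, history = fit_topics(docs, k=2)
for j, t in enumerate(theta):
    print(j, sorted(t, key=t.get, reverse=True)[:3])
```

Passing, say, `prior=[{"labor": 0.5, "division": 0.5}, {}]` with a large `mu` steers topic 0 toward those words, which is the mechanism behind user-controlled exploration.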
User-Controlled Exploration: Sample Topic 1
Topic: age 0.0672687, division 0.0551497, labor 0.052136, colony 0.038305, foraging 0.0357817, foragers 0.0236658, workers 0.0191248, task 0.0190672, behavioral 0.0189017, behavior 0.0168805, older 0.0143466, tasks 0.013823, old 0.011839, individual 0.0114329, ages 0.0102134, young 0.00985875, genotypic 0.00963096, social 0.00883439
Prior: labor 0.2, division 0.2

Topic: behavioral 0.110674, age 0.0789419, maturation 0.057956, task 0.0318285, division 0.0312101, labor 0.0293371, workers 0.0222682, colony 0.0199028, social 0.0188699, behavior 0.0171008, performance 0.0117176, foragers 0.0110682, genotypic 0.0106029, differences 0.0103761, polyethism 0.00904816, older 0.00808171, plasticity 0.00804363, changes 0.00794045
Prior: behavioral 0.2, maturation 0.2
User-Controlled Exploration: Sample Topic 2
Topic: foraging 0.290076, nectar 0.114508, food 0.106655, forage 0.0734919, colony 0.0660329, pollen 0.0427706, flower 0.0400582, sucrose 0.0334728, source 0.0319787, behavior 0.0283774, individual 0.028029, rate 0.0242806, recruitment 0.0200597, time 0.0197362, reward 0.0196271, task 0.0182461, sitter 0.00604067, rover 0.00582791, rovers 0.00306051

Topic: foraging 0.142473, foragers 0.0582921, forage 0.0557498, food 0.0393453, nectar 0.03217, colony 0.019416, source 0.0153349, hive 0.0151726, dance 0.013336, forager 0.0127668, information 0.0117961, feeder 0.010944, rate 0.0104752, recruitment 0.00870751, individual 0.0086414, reward 0.00810706, flower 0.00800705, dancing 0.00794827, behavior 0.00789228
Exploit Prior for Concept Switching
Automated Gene Summarization?
Multi-Aspect Gene Summary:
• Gene product
• Expression
• Sequence
• Interactions
• Mutations
• General Functions
General Entity Summarizer
• Task: Given any entity and k aspects to summarize, generate a semi-structured summary
• Assumption: Training sentences available for each aspect
• Method:
– Train a recognizer for each aspect
– Given an entity, retrieve sentences relevant to the entity
– Classify each sentence into one of the k aspects
– Choose the best sentences in each category
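The four method steps can be sketched as a small pipeline. In this illustration the trained per-aspect recognizers are replaced by hypothetical keyword sets, and the sentences are toy examples, so it shows only the control flow, not the actual BeeSpace models:

```python
import re
from collections import defaultdict

# Step 1 stand-in: keyword sets instead of trained aspect recognizers.
ASPECT_KEYWORDS = {
    "Expression": {"expressed"},
    "Interactions": {"regulates", "interacts"},
    "Mutations": {"mutant", "mutants", "mutation"},
}

def tokenize(sentence):
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def summarize(entity, sentences, per_aspect=1):
    """Build a semi-structured summary {aspect: [best sentences]}."""
    by_aspect = defaultdict(list)
    for s in sentences:
        words = tokenize(s)
        if entity.lower() not in words:          # step 2: retrieve relevant sentences
            continue
        scores = {a: len(words & kw) for a, kw in ASPECT_KEYWORDS.items()}
        best = max(scores, key=scores.get)       # step 3: classify into one aspect
        if scores[best] > 0:
            by_aspect[best].append((scores[best], s))
    # step 4: keep the highest-scoring sentences in each category
    return {a: [s for _, s in sorted(items, key=lambda x: -x[0])[:per_aspect]]
            for a, items in by_aspect.items()}

sentences = [
    "Bcd regulates the expression of the maternal and zygotic gene hunchback (hb).",
    "Bcd is expressed in the embryo.",
    "Assays of black mutants were reported to be normal.",
]
summary = summarize("Bcd", sentences)
print(summary)
```

Swapping `ASPECT_KEYWORDS` for a trained classifier per aspect changes only the scoring line; the retrieve, classify, select structure stays the same.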
Further Generalizations
• Task: Given any entity and k pre-specified aspects to summarize, generate a semi-structured summary
• Assumption: Training sentences available for each aspect
• Method:
– Train a recognizer for each aspect
– Given an entity, retrieve sentences relevant to the entity
– Classify each sentence into one of the k aspects
– Choose the best sentences in each category
New method based on mixture model and regularized optimization
Annotating Gene Lists: GO Terms vs. Literature Mining
Limitations of GO annotations:
- Labor-intensive
- Limited coverage
Literature Mining:
- Automatic
- Flexible exploration in the entire literature space
Overview of Gene List Annotator
[Diagram: a gene group (Bcd, Cad, …, Tll) is looked up via Entrez Gene; for any gene, its relevant documents are retrieved, giving document sets for Bcd, Cad, …, Tll; for any term, its significance is tested; the result is a ranked list of enriched concepts (Segmentation 56.0, Pattern 34.2, Cell_cycle 25.6, Development 22.1, Regulation 20.4, …) available for interactive analysis.]
Intuition for Literature-based Annotation
Gene                 TPI1   GPM1   PGK1   TDH3   TDH2
protein_kinase       0      0      2      0      0
decarboxylase        10     0      10     7      6
protein              39     26     65     44     33
stationary_phase     2      7      3      4      2
energy_metabolism    4      5      5      8      0
oscillation          0      0      0      0      1
Likelihood Ratio Test with 2-Poisson Mixture Model
Dataset distribution: Poisson(λ;d)
Reference distribution: Poisson(λ0;d)
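To illustrate the idea of testing a term's rate in the gene-list documents against a reference rate, here is a likelihood ratio statistic for a single Poisson rate. This is a simplification, not the 2-Poisson mixture model named on the slide; the document lengths and the reference rate in the example are made-up values:

```python
import math

def poisson_lrt(counts, lengths, lam0):
    """Likelihood ratio statistic for H0: rate = lam0 vs. H1: rate = MLE.

    counts[i] is the term count in document set i and lengths[i] its
    length d, with counts[i] ~ Poisson(rate * lengths[i]).
    """
    total_c, total_d = sum(counts), sum(lengths)
    lam_hat = total_c / total_d                    # MLE of the rate under H1
    stat = -2.0 * (lam_hat - lam0) * total_d
    if total_c > 0:
        stat += 2.0 * total_c * math.log(lam_hat / lam0)
    return stat

# Term counts for "decarboxylase" from the table; lengths and lam0 are hypothetical.
counts = [10, 0, 10, 7, 6]
lengths = [1000] * 5
stat = poisson_lrt(counts, lengths, lam0=0.001)
print(round(stat, 2))
```

A larger statistic means the observed rate departs more strongly from the reference rate; terms can then be ranked by this value.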
Agreement with GO-based Method
• Gene List: 93 genes up-regulated by the manganese treatment
GO Theme                Related Annotator terms
neurogenesis            axon guidance, growth cone, commissural axon, proneural gene
synaptic transmission   synaptic vesicle, neurotransmitter release, synaptic transmission, sodium channel
cytoskeletal protein    alpha tubulin, actin filament
cell communication      tight junction, heparan sulfate proteoglycan
Discovering Novel Themes
• Gene List: 69 genes up-regulated by the methoprene treatment
Theme                   Annotator terms
muscle                  flight muscle, muscle myosin, nonmuscle myosin, light chain, myosin ii, thick filament, thin filament, striated muscle
synaptic transmission   neurotransmitter release, synaptic transmission, synaptic vesicle
signaling pathway       notch signal
Summary
Part 1. Information Extraction
Part 2. Navigation Support
Part 3. Entity Summarization
Part 4. Function Analysis
Machine Learning + Language Models + Minimum Human Effort
General and scalable, but there’s room for deeper semantics
Looking Ahead…
• Knowledge integration, inferences
• Support for hypothesis formulation and testing
Exploring Knowledge Space
[Graph figure: gene nodes (A1, A1’, A2, A3, A4, A4’, A5) and behavior nodes (B1, B2, B3, B4) linked by typed edges: isa, Co-occur-fly, Co-occur-mos, Co-occur-bee, Orth-mos, orth, Reg.]
1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1’, A2, A3}
3. Y = Y + {A5, A6} → {A1, A1’, A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4’}
P = PathBetween(Z, B4, {co-occur, reg, isa})
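The NeighborOf steps can be mimicked on a small typed graph. The exact edges of the figure are not recoverable from the transcript, so the edges below are hypothetical and were chosen only so that the toy run yields the same sets as steps 1 to 4; the class and method names are likewise illustrative, and PathBetween is not covered:

```python
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        self.node_type = {}
        self.adj = defaultdict(set)  # node -> {(relation, neighbor)}

    def add_node(self, name, ntype):
        self.node_type[name] = ntype

    def add_edge(self, a, relation, b):
        # Edges are treated as undirected in this sketch.
        self.adj[a].add((relation, b))
        self.adj[b].add((relation, a))

    def neighbor_of(self, sources, ntype, relations):
        """Nodes of type `ntype` linked to any source by one of `relations`."""
        if isinstance(sources, str):
            sources = {sources}
        return {n for s in sources for rel, n in self.adj[s]
                if rel in relations and self.node_type[n] == ntype
                and n not in sources}

g = KnowledgeGraph()
for b in ["B1", "B2", "B3", "B4"]:
    g.add_node(b, "Behavior")
for a in ["A1", "A1'", "A2", "A3", "A4", "A4'", "A5", "A6"]:
    g.add_node(a, "Gene")
# Hypothetical edges (not read off the original figure).
for a, rel, b in [("B4", "co-occur", "B1"), ("B4", "co-occur", "B2"), ("B4", "isa", "B3"),
                  ("B1", "co-occur", "A1"), ("B1", "co-occur", "A1'"),
                  ("B2", "co-occur", "A2"), ("B3", "co-occur", "A3"),
                  ("A1", "orth", "A1'"), ("A1", "reg", "A4"),
                  ("A1'", "reg", "A4'"), ("A5", "reg", "A4")]:
    g.add_edge(a, rel, b)

X = g.neighbor_of("B4", "Behavior", {"co-occur", "isa"})   # step 1
Y = g.neighbor_of(X, "Gene", {"co-occur", "orth"})         # step 2
Y = Y | {"A5", "A6"}                                       # step 3: user adds genes
Z = g.neighbor_of(Y, "Gene", {"reg"})                      # step 4
print(sorted(X), sorted(Y), sorted(Z))
```

Each step returns a plain set, so user edits such as step 3 are ordinary set operations between queries.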
Full-Fledged BeeSpace V5
[Diagram components:]
Biomedical Literature
Entities: Gene, Behavior, Anatomy, Chemical
Relations: Orthology, Regulatory interaction, …
Experiment Data
Analysis
Additional entities and relations
Expert knowledge
Inferences
Hypothesis Formulation & Testing
Thanks to
Xin He (UIUC)
Jing Jiang (SMU)
Yanen Li (UIUC)
Xu Ling (UIUC)
Yue Lu (UIUC)
Qiaozhu Mei (UIUC/Michigan)
& Bruce Schatz (PI, BeeSpace)