Statistical Learning from Relational Data
Transcript of Statistical Learning from Relational Data
Daphne Koller, Stanford University
Joint work with many many people
Relational Data is Everywhere
The web: webpages (& the entities they represent), hyperlinks
Social networks: people, institutions, friendship links
Biological data: genes, proteins, interactions, regulation
Bibliometrics: papers, authors, journals, citations
Corporate databases: customers, products, transactions
Relational Data is Different
Data instances are not independent: topics of linked webpages are correlated
Data instances are not identically distributed: heterogeneous instances (papers, authors)
No IID assumption: this is a good thing!
New Learning Tasks
Collective classification of related instances: labeling an entire website of related webpages
Relational clustering: finding coherent clusters in the genome
Link prediction & classification: predicting when two people are likely to be friends
Pattern detection in networks of related objects: finding groups (research groups, terrorist groups)
Probabilistic Models
Uncertainty model: a space of "possible worlds" with a probability distribution over this space
Worlds are often defined via a set of state variables (medical diagnosis: diseases, symptoms, findings, ...); each world is an assignment of values to the variables
The number of worlds is exponential in the number of variables: 2^n if we have n binary variables
Outline
Relational Bayesian networks*
Relational Markov networks
Collective classification
Relational clustering
* with Avi Pfeffer, Nir Friedman, Lise Getoor
Bayesian Networks
nodes = variables; edges = direct influence
Graph structure encodes independence assumptions: Letter conditionally independent of Intelligence given Grade
[Figure: the student Bayesian network with nodes Difficulty, Intelligence, Grade, SAT, and Job, and a bar chart of the CPD P(G|D,I): a distribution over grades A/B/C for each (easy/hard, low/high) combination]
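To make the CPD concrete, here is a minimal sketch of ancestral sampling in the student network, with made-up probability values (not numbers from the talk): the root variables Difficulty and Intelligence are sampled first, then Grade is drawn from the matching row of P(G|D,I).

```python
import random

# Made-up parameters for the student network (illustrative only)
p_difficulty_hard = 0.4
p_intelligence_high = 0.3
cpd_grade = {  # P(Grade | Difficulty, Intelligence) over grades A/B/C
    ("easy", "high"): [0.8, 0.15, 0.05],
    ("easy", "low"):  [0.3, 0.4, 0.3],
    ("hard", "high"): [0.5, 0.3, 0.2],
    ("hard", "low"):  [0.1, 0.3, 0.6],
}

def sample_student():
    """Ancestral sampling: roots first, then the child CPD row."""
    d = "hard" if random.random() < p_difficulty_hard else "easy"
    i = "high" if random.random() < p_intelligence_high else "low"
    g = random.choices(["A", "B", "C"], weights=cpd_grade[(d, i)])[0]
    return d, i, g

random.seed(0)
samples = [sample_student() for _ in range(10000)]
easy_high = [g for d, i, g in samples if (d, i) == ("easy", "high")]
frac_a = sum(g == "A" for g in easy_high) / len(easy_high)
print(round(frac_a, 2))  # close to 0.8, matching the CPD row above
```

The empirical grade frequencies recover the CPD row, which is exactly what the bar chart on the slide depicts.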
Bayesian Networks: Problem
Bayesian nets use a propositional representation
The real world has objects, related to each other
[Figure: ground network with one copy of the template per instance: Intell_Jane and Diffic_CS101 as parents of Grade_Jane_CS101; Intell_George and Diffic_Geo101 as parents of Grade_George_Geo101; Intell_George and Diffic_CS101 as parents of Grade_George_CS101]
These “instances” are not independent
Relational Schema
Specifies the types of objects in the domain, the attributes of each type of object, and the types of relations between objects
[Figure: schema with classes Professor (Teaching-Ability), Student (Intelligence), Course (Difficulty), and Registration (Grade, Satisfaction), connected by the relations Teach, In, and Take]
St. Nordaf University
[Figure: one possible world: Prof. Smith and Prof. Jones (each with a Teaching-ability), the courses CS101 and Geo101 (each with a Difficulty), and the students George and Jane (each with an Intelligence), connected by Teaches, In-course, and Registered links, with a Grade and Satisfaction for each registration]
Relational Bayesian Networks
Universals: probabilistic patterns hold for all objects in a class
Locality: represent direct probabilistic dependencies; links define potential interactions
[Figure: template model over the classes Student (Intelligence), Course (Difficulty), Professor (Teaching-Ability), and Reg (Grade, Satisfaction), with dependencies following the relational links]
[K. & Pfeffer; Poole; Ngo & Haddawy]
RBN Semantics
[Figure: ground network for Prof. Smith, Prof. Jones, CS101, Geo101, George, and Jane, with a Teaching-ability, Difficulty, Intelligence, Grade, and Satisfaction variable for each object and the template CPDs shared across instances]
Ground model: variables are the attributes of all objects; dependencies are determined by the relational links & the template model
The Web of Influence
[Figure: inferred posteriors in the ground network: bar charts over intelligence (low / high) and course difficulty (easy / hard); evidence about grades propagates between students and courses through shared links]
Outline
Relational Bayesian networks*
Relational Markov networks†
Collective classification
Relational clustering
* with Avi Pfeffer, Nir Friedman, Lise Getoor
† with Ben Taskar, Pieter Abbeel
Why Undirected Models?
Symmetric, non-causal interactions: e.g., on the web, categories of linked pages are correlated, and we cannot introduce directed edges because of cycles
Patterns involving multiple entities: e.g., "triangle" patterns on the web, where directed edges are not appropriate
"Solution": impose an arbitrary direction? It is not clear how to parameterize a CPD for variables involved in multiple interactions, and it is very difficult within a class-based parameterization
[Taskar, Abbeel, K. 2001]
Markov Networks
[Figure: Markov network over five people: James (J), Kyle (K), Laura (L), Mary (M), and Noah (N), with edges J-K, J-L, K-L, J-M, M-N, L-N]
P(J,K,L,M,N) = (1/Z) φ(J,K) φ(J,L) φ(K,L) φ(J,M) φ(M,N) φ(L,N)
[Figure: template potential shown as a bar chart over the value pairs AA, AB, AC, BA, BB, BC, CA, CB, CC]
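A sketch of this factorization with toy potential values (the φ values here are invented): the joint distribution is the product of pairwise potentials over the graph's edges, normalized by the partition function Z.

```python
from itertools import product

# Edge set from the slide's five-person example graph
names = ["J", "K", "L", "M", "N"]
edges = [("J", "K"), ("J", "L"), ("K", "L"), ("J", "M"), ("M", "N"), ("L", "N")]

def phi(a, b):
    # toy "agreement" potential: neighbors prefer matching values
    return 2.0 if a == b else 0.5

def unnormalized(assign):
    p = 1.0
    for u, v in edges:
        p *= phi(assign[u], assign[v])
    return p

# Partition function Z: sum the product of potentials over all 2^5 worlds
worlds = [dict(zip(names, vals)) for vals in product([0, 1], repeat=5)]
Z = sum(unnormalized(w) for w in worlds)
probs = [unnormalized(w) / Z for w in worlds]
print(abs(sum(probs) - 1.0) < 1e-9)  # True: a valid joint distribution
```

Note that Z already requires summing over all joint assignments, which is why inference in large Markov networks needs approximate methods.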
Relational Markov Networks
Universals: probabilistic patterns hold for all groups of objects
Locality: represent local probabilistic dependencies; sets of links give us possible interactions
[Figure: Study Group template: Student1.Intelligence, Student2.Intelligence, Course.Difficulty, and a template potential over the pair (Reg1.Grade, Reg2.Grade)]
RMN Semantics
[Figure: ground network for George, Jane, and Jill in CS101 and Geo101: the study-group template potential is instantiated over the grades of each study group (the CS study group and the Geo study group)]
Outline
Relational Bayesian networks
Relational Markov networks
Collective classification*
Discriminative training; web page classification; link prediction
Relational clustering
* with Ben Taskar, Carlos Guestrin, Ming Fai Wong, Pieter Abbeel
Model Structure
[Figure: pipeline: training data → learning → probabilistic relational model over Course, Student, and Reg → inference on new data → conclusions]
Collective Classification
Example: train on one year of student intelligence, course difficulty, and grades; given only grades in the following year, predict all students' intelligence
Training data: features x, labels y*
New data: features x', labels y' (to be predicted)
Learning RMN Parameters
[Figure: study-group template: Student1.Intelligence, Student2.Intelligence, Course.Difficulty, and the template potential over (Reg1.Grade, Reg2.Grade)]
Parameterize potentials as a log-linear model:
P_w(x) = (1/Z) exp(w^T f(x))
φ(Reg1.Grade, Reg2.Grade) = exp(w_AA f_AA + w_AB f_AB + ... + w_CC f_CC)
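A minimal sketch of this log-linear template parameterization, with made-up weights w_{gg'}: a single shared weight vector defines the potential φ(Reg1.Grade, Reg2.Grade), and the same template is applied to every instantiated study-group pair.

```python
import math

# Hypothetical shared weights w_{gg'} over pairs of grades; the same
# template weights parameterize every instantiated study-group edge.
w = {("A", "A"): 1.2, ("A", "B"): 0.4, ("A", "C"): -0.8,
     ("B", "A"): 0.4, ("B", "B"): 0.9, ("B", "C"): 0.1,
     ("C", "A"): -0.8, ("C", "B"): 0.1, ("C", "C"): 0.7}

def template_potential(g1, g2):
    # phi(g1, g2) = exp(w_{g1 g2}); the indicator feature f selects one weight
    return math.exp(w[(g1, g2)])

# The same template scores every pair of registrations in a study group:
score = template_potential("A", "A") * template_potential("A", "B")
print(round(score, 3))  # exp(1.2) * exp(0.4) = exp(1.6) ≈ 4.953
```

Because the weights are shared across all instantiations, the number of parameters is independent of the number of students and courses in the world.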
Max Likelihood Estimation
Estimation: maximize_w log P_w(y* | x), not log P_w(y*, x): we don't care about the joint distribution P(x, y)
Classification: argmax_y log P_w(y | x')
Web KB
[Figure: example entities and relations: Tom Mitchell (Professor), the WebKB Project, and Sean Slattery (Student), with Advisor-of, Project-of, and Member links]
[Craven et al.]
Web Classification Experiments
WebKB dataset: four CS department websites; bag of words on each page; links between pages; anchor text for links
Experimental setup: trained on three universities, tested on the fourth; repeated for all four combinations
[Figure: a sample professor's page with words such as "professor", "department", "extract", "information", "computer", "science", "machine", "learning", ...]
Standard Classification
Categories: faculty, course, project, student, other
[Figure: per-page model: the page's Category is the parent of its words Word1 ... WordN]
Standard Classification
[Figure: the same per-page model extended with words from incoming link anchors (e.g., "working with Tom Mitchell ..."): Category is the parent of Word1 ... WordN and of the link words]
[Figure: test-set error of the logistic baseline]
4-fold CV: trained on 3 universities, tested on the 4th
Discriminatively trained naïve Markov = logistic regression
Power of Context
[Figure: a page that is ambiguous in isolation: Professor? Student? Post-doc?]
Collective Classification
[Figure: for each link, the Category of the From-Page and the Category of the To-Page are coupled through a compatibility potential φ(From, To) over all pairs of categories, alongside each page's word evidence]
Collective Classification
[Figure: the same model: logistic word evidence on each page plus compatibility potentials on links]
Classify all pages collectively, maximizing the joint label probability
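The talk performs joint probabilistic inference in the RMN; as a rough illustration of why collective labeling helps, here is a much simpler iterative-classification sketch (toy word scores and a hypothetical agreement bonus, not the paper's model): each page is repeatedly relabeled given its local word score plus agreement with its neighbors' current labels.

```python
# Toy collective classification sketch (not the RMN's joint inference):
# each page has a local per-category score from its words, plus a bonus
# for agreeing with linked pages; iterate label updates to a fixed point.
local_score = {  # hypothetical word-based scores
    "p1": {"faculty": 2.0, "student": 1.0},
    "p2": {"faculty": 0.9, "student": 1.1},   # ambiguous from words alone
    "p3": {"faculty": 1.8, "student": 0.5},
}
links = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"]}
AGREE_BONUS = 0.5  # compatibility reward for matching a neighbor's label

labels = {p: max(s, key=s.get) for p, s in local_score.items()}  # local init
for _ in range(5):
    for p in labels:
        def score(c):
            bonus = sum(AGREE_BONUS for q in links[p] if labels[q] == c)
            return local_score[p][c] + bonus
        labels[p] = max(local_score[p], key=score)
print(labels["p2"])  # prints "faculty": neighbors pull the ambiguous page over
```

The ambiguous page p2 is mislabeled by its words alone but corrected by the context of its confidently-labeled neighbors, which is the effect the slide's "power of context" example illustrates.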
[Figure: test-set error of the logistic vs. collective (links) models]
[Taskar, Abbeel, K., 2002]
More Complex Structure
[Figure: patterns relating a faculty member's Category to the Categories of linked Students and Courses, with word evidence W1 ... Wn]
Collective Classification: Results
[Figure: test-set error for the Logistic, Links, Section, and Link+Section models]
[Taskar, Abbeel, K., 2002]
35.4% error reduction over logistic
Max Conditional Likelihood
Estimation: maximize_w log P_w(y* | x) = w^T f(x, y*) - log Z_w(x), where P_w(y | x) = (1/Z_w(x)) exp(w^T f(x, y))
Classification: argmax_y log P_w(y | x') = argmax_y w^T f(x', y)
In fact, we don't care about the conditional distribution P(y | x) either
Margin constraints: for all y ≠ y*, w^T [f(x, y*) - f(x, y)] ≥ margin × (# labeling mistakes in y)
Max Margin Estimation
[Taskar, Guestrin, K., 2003] (see also [Collins, 2002; Hoffman 2003])
What we really want: correct class labels
Estimation: maximize the margin γ subject to ||w|| = 1 and w^T [f(x, y*) - f(x, y)] ≥ γ × (# labeling mistakes in y) for all y: a quadratic program with exponentially many constraints
Classification: argmax_y w^T f(x', y)
Max Margin Markov Networks
We use the structure of the Markov network to give an equivalent formulation of the QP that is exponential only in the tree width of the network; complexity = max-likelihood classification
Can solve approximately in networks where the induced width is too large, analogous to loopy belief propagation
Can use kernel-based features: SVMs meet graphical models
[Taskar, Guestrin, K., 2003]
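A brute-force check of the structured-margin condition on a toy problem (features, weights, and the labeling space are all invented for illustration): the score gap between the true labeling y* and any other labeling y must grow with the number of label mistakes in y.

```python
from itertools import product

# Toy structured problem: labelings are tuples over {0,1}^3; the feature
# vector counts position-wise agreement with the true labeling (made up).
y_star = (1, 0, 1)

def f(y):
    # hypothetical per-position indicator features
    return [1.0 if y[i] == y_star[i] else 0.0 for i in range(3)]

w = [1.0, 1.0, 1.0]
gamma = 1.0

def score(y):
    return sum(wi * fi for wi, fi in zip(w, f(y)))

def hamming(y):
    # number of labeling mistakes relative to y*
    return sum(a != b for a, b in zip(y, y_star))

# Margin scaled by Hamming loss: the score gap must grow with # mistakes
ok = all(score(y_star) - score(y) >= gamma * hamming(y)
         for y in product([0, 1], repeat=3))
print(ok)  # True
```

Enumerating all labelings is only feasible on a toy problem; the point of the M3N formulation above is to enforce these exponentially many constraints implicitly, using the Markov network structure.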
WebKB Revisited
[Figure: test error for logistic, conditional-likelihood, and max-margin training]
16.1% relative reduction in error relative to conditional-likelihood RMNs
Predicting Relationships
Even more interesting: predicting relationships between objects
[Figure: Tom Mitchell (Professor), the WebKB Project, and Sean Slattery (Student), with candidate Advisor-of and Member links]
Predicting Relations
Introduce an exists/type attribute for each potential link; learn a discriminative model for this attribute; collectively predict its value in the new world
[Figure: model with the Categories and words of the From-Page and To-Page, the link's anchor words, and the link's Exists/Type variable; bar chart comparing the error of flat vs. collective prediction]
72.9% error reduction over flat
[Taskar, Wong, Abbeel, K., 2003]
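A sketch of link prediction as classification of a hypothetical exists/type variable: features are drawn from both endpoint pages and the anchor text, and scored with a logistic model (all feature names and weights below are made up for illustration).

```python
import math

# Sketch: score a candidate (from_page, to_page) link with a logistic
# model over features of both endpoints plus the anchor text.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(from_cat, to_cat, anchor_has_name):
    # hypothetical feature extractor for a candidate advisor-of link
    return {
        "from_is_faculty": 1.0 if from_cat == "faculty" else 0.0,
        "to_is_student": 1.0 if to_cat == "student" else 0.0,
        "anchor_has_person_name": 1.0 if anchor_has_name else 0.0,
    }

w = {"from_is_faculty": 1.5, "to_is_student": 1.0,
     "anchor_has_person_name": 2.0}   # toy weights, not learned values

def p_link(from_cat, to_cat, anchor_has_name):
    f = features(from_cat, to_cat, anchor_has_name)
    return sigmoid(sum(w[k] * f[k] for k in w))

print(p_link("faculty", "student", True) > p_link("other", "other", False))
```

The collective version in the talk goes further: the endpoint Categories are themselves predicted jointly with the link variables, so link and label decisions reinforce each other.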
Outline
Relational Bayesian networks
Relational Markov networks
Collective classification
Relational clustering: movie data*, biological data†
* with Ben Taskar, Eran Segal
† with Eran Segal, Nir Friedman, Aviv Regev, Dana Pe’er, Haidong Wang, Micha Shapira, David Botstein
Model Structure
[Figure: pipeline: unlabeled relational data → learning → probabilistic relational model over Course, Student, and Reg]
Relational Clustering
Example: given only students' grades, cluster similar students (clustering of instances)
Learning w. Missing Data: EM
The EM algorithm applies essentially unchanged: the E-step computes expected sufficient statistics, aggregated over all objects in a class; the M-step uses ML (or MAP) parameter estimation
Key difference: in general, the hidden variables are not independent, so computing the expected sufficient statistics requires inference over the entire network
P(Registration.Grade | Course.Difficulty, Student.Intelligence)
[Figure: EM iterations re-estimating the CPD P(Grade | Difficulty, Intelligence): bar charts over grades A/B/C for each (easy/hard, low/high) combination, aggregated over all students and courses]
[Dempster et al. 77]
Movie Data
Internet Movie Database: http://www.imdb.com
[Figure: schema with Actor, Director, and Movie (Genres, Rating, Year, #Votes, MPAA Rating)]
Discovering Hidden Types
[Figure: each of Actor, Director, and Movie gets a hidden Type variable]
[Taskar, Segal, K., 2001]
Learn model using EM
Directors:
Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher
Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola
Actors:
Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman
Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger
...
Movies:
Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson
...
Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October
Discovering Hidden Types
[Taskar, Segal, K., 2001]
Biology 101: Gene Expression
[Figure: two genes on DNA, each with a control region and a coding region; DNA is transcribed to RNA and translated to protein; the transcription factor Swi5 binds a gene's control region]
Cells express different subsets of their genes in different tissues and under different conditions
Gene Expression Microarrays
Measure the mRNA level of all genes in one condition; hundreds of experiments; highly noisy
[Figure: genes x experiments matrix; entry (i, j) is the expression of gene i in experiment j, shown as induced or repressed]
Standard Analysis
Cluster genes by similarity of expression profiles
Manually examine clusters to understand what's common to the genes in each cluster
General Approach
Expression level is a function of gene properties and experiment properties
Learn the model that best explains the data:
• Observed properties: gene sequence, array condition, ...
• Hidden properties: gene cluster
• Assignment to hidden variables (e.g., module assignment)
• Expression level as a function of properties
[Figure: Gene attributes and Experiment attributes determine the expression Level of gene i in experiment j; the Gene class carries a hidden Cluster ID]
Clustering as a PRM
[Figure: naive Bayes clustering model: the hidden cluster g.C is the parent of the gene's expression levels g.E1, g.E2, ..., g.Ek, with one CPD P(Ei.L | g.C) per experiment]
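This naive Bayes clustering model can be fit with EM; here is a minimal self-contained sketch on binary expression profiles (the data, the two-cluster setup, and the initialization are all toy values, not the talk's model):

```python
import math

# EM for naive-Bayes clustering of binary expression profiles
# (1 = induced, 0 = repressed); toy data and initialization.
data = [[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]  # genes x exps
K, N = 2, len(data[0])
pi = [0.5, 0.5]                        # cluster priors
theta = [[0.6, 0.6, 0.4, 0.4],         # slightly asymmetric init
         [0.4, 0.4, 0.6, 0.6]]         # breaks the cluster symmetry

for _ in range(25):
    # E-step: responsibility of each cluster for each gene's profile
    resp = []
    for x in data:
        p = [pi[k] * math.prod(theta[k][j] if x[j] else 1 - theta[k][j]
                               for j in range(N)) for k in range(K)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M-step: re-estimate priors and per-cluster Bernoulli parameters
    for k in range(K):
        nk = sum(r[k] for r in resp)
        pi[k] = nk / len(data)
        theta[k] = [sum(r[k] * x[j] for r, x in zip(resp, data)) / nk
                    for j in range(N)]

assign = [max(range(K), key=lambda k: r[k]) for r in resp]
print(assign)  # → [0, 0, 1, 1]: the two profile patterns separate
```

In the relational setting the key difference noted above still applies: here the genes' hidden clusters are independent given the parameters, whereas in the full model the E-step requires inference over the whole ground network.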
Modular Regulation
Learn functional modules: clusters of genes that are similarly controlled
Learn a control program for each module: expression as a function of control genes
[Figure: a control program as a decision tree over control genes such as HAP4 and CMK1 (true/false splits)]
[Segal, Regev, Pe'er, Koller, Friedman, 2003]
Module Network PRM
[Figure: a gene's expression Level depends on its module (Cluster) and on the activity levels of control genes Control1, Control2, ..., Controlk in the experiment; each module's program is a decision tree over control genes such as HAP4, CMK1, BMH1, Yer184c, GIC2, USV1, FAR1, and APG1 (shown for Cluster 1 and Cluster 2)]
Experimental Results
Yeast stress data (Gasch et al.): 2355 genes that showed activity; 173 experiments (microarrays) covering diverse environmental stress conditions (e.g., heat shock)
Learned a module network with 50 modules: cluster assignments are hidden variables, and the structure of the dependency trees is unknown
Learned the model using the structural EM algorithm
Segal et al., Nature Genetics, 2003
Biological Evaluation
Find sets of co-regulated genes (regulatory module)
Find the regulators of each module
[Segal et al., Nature Genetics, 2003]
[Figure: validation counts: 46/50 and 30/50 modules]
Experimental Results
Hypothesis: regulator 'X' regulates process 'Y'
Experiment: knock out 'X' and rerun the experiment
[Figure: a module's control program with the candidate regulator X marked]
[Segal et al., Nature Genetics, 2003]
Differentially Expressed Genes
[Figure: expression time courses for wild type vs. each knockout: Ypl230w (hours, >16x threshold, 341 differentially expressed genes), Ppt1 (minutes, >4x, 602 genes), Kin82 (minutes, >4x, 281 genes)]
[Segal et al., Nature Genetics, 2003]
Were the differentially expressed genes predicted as targets?
Rank modules by enrichment for differentially expressed genes.

Ppt1:
#  Module                               Significance
14 Ribosomal and phosphate metabolism   8/32, 9e-3
11 Amino acid and purine metabolism     11/53, 1e-2
15 mRNA, rRNA and tRNA processing       9/43, 2e-2
39 Protein folding                      6/23, 2e-2
30 Cell cycle                           7/30, 2e-2

Ypl230w:
#  Module                               Significance
39 Protein folding                      7/23, 1e-4
29 Cell differentiation                 6/41, 2e-2
5  Glycolysis and folding               5/37, 4e-2
34 Mitochondrial and protein fate       5/37, 4e-2

Kin82:
#  Module                               Significance
3  Energy and osmotic stress I          8/31, 1e-4
2  Energy, osmolarity & cAMP signaling  9/64, 6e-3
15 mRNA, rRNA and tRNA processing       6/43, 2e-2
Biological Experiments Validation
All regulators regulate predicted modules
[Segal et al., Nature Genetics, 2003]
Biology 102: Pathways
Pathways are sets of genes that act together to achieve a common function
Finding Pathways: Attempt I
Use protein-protein interaction data
Problems: the data is very noisy, and structure is lost: there is one large connected component in the interaction graph (3527/3589 genes)
Finding Pathways: Attempt II
Use expression microarray clusters
[Figure: expression clusters labeled Pathway I and Pathway II]
Problems: expression is only a 'weak' indicator of interaction, and interacting pathways are not separable
Finding Pathways: Our Approach
Use both types of data to find pathways: find "active" interactions using gene expression, and find pathway-related co-expression using interactions
[Figure: Pathways I-IV identified from the combined expression and interaction data]
[Segal, Wang, K., 2003]
Probabilistic Model
[Figure: each gene's Pathway (cluster) variable is the parent of its expression levels in N arrays (Exp1 ... ExpN); for each pair of genes whose protein products interact, a compatibility potential over (g1.C, g2.C) couples their cluster assignments, rewarding assignment to the same pathway]
[Segal, Wang, K., 2003]
Cluster all genes collectively, maximizing the joint model likelihood
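A toy version of this collective objective (all scores and the bonus value are invented): each gene has an expression-based log-score per cluster, each protein-protein interaction adds a log-compatibility bonus when both genes share a cluster, and we maximize the joint score by brute force over assignments.

```python
from itertools import product

# Toy joint objective: expression scores per pathway cluster plus a
# compatibility bonus for interacting genes assigned to the same cluster.
expr_logscore = {                 # hypothetical log P(expression | cluster)
    "g1": [-1.0, -2.0],
    "g2": [-1.8, -1.6],           # nearly ambiguous from expression alone
    "g3": [-2.5, -0.9],
}
interactions = [("g1", "g2")]     # g1 and g2's protein products interact
SAME_CLUSTER_LOG_BONUS = 1.0      # log-compatibility for agreeing

genes = sorted(expr_logscore)

def joint_logscore(assign):
    s = sum(expr_logscore[g][assign[g]] for g in genes)
    s += sum(SAME_CLUSTER_LOG_BONUS
             for a, b in interactions if assign[a] == assign[b])
    return s

best = max((dict(zip(genes, c)) for c in product([0, 1], repeat=len(genes))),
           key=joint_logscore)
print(best["g2"])  # → 0: the interaction pulls g2 into g1's cluster
```

Expression alone would weakly put g2 in cluster 1; the interaction evidence overrides it, which is exactly how the model recovers pathways that expression clustering alone cannot separate. On real genome-scale data this maximization is done with approximate inference, not enumeration.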
Capturing Protein Complexes
Independent data set of interacting proteins
[Figure: number of complexes vs. complex coverage (%) for our method vs. standard expression clustering]
124 complexes covered at 50% for our method
46 complexes covered at 50% for clustering
[Segal, Wang, K., 2003]
RNAse Complex Pathway
[Figure: pathway containing YHR081W, RRP40, RRP42, MTR3, RRP45, RRP4, RRP43, DIS3, TRM7, SKI6, RRP46, and CSL4, with their interaction links]
Includes all 10 known pathway genes
Only 5 genes found by clustering
[Segal, Wang, K., 2003]
Interaction Clustering
The RNAse complex is found by interaction clustering only as part of a cluster with 138 genes
[Segal, Wang, K., 2003]
Truth in Advertising
Huge graphical models: 3,000-50,000 hidden variables, hundreds of thousands of observed nodes, very densely connected
Learning: multiple iterations of model updates, each requiring inference on the model
Inference: exact inference is intractable, so we use belief propagation; a single inference iteration takes 1-6 hours; algorithmic ideas are key to scaling
Relational Data: A New Challenge
Data consists of different types of instances
Instances are related in complex networks
Instances are not independent
New tasks for machine learning: collective classification, relational clustering, link prediction, group detection
Opportunity
http://robotics.stanford.edu/~koller/