1
Learning the Structure of Markov Logic
Networks
Stanley Kok & Pedro Domingos
Dept. of Computer Science and Eng.
University of Washington
2
Overview
Motivation
Background
Structure Learning Algorithm
Experiments
Future Work & Conclusion
3
Motivation
Statistical Relational Learning (SRL) combines the benefits of:
Statistical Learning: uses probability to handle uncertainty in a robust and principled way
Relational Learning: models domains with multiple relations
4
Motivation
Many SRL approaches combine a logical language and Bayesian networks, e.g. Probabilistic Relational Models [Friedman et al., 1999]
The need to avoid cycles in Bayesian networks causes many difficulties [Taskar et al., 2002]
Started using Markov networks instead
6
Motivation
Relational Markov Networks [Taskar et al., 2002]: conjunctive database queries + Markov networks; require space exponential in the size of their cliques
Markov Logic Networks [Richardson & Domingos, 2004]: first-order logic + Markov networks; compactly represent large cliques; did not learn structure (used an external ILP system)
This paper develops a fast algorithm that learns MLN structure: the most powerful SRL learner to date
7
Overview
Motivation
Background
Structure Learning Algorithm
Experiments
Future Work & Conclusion
8
Markov Logic Networks
A first-order KB is a set of hard constraints: violate even one formula, and a world has zero probability
MLNs soften the constraints: it is OK to violate formulas, and the fewer formulas a world violates, the more probable it is
Each formula is given a weight that reflects how strong a constraint it is
9
MLN Definition
A Markov Logic Network (MLN) is a set of pairs (F, w) where
F is a formula in first-order logic
w is a real number
Together with a finite set of constants, it defines a Markov network with
One node for each grounding of each predicate in the MLN
One feature for each grounding of each formula F in the MLN, with the corresponding weight w
10
Ground Markov Network
Student(STAN)
Professor(PEDRO)
AdvisedBy(STAN,PEDRO)
Professor(STAN)
Student(PEDRO)
AdvisedBy(PEDRO,STAN)
AdvisedBy(STAN,STAN)
AdvisedBy(PEDRO,PEDRO)
AdvisedBy(S,P) ⇒ Student(S) ∧ Professor(P)    (weight 2.7)
constants: STAN, PEDRO
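A minimal Python sketch (not the authors' code) of the grounding this slide illustrates: one node per ground predicate and one weighted feature per ground formula, over the constants STAN and PEDRO. The data structures and helper names are illustrative assumptions.

```python
from itertools import product

constants = ["STAN", "PEDRO"]

# One node per grounding of each predicate.
nodes = (
    [("Student", (c,)) for c in constants]
    + [("Professor", (c,)) for c in constants]
    + [("AdvisedBy", sp) for sp in product(constants, repeat=2)]
)

def formula(world, s, p):
    """Truth of AdvisedBy(s,p) => Student(s) ^ Professor(p) in a world."""
    return (not world[("AdvisedBy", (s, p))]) or (
        world[("Student", (s,))] and world[("Professor", (p,))]
    )

# One feature per grounding of the formula, each with the slide's weight 2.7.
features = [(2.7, s, p) for s, p in product(constants, repeat=2)]

# Example world: everything false except the two ground predicates shown true.
world = {n: False for n in nodes}
world[("Student", ("STAN",))] = True
world[("Professor", ("PEDRO",))] = True

print(len(nodes), "nodes;", len(features), "ground features")
print("true groundings:", sum(formula(world, s, p) for _, s, p in features))
```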
11–15
MLN Model

P_w(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(x) \Big)

x: vector of value assignments to ground predicates
Z: partition function; sums over all possible value assignments to ground predicates
w_i: weight of ith formula
n_i(x): # of true groundings of ith formula
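To make the pieces above concrete, here is a hedged brute-force sketch that evaluates this distribution on a toy domain by enumerating all worlds; this is only illustrative (real MLN inference never enumerates worlds), and the single lambda stands in for one formula's true-grounding count n_1.

```python
import math
from itertools import product

weights = [2.7]
counts = [lambda x: int((not x[0]) or x[1])]   # one ground implication x0 => x1

worlds = list(product([False, True], repeat=2))

def score(x):
    """Unnormalized log-linear score: sum_i w_i * n_i(x)."""
    return math.exp(sum(w * n(x) for w, n in zip(weights, counts)))

Z = sum(score(x) for x in worlds)   # partition function: sum over all worlds
for x in worlds:
    print(x, score(x) / Z)          # probabilities sum to 1
```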
16–18
MLN Weight Learning
Likelihood is a concave function of the weights
Quasi-Newton methods find the optimal weights, e.g. L-BFGS [Liu & Nocedal, 1989]
SLOW: computing the likelihood (and its gradient) involves #P-complete counting problems
19–20
MLN Weight Learning
R&D used pseudo-likelihood [Besag, 1975]:

PL_w(x) = \prod_l P_w(X_l = x_l \mid MB_x(X_l))

i.e., each ground predicate X_l is conditioned on its Markov blanket, so no partition function is needed
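A hedged Python sketch of this idea (illustrative names, not the authors' code): each conditional needs only a two-point normalization over flipping one ground predicate, so Z never appears. `score(world)` is any unnormalized log-linear scorer such as the one in the sketch above.

```python
import math

def conditional_prob(score, world, l):
    """P(X_l = world[l] | all other ground predicates)."""
    flipped = dict(world)
    flipped[l] = not world[l]
    a = math.exp(score(world))     # unnormalized score of the actual world
    b = math.exp(score(flipped))   # ... and of the world with X_l flipped
    return a / (a + b)

def pseudo_log_likelihood(score, world):
    return sum(math.log(conditional_prob(score, world, l)) for l in world)
```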
21
MLN Structure Learning
R&D "learned" MLN structure in two disjoint steps:
Learn first-order clauses with an off-the-shelf ILP system (CLAUDIEN [De Raedt & Dehaspe, 1997])
Learn clause weights by optimizing pseudo-likelihood
Unlikely to give the best results, because CLAUDIEN:
finds clauses that hold with some accuracy/frequency in the data
doesn't find clauses that maximize the data's (pseudo-)likelihood
22
Overview
Motivation
Background
Structure Learning Algorithm
Experiments
Future Work & Conclusion
23
MLN Structure Learning
This paper develops an algorithm that:
Learns first-order clauses by directly optimizing pseudo-likelihood
Is fast enough to be practical
Performs better than R&D, pure ILP, purely KB and purely probabilistic approaches
24
Structure Learning Algorithm
High-level algorithm:
REPEAT
  MLN ← MLN ∪ FindBestClauses(MLN)
UNTIL FindBestClauses(MLN) returns NULL

FindBestClauses(MLN)
  Create candidate clauses
  FOR EACH candidate clause c
    Compute increase in evaluation measure of adding c to MLN
  RETURN k clauses with greatest increase
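The slide's top-level loop, rendered directly in Python under the assumption that an MLN is represented as a set of weighted clauses; `find_best_clauses` (supplied by the caller) would do the candidate generation and scoring sketched later.

```python
def learn_structure(mln, find_best_clauses):
    while True:
        best = find_best_clauses(mln)   # k clauses with the greatest gain
        if not best:                    # "returns NULL": nothing improves
            return mln
        mln = mln | set(best)           # MLN <- MLN U FindBestClauses(MLN)
```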
25
Structure Learning
Evaluation measure
Clause construction operators
Search strategies
Speedup techniques
26
Evaluation Measure
R&D used pseudo-log-likelihood
This gives undue weight to predicates with a large # of groundings
27–30
Evaluation Measure
Weighted pseudo-log-likelihood (WPLL):

WPLL_w(x) = \sum_{r \in R} c_r \sum_{k=1}^{g_r} \log P_w(X_{r,k} = x_{r,k} \mid MB_x(X_{r,k}))

c_r: weight given to predicate r
the inner sum is over the g_r groundings of predicate r
each log term is a CLL: the conditional log-likelihood of one ground predicate given its Markov blanket
Plus a Gaussian weight prior and a structure prior
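A hedged sketch of the WPLL computation, reusing the two-point conditional from the pseudo-likelihood sketch above. Setting c_r = 1/g_r follows the paper's remedy for predicates with many groundings; `score` and `predicate_of` are illustrative stand-ins.

```python
import math
from collections import defaultdict

def wpll(score, world, predicate_of):
    by_predicate = defaultdict(list)
    for l in world:
        by_predicate[predicate_of(l)].append(l)
    total = 0.0
    for r, groundings in by_predicate.items():
        c_r = 1.0 / len(groundings)          # weight given to predicate r
        for l in groundings:                 # sum over groundings of r
            flipped = dict(world)
            flipped[l] = not world[l]
            a = math.exp(score(world))
            b = math.exp(score(flipped))
            total += c_r * math.log(a / (a + b))   # CLL of one grounding
    return total
```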
31
Clause Construction Operators
Add a literal (negative/positive)
Remove a literal
Flip signs of literals
Limit # of distinct variables to restrict the search space
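One plausible encoding of these operators (an assumption, not the authors' representation): a clause as a frozenset of literals, each literal a (sign, predicate, args) triple.

```python
# e.g. (False, "Student", ("s",)) stands for the negative literal !Student(s)
def add_literal(clause, literal):
    return clause | {literal}

def remove_literal(clause, literal):
    return clause - {literal}

def flip_signs(clause, literals):
    """Flip the signs of a chosen subset of the clause's literals."""
    flipped = {(not sign, pred, args) for (sign, pred, args) in literals}
    return (clause - frozenset(literals)) | flipped

def num_distinct_vars(clause):
    return len({v for (_, _, args) in clause for v in args})

def within_var_limit(clause, max_vars=4):   # restricts the search space
    return num_distinct_vars(clause) <= max_vars
```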
32
Beam Search
Same as that used in ILP & rule induction
Repeatedly find the single best clause
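A generic beam-search sketch consistent with this slide; `children` would apply the clause construction operators, `gain` the WPLL increase, and the beam width is an illustrative parameter, not the paper's setting.

```python
def beam_search(initial, children, gain, beam=5):
    frontier = [initial]
    best = initial
    while frontier:
        candidates = {c for f in frontier for c in children(f)}
        improving = [c for c in candidates if gain(c) > gain(best)]
        frontier = sorted(improving, key=gain, reverse=True)[:beam]
        if frontier:
            best = frontier[0]   # single best clause found so far
    return best
```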
33
Shortest-First Search (SFS)
1. Start from empty or hand-coded MLN
2. FOR L ← 1 TO MAX_LENGTH
3.   Apply each literal addition & deletion to each clause to create clauses of length L
4.   Repeatedly add K best clauses of length L to the MLN until no clause of length L improves WPLL
Similar to Della Pietra et al. (1997), McCallum (2003)
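The same procedure as a hedged Python sketch; `make_length_l` and `wpll_gain` are assumed helpers standing in for step 3 and the WPLL evaluation respectively.

```python
def shortest_first_search(mln, make_length_l, wpll_gain, max_length=4, k=2):
    for L in range(1, max_length + 1):
        while True:
            candidates = make_length_l(mln, L)   # literal additions/deletions
            best = sorted(candidates, key=lambda c: wpll_gain(mln, c),
                          reverse=True)[:k]
            best = [c for c in best if wpll_gain(mln, c) > 0]
            if not best:
                break                  # no length-L clause improves WPLL
            mln = mln | set(best)      # repeatedly add the K best clauses
    return mln
```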
34–37
Speedup Techniques
FindBestClauses(MLN)
  Create candidate clauses            ← NOT THAT FAST (many candidates)
  FOR EACH candidate clause c         ← SLOW (many candidates)
    Compute increase in WPLL (using L-BFGS) of adding c to MLN
                                      ← SLOW (many CLLs; each CLL involves a #P-complete problem)
  RETURN k clauses with greatest increase
38–43
Speedup Techniques
Clause Sampling
Predicate Sampling
Avoid Redundancy
Loose Convergence Thresholds
Ignore Unrelated Clauses
Weight Thresholding
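As an example of the flavor of these techniques, here is a hedged sketch of clause sampling, i.e., estimating a clause's number of true groundings from a uniform subsample instead of exact counting; the function and parameter names are illustrative assumptions, not the paper's interface.

```python
import random

def estimate_true_groundings(groundings, is_true, sample_size=1000):
    if len(groundings) <= sample_size:
        return sum(1 for g in groundings if is_true(g))   # exact count
    sample = random.sample(groundings, sample_size)
    frac_true = sum(1 for g in sample if is_true(g)) / sample_size
    return frac_true * len(groundings)                    # scaled estimate
```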
44
Overview
Motivation
Background
Structure Learning Algorithm
Experiments
Future Work & Conclusion
45
Experiments
UW-CSE domain
22 predicates, e.g., AdvisedBy(X,Y), Student(X), etc.
10 types, e.g., Person, Course, Quarter, etc.
# ground predicates ≈ 4 million
# true ground predicates ≈ 3000
Handcrafted KB with 94 formulas, e.g.:
  Each student has at most one advisor
  If a student is an author of a paper, so is her advisor
Cora domain
Computer science research papers
Collective deduplication of author, venue, title
46–49
Systems
MLN(SLB): structure learning with beam search
MLN(SLS): structure learning with SFS
KB: hand-coded KB    CL: CLAUDIEN    FO: FOIL    AL: Aleph
MLN(KB), MLN(CL), MLN(FO), MLN(AL): clauses from the above with learned MLN weights
NB: Naïve Bayes    BN: Bayesian networks
50
Methodology
UW-CSE domain
DB divided into 5 areas: AI, Graphics, Languages, Systems, Theory
Leave-one-out testing by area
Measured:
  average CLL of the ground predicates
  average area under the precision-recall curve of the ground predicates (AUC)
51–54
UW-CSE results (CLL and AUC bar charts; higher is better for both):

System      CLL      AUC
MLN(SLS)   -0.061   0.533
MLN(SLB)   -0.088   0.472
MLN(CL)    -0.151   0.306
MLN(FO)    -0.208   0.140
MLN(AL)    -0.223   0.148
MLN(KB)    -0.142   0.429
CL         -0.574   0.170
FO         -0.661   0.131
AL         -0.579   0.117
KB         -0.812   0.266
55
UW-CSE results vs. purely probabilistic learners (CLL and AUC bar charts):

System      CLL      AUC
MLN(SLS)   -0.061   0.533
MLN(SLB)   -0.088   0.472
NB         -0.370   0.390
BN         -0.166   0.397
56
Timing
MLN(SLS) on UW-CSE
Cluster of 15 dual-CPU 2.8 GHz Pentium 4 machines
Without speedups: did not finish in 24 hrs
With speedups: 5.3 hrs
57
Lesion Study
Disable one speedup technique at a time; SFS; UW-CSE (one fold); runtime in hours:

all speedups                     4.0
no clause sampling              21.6
no predicate sampling            8.4
don't avoid redundancy           6.5
no loose convergence threshold   4.1
no weight thresholding          24.8
58
Overview
Motivation
Background
Structure Learning Algorithm
Experiments
Future Work & Conclusion
59
Future Work
Speed up counting of # true groundings of clauses
Probabilistically bound the loss in accuracy due to subsampling
Probabilistic predicate discovery
60
Conclusion
Markov logic networks: a powerful combination of first-order logic and probability
Richardson & Domingos (2004) did not learn MLN structure
We develop an algorithm that automatically learns both first-order clauses and their weights
We develop speedup techniques to make our algorithm fast enough to be practical
We show experimentally that our algorithm outperforms:
  Richardson & Domingos
  pure ILP
  purely KB approaches
  purely probabilistic approaches
(For software, email: koks@cs.washington.edu)