Markov Logic: A Representation Language for Natural Language Semantics
Markov Logic: A Representation Language for Natural Language Semantics
Pedro Domingos, Dept. of Computer Science & Engineering, University of Washington
(Based on joint work with Stanley Kok, Matt Richardson, and Parag Singla)
Overview
Motivation, Background, Representation, Inference, Learning, Applications, Discussion
Motivation
Natural language is characterized by:
- Complex relational structure
- High uncertainty (ambiguity, imperfect knowledge)
First-order logic handles relational structure; probability handles uncertainty. Let's combine the two.
Markov Logic [Richardson & Domingos, 2006]
- Syntax: first-order logic + weights
- Semantics: templates for Markov nets
- Inference: weighted satisfiability + MCMC
- Learning: voted perceptron + ILP
Overview
Motivation, Background, Representation, Inference, Learning, Applications, Discussion
Markov Networks
Undirected graphical models, with potential functions Φ_c defined over the cliques c:

P(X) = (1/Z) ∏_c Φ_c(X_c),    Z = Σ_X ∏_c Φ_c(X_c)

[Figure: example network over nodes A, B, C, D, with potentials such as
Φ(A,B) = 3.7 if A ∧ B; 2.1 if A ∧ ¬B; 0.7 otherwise
Φ(B,C,D) = 2.3 if B ∧ C ∧ D; 5.1 otherwise]
Markov Networks
Undirected graphical models. In log-linear form, each clique potential becomes a weighted binary feature, where w_i is the weight of feature i and f_i is feature i:

P(X) = (1/Z) exp( Σ_i w_i f_i(X) ),    Z = Σ_X exp( Σ_i w_i f_i(X) )

[Figure: same network over A, B, C, D, with features such as
f(A,B) = 1 if A ∧ B, 0 otherwise
f(B,C,D) = 1 if B ∧ C ∧ D, 0 otherwise]
First-Order Logic
- Constants, variables, functions, predicates. E.g.: Anna, X, mother_of(X), friends(X, Y)
- Grounding: replace all variables by constants. E.g.: friends(Anna, Bob)
- World (model, interpretation): assignment of truth values to all ground predicates
Overview
Motivation, Background, Representation, Inference, Learning, Applications, Discussion
Markov Logic Networks
A logical KB is a set of hard constraints on the set of possible worlds.
Let's make them soft constraints: when a world violates a formula, it becomes less probable, not impossible.
Give each formula a weight (higher weight ⇒ stronger constraint):

P(world) ∝ exp( Σ weights of formulas it satisfies )
Definition
A Markov Logic Network (MLN) is a set of pairs (F, w), where
- F is a formula in first-order logic
- w is a real number
Together with a set of constants, it defines a Markov network with:
- One node for each grounding of each predicate in the MLN
- One feature for each grounding of each formula F in the MLN, with the corresponding weight w
Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Suppose we have two constants: Anna (A) and Bob (B).
[Figure: ground network with nodes Smokes(A), Smokes(B), Cancer(A), Cancer(B)]
Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Suppose we have two constants: Anna (A) and Bob (B).
[Figure: full ground network, adding Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B) and the edges induced by the ground formulas]
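As a concrete sketch, grounding the two formulas over the constants Anna and Bob gives 2 + 4 = 6 ground formulas, and a world's unnormalized log-probability is the sum of the weights of the ground formulas it satisfies. This is an illustrative hand-grounding (assuming weights of 1.5 for the smoking-cancer rule and 1.1 for the friends rule), not the Alchemy implementation.

```python
CONSTANTS = ["Anna", "Bob"]
W_SMOKES_CANCER = 1.5  # weight of Smokes(x) => Cancer(x)
W_FRIENDS = 1.1        # weight of Friends(x,y) => (Smokes(x) <=> Smokes(y))

def world_score(smokes, cancer, friends):
    """Sum of weights of satisfied ground formulas (unnormalized log-prob).
    smokes/cancer: dict person -> bool; friends: dict (p, q) -> bool."""
    total = 0.0
    for x in CONSTANTS:  # groundings of Smokes(x) => Cancer(x)
        if (not smokes[x]) or cancer[x]:
            total += W_SMOKES_CANCER
    for x in CONSTANTS:  # groundings of Friends(x,y) => (Smokes(x) <=> Smokes(y))
        for y in CONSTANTS:
            if (not friends[(x, y)]) or (smokes[x] == smokes[y]):
                total += W_FRIENDS
    return total

# A world where Anna smokes, Bob doesn't, and everyone is friends violates
# one grounding of the cancer rule and two of the friends rule, so it
# scores lower than a world satisfying every ground formula.
friends_all = {(x, y): True for x in CONSTANTS for y in CONSTANTS}
inconsistent = world_score({"Anna": True, "Bob": False},
                           {"Anna": False, "Bob": False}, friends_all)
consistent = world_score({"Anna": True, "Bob": True},
                         {"Anna": True, "Bob": True}, friends_all)
```

Violating a ground formula costs its weight but does not zero out the world's probability, which is exactly the "soft constraint" reading of the KB.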
More on MLNs
- An MLN is a template for ground Markov nets
- Typed variables and constants greatly reduce the size of the ground Markov net
- Functions, existential quantifiers, etc.
- An MLN without variables = a Markov network (subsumes graphical models)
Relation to First-Order Logic
- Infinite weights ⇒ first-order logic
- Satisfiable KB with positive weights ⇒ satisfying assignments = modes of the distribution
- MLNs allow contradictions between formulas
Overview
Motivation, Background, Representation, Inference, Learning, Applications, Discussion
MPE/MAP Inference
Find the most likely truth values of non-evidence ground atoms given the evidence.
Apply a weighted satisfiability solver (maximizes the sum of weights of satisfied clauses).
MaxWalkSat algorithm [Kautz et al., 1997]:
- Start with a random truth assignment
- With probability p, flip the atom that maximizes the weight sum; else flip a random atom in an unsatisfied clause
- Repeat n times
- Restart m times
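The steps above can be sketched as follows. The clause encoding (a weight plus a list of (atom, required-value) literals) is an assumption for illustration, and restarts are omitted to keep the sketch short.

```python
import random

def maxwalksat(clauses, n_atoms, p=0.5, max_flips=1000, seed=0):
    """Sketch of MaxWalkSat. Each clause is (weight, literals), where a
    literal (i, v) is satisfied when atom i has truth value v; a clause
    is satisfied when any of its literals is."""
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(n_atoms)]

    def sat(lits):
        return any(assign[i] == v for i, v in lits)

    def total():  # sum of weights of satisfied clauses
        return sum(w for w, lits in clauses if sat(lits))

    best_weight, best_assign = total(), assign[:]
    for _ in range(max_flips):
        unsat = [lits for w, lits in clauses if not sat(lits)]
        if not unsat:
            break  # every clause satisfied: cannot do better
        lits = rng.choice(unsat)
        if rng.random() < p:
            # greedily flip the atom in the clause that maximizes total weight
            def weight_if_flipped(j):
                assign[j] = not assign[j]
                w = total()
                assign[j] = not assign[j]
                return w
            i = max((j for j, _ in lits), key=weight_if_flipped)
        else:
            i = rng.choice(lits)[0]  # random walk: flip a random atom
        assign[i] = not assign[i]
        if total() > best_weight:
            best_weight, best_assign = total(), assign[:]
    return best_assign, best_weight
```

For example, with clauses [(2.0, [(0, True)]), (1.0, [(1, False)])] the sketch finds the assignment [True, False] with total satisfied weight 3.0.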
Conditional Inference
P(Formula | MLN, C) = ?  MCMC: sample worlds, check whether the formula holds.
P(Formula1 | Formula2, MLN, C) = ?  If Formula2 is a conjunction of ground atoms:
- First construct the minimal subset of the network necessary to answer the query (a generalization of KBMC)
- Then apply MCMC (or another inference method)
Ground Network Construction
- Initialize the Markov net to contain all query predicates
- For each node in the network, add the node's Markov blanket to the network; remove any evidence nodes
- Repeat until done
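A minimal sketch of this construction, assuming the Markov blanket relation is given as a precomputed adjacency map (in an MLN it would be derived from shared ground clauses). Rather than adding evidence nodes and then removing them, this variant simply never adds them, which is equivalent for deciding which non-evidence atoms to instantiate.

```python
def build_ground_network(query_atoms, markov_blanket, evidence):
    """Expand outward from the query atoms, adding each node's Markov
    blanket, but stop at evidence nodes: they are conditioned on, so the
    network need not grow past them."""
    network = set(query_atoms)
    frontier = list(query_atoms)
    while frontier:
        atom = frontier.pop()
        for neighbor in markov_blanket.get(atom, ()):
            if neighbor in evidence or neighbor in network:
                continue
            network.add(neighbor)
            frontier.append(neighbor)
    return network
```

For a chain a - b - c - d with evidence {c}, a query on {a} only needs {a, b}: the evidence node c screens off d.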
Probabilistic Inference Recall
Exact inference is #P-complete. Conditioning on the Markov blanket is easy, and Gibbs sampling exploits this:

P(X) = (1/Z) exp( Σ_i w_i f_i(X) ),    Z = Σ_X exp( Σ_i w_i f_i(X) )

P(x | MB(x)) = exp( Σ_i w_i f_i(x) ) / ( exp( Σ_i w_i f_i(x=0) ) + exp( Σ_i w_i f_i(x=1) ) )
Markov Chain Monte Carlo
Gibbs sampler:
1. Start with an initial assignment to the nodes
2. One node at a time, sample the node given the others
3. Repeat
4. Use the samples to compute P(X)
Apply to the ground network. Initialization: MaxWalkSat. Can use multiple chains.
Overview
Motivation, Background, Representation, Inference, Learning, Applications, Discussion
Learning
- Data is a relational database
- Closed world assumption (if not: EM)
- Learning parameters (weights): generatively via pseudo-likelihood; discriminatively via voted perceptron + MaxWalkSat
- Learning structure: a generalization of feature induction in Markov nets; learn and/or modify clauses; inductive logic programming with pseudo-likelihood as the objective function
Generative Weight Learning
Maximize likelihood (or posterior) using gradient ascent:

∂/∂w_i log P(X) = f_i(X) − E_Y[ f_i(Y) ]

i.e., the feature count according to the data minus the feature count according to the model. Requires inference at each step (slow!).
Pseudo-Likelihood [Besag, 1975]
Likelihood of each variable given its Markov blanket in the data:

PL(X) = ∏_x P(x | MB(x))

Does not require inference at each step; widely used.

Optimization

∂/∂w_i log PL(X) = Σ_x [ nsat_i(x) − P(x=0 | MB(x)) · nsat_i(x=0) − P(x=1 | MB(x)) · nsat_i(x=1) ]

where nsat_i(x=v) is the number of satisfied groundings of clause i in the training data when x takes value v.
- Most terms are not affected by changes in the weights; after initial setup, each iteration takes O(# ground predicates × # first-order clauses)
- Parameters are tied over groundings of the same clause
- Maximize using L-BFGS [Liu & Nocedal, 1989]
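The pseudo-likelihood itself is cheap to compute: each factor P(x | MB(x)) only compares the scores of the two states that differ in x. Below is a sketch for a generic log-linear model (not the optimized clause-counting version the slide describes):

```python
import math

def log_pseudo_likelihood(features, weights, x):
    """log PL(X) = sum_j log P(x_j | MB(x_j)). Each conditional needs only
    the full scores with x_j = 0 and x_j = 1, all other variables fixed."""
    def score(state):
        return sum(w * f(state) for w, f in zip(weights, features))
    total = 0.0
    for j in range(len(x)):
        x0, x1 = list(x), list(x)
        x0[j], x1[j] = False, True
        s0, s1 = score(x0), score(x1)
        s_actual = s1 if x[j] else s0
        # log P(x_j | rest) = s_actual - log(exp(s0) + exp(s1))
        total += s_actual - math.log(math.exp(s0) + math.exp(s1))
    return total
```

Because no partition function over all worlds appears, the gradient of this objective can be evaluated without inference, which is the whole appeal.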
Discriminative Weight Learning
Gradient of the conditional log-likelihood:

∂/∂w_i log P(y | x) = n_i(x, y) − E_w[ n_i(x, y) ]

n_i(x, y) is the # of true groundings of formula i in the DB. Computing the expected # of true groundings is slow, so approximate the expected count by the MAP count.
Voted Perceptron [Collins, 2002]
Used for discriminative training of HMMs:
- Expected count in the gradient approximated by the count in the MAP state
- MAP state found using the Viterbi algorithm
- Weights averaged over all iterations

initialize w_i = 0
for t = 1 to T do
    find the MAP configuration using Viterbi
    w_i ← w_i + η · (training count − MAP count)
end for
Voted Perceptron for MLNs [Singla & Domingos, 2004]
An HMM is a special case of an MLN:
- Expected count in the gradient approximated by the count in the MAP state
- MAP state found using MaxWalkSat
- Weights averaged over all iterations

initialize w_i = 0
for t = 1 to T do
    find the MAP configuration using MaxWalkSat
    w_i ← w_i + η · (training count − MAP count)
end for
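The update loop above can be sketched as below. Here map_counts_fn is a stand-in for the MaxWalkSat (or Viterbi) call that returns clause counts in the MAP state under the current weights, and eta is a hypothetical learning rate.

```python
def voted_perceptron(training_counts, map_counts_fn, n_weights, eta=0.1, T=100):
    """Sketch of the voted perceptron for weight learning.
    training_counts[i]: # true groundings of clause i in the data.
    map_counts_fn(w): clause counts in the MAP state under weights w
    (in the real algorithm this calls MaxWalkSat). Returns the weights
    averaged over all iterations, as the slide describes."""
    w = [0.0] * n_weights
    w_sum = [0.0] * n_weights
    for _ in range(T):
        map_counts = map_counts_fn(w)
        for i in range(n_weights):
            # gradient step: training count minus MAP-state count
            w[i] += eta * (training_counts[i] - map_counts[i])
            w_sum[i] += w[i]
    return [s / T for s in w_sum]
```

Averaging the weights over iterations (rather than returning the final vector) is what makes the perceptron "voted", and damps the oscillation the MAP approximation can cause.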
Overview
Motivation, Background, Representation, Inference, Learning, Applications, Discussion
Applications to Date
- Entity resolution (Cora, BibServ)
- Information extraction for biology (won the LLL-2005 competition)
- Probabilistic Cyc
- Link prediction
- Topic propagation in scientific communities
- Etc.
Entity Resolution
Most logical systems make the unique names assumption. What if we don't? Equality predicate: Same(A,B), or A = B. Equality axioms:
- Reflexivity, symmetry, transitivity
- For every unary predicate P: x1 = x2 ⇒ (P(x1) ⇔ P(x2))
- For every binary predicate R: x1 = x2 ∧ y1 = y2 ⇒ (R(x1,y1) ⇔ R(x2,y2))
- Etc.
But in Markov logic these are soft and learnable. Can also introduce the reverse direction:
R(x1,y1) ∧ R(x2,y2) ∧ x1 = x2 ⇒ y1 = y2
Surprisingly, this is all that's needed.
Example: Citation Matching
Markov Logic Formulation: Predicates
- Are two bibliography records the same? SameBib(b1,b2)
- Are two field values the same? SameAuthor(a1,a2), SameTitle(t1,t2), SameVenue(v1,v2)
- How similar are two field strings? Predicates for ranges of cosine TF-IDF score:
  TitleTFIDF.0(t1,t2) is true iff TF-IDF(t1,t2) = 0
  TitleTFIDF.2(t1,t2) is true iff 0 < TF-IDF(t1,t2) < 0.2
  Etc.
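A hedged sketch of how such similarity predicates might be computed: cosine similarity over TF-IDF vectors, binned into range predicates. The binning convention (naming bins by their upper edge, as in TitleTFIDF.2) is inferred from the slide, and the idf table is assumed precomputed from the corpus.

```python
import math
from collections import Counter

def tfidf_cosine(s1, s2, idf):
    """Cosine similarity between the TF-IDF vectors of two field strings.
    `idf` maps word -> inverse document frequency (assumed precomputed)."""
    def vec(s):
        tf = Counter(s.lower().split())
        return {w: c * idf.get(w, 0.0) for w, c in tf.items()}
    v1, v2 = vec(s1), vec(s2)
    dot = sum(v1[w] * v2.get(w, 0.0) for w in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def tfidf_bin(score, step=0.2):
    """Map a score to a range-predicate label: 0 -> '.0',
    (0, 0.2) -> '.2', [0.2, 0.4) -> '.4', and so on."""
    if score == 0.0:
        return "TitleTFIDF.0"
    upper = min(int(score / step) + 1, round(1 / step))
    return f"TitleTFIDF.{upper * 2}"
```

Each evidence atom in the ground network is then one of these discretized predicates rather than a raw real-valued score.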
Markov Logic Formulation: Formulas
- Unit clauses (defaults): ¬SameBib(b1,b2)
- Two fields are the same ⇒ corresponding bib. records are the same:
  Author(b1,a1) ∧ Author(b2,a2) ∧ SameAuthor(a1,a2) ⇒ SameBib(b1,b2)
- Two bib. records are the same ⇒ corresponding fields are the same:
  Author(b1,a1) ∧ Author(b2,a2) ∧ SameBib(b1,b2) ⇒ SameAuthor(a1,a2)
- High similarity score ⇒ two fields are the same:
  TitleTFIDF.8(t1,t2) ⇒ SameTitle(t1,t2)
- Transitive closure (not incorporated in the experiments):
  SameBib(b1,b2) ∧ SameBib(b2,b3) ⇒ SameBib(b1,b3)
In total: 25 predicates, 46 first-order clauses.
What Does This Buy You?
- Objects are matched collectively
- Multiple types matched simultaneously
- Constraints are soft, and strengths can be learned from data
- Easy to add further knowledge
- Constraints can be refined from data
- Standard approach still embedded
Example: Subset of a Bibliography Database

Record | Title                            | Author          | Venue
B1     | Object Identification using CRFs | Linda Stewart   | PKDD 04
B2     | Object Identification using CRFs | Linda Stewart   | 8th PKDD
B3     | Learning Boolean Formulas        | Bill Johnson    | PKDD 04
B4     | Learning of Boolean Formulas     | William Johnson | 8th PKDD
Standard Approach [Fellegi & Sunter, 1969]
[Figure: each record-match node (b1=b2?, b3=b4?) is connected only to its own field-similarity evidence nodes for Author, Title, and Venue, e.g. Sim(Linda Stewart, Linda Stewart), Sim(Object Identification using CRFs, Object Identification using CRFs), Sim(PKDD 04, 8th PKDD), Sim(Bill Johnson, William Johnson), Sim(Learning Boolean Formulas, Learning of Boolean Formulas)]
What's Missing?
[Figure: the same network as above, with the two separate copies of Sim(PKDD 04, 8th PKDD) highlighted]
If from b1=b2 you infer that "PKDD 04" is the same as "8th PKDD", how can you use that to help figure out whether b3=b4?
Merging the Evidence Nodes
[Figure: the two identical venue-similarity evidence nodes Sim(PKDD 04, 8th PKDD) are merged into a single node shared by b1=b2? and b3=b4?]
Still does not solve the problem. Why?
Introducing Field-Match Nodes
[Figure: full representation in the Collective Model. Field-match nodes (b1.A=b2.A?, b1.T=b2.T?, b1.V=b2.V?, b3.A=b4.A?, b3.T=b4.T?, b3.V=b4.V?) sit between the record-match nodes (b1=b2?, b3=b4?) and the field-similarity evidence nodes; the shared venue node Sim(PKDD 04, 8th PKDD) now connects to both b1.V=b2.V? and b3.V=b4.V?]
Flow of Information
[Figure: animation over the collective model showing inference about b1=b2 flowing through the venue field-match nodes and the shared Sim(PKDD 04, 8th PKDD) node to influence b3=b4]
Experiments
Databases:
- Cora [McCallum et al., IRJ, 2000]: 1295 records, 132 papers
- BibServ.org [Richardson & Domingos, ISWC-03]: 21,805 records, unknown # of papers
Goal: de-duplicate bib. records, authors, and venues.
Pre-processing: form canopies [McCallum et al., KDD-00].
Compared with naïve Bayes (the standard method), etc. Measured area under the precision-recall curve (AUC). Our approach wins across the board.
Results: Matching Venues on Cora
[Bar chart of AUC by system: MLN(VP): 0.771, MLN(ML): 0.061, MLN(PL): 0.342, KB: 0.096, CL: 0.047, NB: 0.339, BN: 0.339]
Overview
Motivation, Background, Representation, Inference, Learning, Applications, Discussion
Relation to Other Approaches

Representation | Logical language    | Probabilistic language
Markov logic   | First-order logic   | Markov nets
RMNs           | Conjunctive queries | Markov nets
PRMs           | Frame systems       | Bayes nets
KBMC           | Horn clauses        | Bayes nets
SLPs           | Horn clauses        | Bayes nets
Going Further
- First-order logic is not enough
- We can "Markovize" other representations in the same way
- Lots to do
Summary
- NLP involves relational structure and uncertainty
- Markov logic combines first-order logic and probabilistic graphical models
- Syntax: first-order logic + weights
- Semantics: templates for Markov networks
- Inference: MaxWalkSat + KBMC + MCMC
- Learning: voted perceptron + PL + ILP
- Applications to date: entity resolution, IE, etc.
- Software: Alchemy, http://www.cs.washington.edu/ai/alchemy