First-Order Probabilistic Models for Coreference Resolution
Aron Culotta
Computer Science Department
University of Massachusetts Amherst
Joint work with Andrew McCallum [advisor], Michael Wick, Robert Hall
Previous work: Conditional Random Fields for Coreference
A Pairwise Conditional Random Field for Coreference (PW-CRF)
[McCallum & Wellner, 2003, ICML]

[Figure: three mentions x1 ". . . Mr Powell . . .", x2 ". . . Powell . . .", and x3 ". . . she . . .", linked by pairwise coreference variables y with edge weights 45, 30, and 11. Is Coreferent(x2, x3)?]

P(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \right)

Pairwise compatibility scores are learned from training data. Hard transitivity constraints are enforced by the prediction algorithm.
Prediction in PW-CRFs = Graph Partitioning
[Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

[Figure: the same three mentions x1, x2, x3 with pairwise edge weights 45, 30, and 11, split into partitions.]

\log P(y \mid x) \propto \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) = \sum_{i,j \,\text{within partitions}} w_{ij} - \sum_{i,j \,\text{across partitions}} w_{ij} = 64

Often approximated with agglomerative clustering.
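The agglomerative approximation mentioned above can be sketched as follows. This is an illustrative toy, not the authors' implementation; the function name and the positive-weight stopping rule are my assumptions. It repeatedly merges the pair of clusters with the highest total cross-cluster edge weight until no merge has positive weight.

```python
# Greedy agglomerative approximation to weighted graph partitioning.
# weight(a, b) returns the pairwise compatibility score for mentions a, b.

def greedy_agglomerative(mentions, weight):
    clusters = [{m} for m in mentions]          # start with singletons
    while True:
        best_score, best_pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # total weight of edges crossing the two clusters
                s = sum(weight(a, b) for a in clusters[i] for b in clusters[j])
                if s > best_score:
                    best_score, best_pair = s, (i, j)
        if best_pair is None:                   # no positive-weight merge left
            return clusters
        i, j = best_pair
        clusters[i] |= clusters.pop(j)          # merge j into i
```

With the slide's toy weights (a strong positive edge between the two Powell mentions, negative edges to the pronoun), this clusters the Powell mentions together and leaves the pronoun alone.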
Parameter Estimation in PW-CRFs

• Given labeled documents, generate all pairs of mentions
  – Optionally prune distant mention pairs [Soon, Ng, Lim 2001]
• Learn a binary classifier to predict coreference
• Edge weights are proportional to classifier output
Sometimes pairwise comparisons are insufficient

• Entities have multiple attributes (name, email, institution, location); we need to measure "compatibility" among them.
• Having 2 "given names" is common, but not 4.
  – e.g. Howard M. Dean / Martin, Dean / Howard Martin
• Need to measure the size of the clusters of mentions.
• Is there a pair of name strings whose edit distance is > 0.5?
• Maximum distance between mentions in a document
• Does an entity contain only pronoun mentions?

We need measures on hypothesized "entities"; we need first-order logic.
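A hypothetical sketch of such entity-level measures (the helper names and the pronoun list are mine, not from the slides): aggregate a pairwise feature over a hypothesized cluster, and ask existential/universal questions of the whole set.

```python
# Cluster-level features: aggregates and quantifications of pairwise
# predicates over a hypothesized entity (a set of mentions).
from itertools import combinations

def cluster_features(mentions, edit_distance, gender):
    pairs = list(combinations(mentions, 2))
    dists = [edit_distance(a, b) for a, b in pairs]
    return {
        "max_edit_distance": max(dists) if dists else 0.0,
        "avg_edit_distance": sum(dists) / len(dists) if dists else 0.0,
        # existential quantification: does any pair disagree on gender?
        "exists_gender_disagreement": any(gender(a) != gender(b)
                                          for a, b in pairs),
        # universal quantification: is every mention a pronoun?
        "all_pronouns": all(m.lower() in {"he", "she", "it", "they"}
                            for m in mentions),
    }
```

Features like these are defined on the entity as a whole, so no pairwise model can express them.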
First-Order Logic CRFs for Coreference (FOL-CRF)

[Figure: mentions x1 ". . . Mr Powell . . .", x2 ". . . Powell . . .", and x3 ". . . she . . ." governed by a single cluster-level variable y with score 56. Is Coreferent(x1, x2, x3)?]

P(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{X_i \in \mathcal{P}(x)} \sum_l \lambda_l f_l(X_i, y_i) \right)

Clusterwise compatibility scores are learned from training data; features are arbitrary FOL predicates over a set of mentions. As in the PW-CRF, prediction can be approximated with agglomerative clustering.
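The score inside the exponential can be sketched as a sum over the hypothesized clusters; the function and feature names below are illustrative, not from the slides.

```python
# Unnormalized FOL-CRF score of a clustering: for each cluster X_i in
# the partition, sum the weighted cluster-level features λ_l · f_l(X_i).

def clustering_score(partition, weights, cluster_features):
    """partition: list of mention sets; cluster_features(cluster) -> {name: value}."""
    return sum(weights.get(name, 0.0) * value
               for cluster in partition
               for name, value in cluster_features(cluster).items())
```

An agglomerative search can use this score to decide which merge to make next.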
Learning Parameters of FOL-CRFs

• Generate classification examples where the input is a set of mentions
• Unlike the pairwise CRF, we cannot generate all possible examples from the training data
Learning Parameters of FOL-CRFs

Mentions: He, Powell, Rice, She, he, Secretary, . . .

Coreferent(x1, x2)?
Coreferent(x1, x2, x3)?
Coreferent(x1, x2, x3, x4)?
Coreferent(x1, x2, x3, x4, x5)?
Coreferent(x1, x2, x3, x4, x5, x6)?
. . .

Combinatorial explosion!
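The explosion is easy to quantify: with n mentions there are 2^n - n - 1 candidate mention sets of size at least two, so even modest documents put exhaustive enumeration out of reach.

```python
# Count the Coreferent(...) candidates of size >= 2 over n mentions:
# every subset except the empty set and the n singletons.
from math import comb

def num_candidate_clusters(n):
    return sum(comb(n, k) for k in range(2, n + 1))

print(num_candidate_clusters(6))   # 57
print(num_candidate_clusters(50))  # over 10^15
```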
This space complexity is common in probabilistic first-order logic
[Gaifman 1964; Halpern 1990; Paskin 2002; Poole 2003; Richardson & Domingos 2006]
Training in Probabilistic FOL: Parameter Estimation (Weight Learning)

• Input
  – First-order formulae, e.g. ∀x S(x) ⇒ T(x)
  – Labeled data: constants a, b, c; facts S(a), T(a), S(b), T(b), S(c)
• Output
  – A weight for each formula:
    ∀x S(x) ⇒ T(x) [0.67]
    ∀x,y Coreferent(x,y) ⇒ Pronoun(x) [-2.3]
Training in Probabilistic FOL: Previous Work

• Maximum likelihood
  – Requires an intractable normalization constant
• Pseudo-likelihood [Richardson & Domingos 2006]
  – Ignores uncertainty of relational information
• EM [Kersting & De Raedt 2001; Koller & Pfeffer 1997]
• Sampling [Paskin 2002]
• Perceptron [Singla & Domingos 2005]
  – Can be inefficient when prediction is expensive
• Piecewise training [Sutton & McCallum 2005]
  – Trains "pieces" of the world in isolation
  – Performance is sensitive to which pieces are chosen
Training in Probabilistic FOL: Parameter Estimation (Weight Learning)

• Most methods require "unrolling" (grounding)
• Unrolling has exponential space complexity
  – E.g., grounding ∀x,y,z S(x,y,z) ⇒ T(x,y,z) over the constants {a, b, c, d, e, f, g, h} requires examining all triples
• Sampling can be inefficient due to the large sample space
• Proposal: let prediction errors guide sampling
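A minimal illustration of the grounding cost: the quantified formula above yields one ground clause per triple of constants, 8^3 = 512 in this example, and the count grows exponentially with the arity.

```python
# Enumerate every grounding of  forall x,y,z: S(x,y,z) => T(x,y,z)
# over eight constants; there is one ground clause per (x, y, z) triple.
from itertools import product

constants = ["a", "b", "c", "d", "e", "f", "g", "h"]
groundings = [f"S({x},{y},{z}) => T({x},{y},{z})"
              for x, y, z in product(constants, repeat=3)]
print(len(groundings))  # 512
```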
Error-driven Training

• Input
  – Observed data X // input mentions
  – True labeling P // true clustering
  – Prediction algorithm A // clustering algorithm
  – Initial weights W, initial prediction Q // initial clustering
• Iterate until convergence
  – Q′ ← A(Q, W, X) // merge clusters
  – If Q′ introduces an error: UpdateWeights(Q, Q′, P, X, W)
  – Else: Q ← Q′
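The loop above can be sketched as follows; A, has_error, and update_weights are placeholders for the prediction algorithm, the error check against the truth P, and the ranking update. The singleton initialization and the max_iters cap are my assumptions, not from the slides.

```python
# Error-driven training: run the clustering algorithm one step at a
# time, and update the weights only when a proposed step is wrong.

def error_driven_train(X, P, A, W, has_error, update_weights, max_iters=100):
    Q = [{m} for m in X]                 # initial clustering: singletons
    for _ in range(max_iters):
        Q_next = A(Q, W, X)              # propose one merge
        if Q_next is None:               # no merge proposed: converged
            break
        if has_error(Q_next, P):         # proposal contradicts the truth
            update_weights(Q, Q_next, P, X, W)
        else:
            Q = Q_next                   # accept the correct merge
    return Q, W
```

Because only states the predictor actually reaches are examined, the full network is never unrolled.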
UpdateWeights(Q, Q′, P, X, W): Learning to Rank Pairs of Predictions

• Using the truth P, generate a new Q′′ that is a better modification of Q than Q′
• Update W s.t. Q′′ = A(Q, W, X)
• Update parameters so that Q′′ is ranked higher than Q′
Ranking vs. Classification Training

• Instead of training
  [Powell, Mr. Powell, he] --> YES
  [Powell, Mr. Powell, she] --> NO
• ...rather...
  [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
• In general, the higher-ranked example may contain errors
  [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]
Ranking Parameter Update

In our experiments, we use a large-margin update based on MIRA [Crammer & Singer 2003]:

W_{t+1} = \arg\min_W \|W_t - W\| \quad \text{s.t.} \quad \mathrm{Score}(Q'', W) - \mathrm{Score}(Q', W) \geq 1
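For a single ranking constraint this quadratic program has a closed-form solution. The sketch below assumes linear scores over explicit feature vectors; the function and variable names are mine, not from the slides.

```python
# MIRA-style update for one constraint: the smallest change to W that
# scores the better prediction (phi_good) at least `margin` above the
# worse one (phi_bad). Closed form: W += tau * (phi_good - phi_bad).

def mira_update(W, phi_good, phi_bad, margin=1.0):
    diff = [g - b for g, b in zip(phi_good, phi_bad)]
    score_gap = sum(w * d for w, d in zip(W, diff))
    norm_sq = sum(d * d for d in diff)
    if norm_sq == 0.0:                       # identical features: no-op
        return W
    tau = max(0.0, (margin - score_gap) / norm_sq)
    return [w + tau * d for w, d in zip(W, diff)]
```

If the constraint already holds, tau is zero and the weights are unchanged; otherwise the update is the minimal one that satisfies it exactly.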
Advantages

• Never need to unroll the entire network
  – Only explore partial solutions the prediction algorithm is likely to produce
• Weights are tuned for the prediction algorithm
• Adaptable to different prediction algorithms
  – beam search, simulated annealing, etc.
• Adaptable to different loss functions

Related work:
• Incremental Perceptron [Collins & Roark 2004]
• LaSO [Daumé & Marcu 2005]
Extended here for FOL, ranking, and max-margin loss; ranks partial, possibly mistaken predictions.
Disadvantages

• Difficult to analyze exactly what global objective function is being optimized
• Convergence issues
  – Average weight updates
Experiments

• ACE 2004 coreference
  – 443 newswire documents
• Standard feature set [Soon, Ng, Lim 2001; Ng & Cardie 2002]
  – Text match, gender, number, context, WordNet
• Additional first-order features
  – Min/max/average/majority of pairwise features
    • E.g., average string edit distance, max document distance
  – Existential/universal quantifications of pairwise features
    • E.g., there exists a gender disagreement
• Prediction: greedy agglomerative clustering
Experiments

B-Cubed F1 score on ACE 2004 noun coreference:

|         | Sampling + Classification | Error-driven + Ranking |
|---------|---------------------------|------------------------|
| FOL-CRF | 69.2                      | 79.3                   |
| PW-CRF  | 62.4                      | 72.5                   |

FOL-CRF over PW-CRF: better representation. Error-driven + ranking over sampling + classification: better training.

[To our knowledge, the best previously reported result is ~69% (Ng, 2005).]
Conclusions

Combining logical and probabilistic approaches to AI can improve the state of the art in NLP.

Simple approximations can make these approaches practical for real-world problems.
Future Work

• Fancier features
  – Over entire clusterings
• Less greedy inference
  – Metropolis-Hastings sampling
• Analysis of training
  – Which positive/negative examples to select when updating
  – Loss functions sensitive to local minima of prediction
• Analyze theoretical/empirical convergence
Thank you