Graph-based Analysis of Semantic Drift in Espresso-like Bootstrapping Algorithms
description
Transcript of Graph-based Analysis of Semantic Drift in Espresso-like Bootstrapping Algorithms
Graph-based Analysis of Semantic Drift in
Espresso-like Bootstrapping Algorithms
EMNLP 2008
23/04/22
Mamoru Komachi†, Taku Kudo‡, Masashi Shimbo† and Yuji Matsumoto†
†Nara Institute of Science and Technology, Japan‡Google Inc.
04/22/232
Bootstrapping and semantic drift Bootstrapping grows small amount of seed instances
Buy # at Apple Store
Generic patterns= Patterns which co-occur with many (irrelevant) instances
Seed instance Extracted instance
iPhone
iPod # for sale
Co-occurrence pattern
iPod
MacBook Air
car
house
04/22/233
Main contributions of this work
3
1. Suggest a parallel between semantic drift in Espresso-like bootstrapping and topic drift in HITS (Kleinberg, 1999)
2. Solve semantic drift by graph kernels used in link analysis community
Espresso Algorithm [Pantel and Pennacchiotti, 2006]
Repeat Pattern extraction Pattern ranking Pattern selection Instance extraction Instance ranking Instance selection
Until a stopping criterion is met
04/22/234
Espresso uses pattern-instance matrix Afor ranking patterns and instances|P|×|I|-dimensional matrix holding the (normalized) pointwise mutual information (pmi) between patterns and instances
04/22/235
1 2 ... i ... |I|
1
2
:
p
:
|P|
[A]p,i = pmi(p,i) / maxp,i pmi(p,i)
instance indices
pattern indices
Pattern/instance ranking in Espressop = pattern score vectori = instance score vectorA = pattern-instance matrix
04/22/236
€
p ← 1IAi... pattern ranking
€
i ← 1PATp... instance ranking
Reliable instances are supported by reliable patterns, and vice versa
|P| = number of patterns|I| = number of instances
normalization factorsto keep score vectors not too large
Espresso Algorithm [Pantel and Pennacchiotti, 2006]
Repeat Pattern extraction Pattern ranking Pattern selection Instance extraction Instance ranking Instance selection
Until a stopping criterion is met
04/22/237
For graph-theoretic analysis,
we will introduce 3 simplifications to
Espresso
First simplification Compute the pattern-instance matrix Repeat
Pattern extraction Pattern ranking Pattern selection Instance extraction Instance ranking Instance selection
Until a stopping criterion is met
04/22/238
Simplification 1Remove pattern/instance extraction stepsInstead, pre-compute all patterns and instances once in the beginning of the algorithm
Second simplification Compute the pattern-instance matrix Repeat
Pattern ranking Pattern selection
Instance ranking Instance selection
Until a stopping criterion is met
04/22/239
Simplification 2Remove pattern/instance selection stepswhich retain only highest scoring k patterns / m instances for the next iterationi.e., reset the scores of other items to 0
Instead,retain scores of all patterns and instances
Third simplification Compute the pattern-instance matrix Repeat
Pattern ranking
Instance ranking
Until a stopping criterion is met
04/22/2310
Until score vectors p and i converge
Simplification 3No early stopping i.e., run until convergence
Input Initial score vector of seed instances Pattern-instance co-occurrence matrix A
Main loopRepeat
Until i and p converge
OutputInstance and pattern score vectors i and p
€
p ← 1IAi... pattern ranking
€
i ← 1PATp... instance ranking
04/22/2311
Simplified Espresso
€
i = 0,1,0,0( )
04/22/2312
Simplified Espresso is HITSSimplified Espresso = HITS in a bipartite graph whose adjacency matrix is A
Problem
No matter which seed you start with, the same instance is always ranked topmostSemantic drift (also called topic drift in HITS)
The ranking vector i tends to the principal eigenvector of ATA as the iteration proceedsregardless of the seed instances!
How about the original Espresso?Original Espresso has heuristics not present in Simplified Espresso
Early stopping Pattern and instance selection
Do these heuristics really help reduce semantic drift?
04/22/2313
Experiments on semantic drift
Does the heuristics in original Espresso help reduce drift?
04/22/2314
04/22/2315
Word sense disambiguation task of Senseval-3 English Lexical Sample
Predict the sense of “bank”
… the financial benefits of the bank (finance) 's employee package ( cheap mortgages and pensions, etc ) , bring this up to …
In that same year I was posted to South Shields on the south bank (bank of the river) of the River Tyne and quickly became aware that I had an enormous burden
Possibly aligned to water a sort of bank(???) by a rushing river.
Training instances are annotated with their sense
Predict the sense of target word in the test set
04/22/2316
Word sense disambiguation by Original Espresso
Seed instance = the instance to predict its sense
System output = k-nearest neighbor (k=3)Heuristics of Espresso Pattern and instance selection
# of patterns to retain p=20 (increase p by 1 on each iteration)
# of instance to retain m=100 (increase m by 100 on each iteration)
Early stopping
Seed instance
04/22/2317
Convergence process of EspressoHeuristics in Espresso helps reducing
semantic drift(However, early stopping is required for
optimal performance)
Output the most frequent sense regardless of input
Original Espresso
Simplified Espresso
Most frequent sense (baseline)
Semantic drift occurs (always outputs the most frequent sense)
Learning curve of Original Espresso:per-sense breakdown
04/22/2318
# of most frequent sense predictions increases
Recall for infrequent senses worsens even with
original Espresso
Most frequent sense
Other senses
04/22/2319
Summary: Espresso and semantic driftSemantic drift happens because Espresso is designed like HITS HITS gives the same ranking list regardless of seeds
Some heuristics reduce semantic drift Early stopping is crucial for optimal performance
Still, these heuristics require many parameters to be calibrated but calibration is difficult
04/22/2320
Main contributions of this work
20
1. Suggest a parallel between semantic drift in Espresso-like bootstrapping and topic drift in HITS (Kleinberg, 1999)
2. Solve semantic drift by graph kernels used in link analysis community
Q. What caused drift in Espresso?A. Espresso's resemblance to HITS
HITS is an importance computation method(gives a single ranking list for any seeds)
Why not use a method for another type of link analysis measure - which takes seeds into account?"relatedness" measure
(it gives different rankings for different seeds)
04/22/2321
04/22/2322
The regularized Laplacian kernel A relatedness measure Takes higher-order relations into account
Has only one parameter
€
L = D− AGraph Laplacian
R n ( L)n (I L) 1
n0
Regularized Laplacian matrix
A: adjacency matrix of the graphD: (diagonal) degree matrix
β: parameterEach column of Rβ gives the rankings relative to a node
Evaluation of regularized Laplacian
Comparison to (Agirre et al. 2006)
04/22/2323
algorithm F measure
Most frequent sense (baseline)
54.5
HyperLex 64.6PageRank 64.6Simplified Espresso 44.1Espresso (after convergence)
46.9
Espresso (optimal stopping)
66.5
Regularized Laplacian (β=10-2)
67.104/22/2324
WSD on all nouns in Senseval-3
Outperforms other graph-based methods
Espresso needs optimal stopping to achieve an equivalent
performance
04/22/2325
Conclusions
Semantic drift in Espresso is a parallel form of topic drift in HITS
The regularized Laplacian reduces semantic drift inherently a relatedness measure ( importance measure)
04/22/2326
Future work Investigate if a similar analysis is applicable to a wider class of bootstrapping algorithms (including co-training)
Try other popular tasks of bootstrapping such as named entity extraction Selection of seed instances matters
04/22/2327
Algorithm Most frequent sense
Other senses
Simplified Espresso 100.0 0.0Espresso (after convergence)
100.0 30.2
Espresso (optimal stopping)
94.4 67.4
Regularized Laplacian (β=10-2)
92.1 62.8
04/22/2328
Label prediction of “bank” (F measure)
The regularized Laplacian keeps high recall for infrequent
senses
Espresso suffers from semantic drift
(unless stopped at optimal stage)
Regularized Laplacian is mostly stable across a parameter
04/22/2329
Pattern/instance ranking in EspressoScore for pattern p Score for instance
i
08.10.243030
r (p)
pmi(i, p)max pmi
r(i)
iI
I
r(i)
pmi(i, p)max pmi
r (p)
pP
P
,**,,*,,,
log),(pyxypx
pipmi
p: patterni: instanceP: set of patternsI: set of instancespmi: pointwise mutual informationmax pmi: max of pmi in all the patterns
and instances