Language-Independent Set Expansion of Named Entities using the Web
Richard C. Wang & William W. Cohen
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA
Language Technologies Institute, Carnegie Mellon University
Language-Independent Set Expansion, Richard C. Wang
Outline
  Introduction
  System Architecture
    Fetcher
    Extractor
    Ranker
  Evaluation
  Conclusion
What is Set Expansion?
For example,
  Given a query: {“spit”, “boogers”, “ear wax”}
  Answer is: {“puke”, “toe jam”, “sweat”, ....}
More formally,
  Given a small number of seeds: $x_1, x_2, \ldots, x_k$ where each $x_i \in S_t$
  Answer is a listing of other probable elements: $e_1, e_2, \ldots, e_n$ where each $e_i \in S_t$
A well-known example of a web-based set expansion system is Google Sets™ http://labs.google.com/sets
What is it used for?
Derive features for…
  Named Entity Recognition (Settles, 2004) (Talukdar, 2006)
    Expand true named entities in training set
    Utilize expanded names to assign features to words
  Concept Learning (Cohen, 2000)
    Given a set of instances, look in web pages for tables or lists that contain some of those instances
    Automatically extract features from those pages
    Define features over the instances found
  Relation Learning (Cafarella et al, 2005) (Etzioni et al, 2005)
    Extract items from tables or lists that contain given seeds
    Utilize extracted items and their contexts for learning relations
Our Set Expander: SEAL (Set Expander for Any Language)
Features
  Independent of human/markup language
    Support seeds in English, Chinese, Japanese, Korean, ...
    Accept documents in HTML, XML, SGML, TeX, WikiML, …
  Does not require pre-annotated training data
    Utilize readily-available corpus: World Wide Web
    Learns wrappers on the fly
Based on two research contributions
1. Automatic construction of wrappers
   Extracts “lists” of entities on semi-structured web pages
2. Use of random graph walk
   Ranks extracted entities so that those most likely to be in the target set are ranked higher
System Architecture
Fetcher: download web pages from the Web
Extractor: learn wrappers from web pages
Ranker: rank entities extracted by wrappers
Example ranked output:
1. Canon
2. Nikon
3. Olympus
4. Pentax
5. Sony
6. Kodak
7. Minolta
8. Panasonic
9. Casio
10. Leica
11. Fuji
12. Samsung
13. …
The Fetcher
Procedure:
1. Compose a search query using all seeds
2. Use Google API to request the top N URLs
   We use N = 100, 200, and 300 for evaluation
3. Fetch URLs by using a crawler
4. Send fetched documents to the Extractor
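The four steps above can be sketched as follows. This is a minimal illustration, not the SEAL implementation: `compose_query`, `fetch_documents`, and the injected `search` and `download` callables are hypothetical names, no real search API is called, and quoting each seed as a phrase is an assumption about how the query is composed.

```python
from typing import Callable, Dict, Iterable, List, Optional


def compose_query(seeds: Iterable[str]) -> str:
    """Step 1: build one query from all seeds (phrase-quoting is an assumption)."""
    return " ".join(f'"{seed}"' for seed in seeds)


def fetch_documents(
    seeds: Iterable[str],
    search: Callable[[str, int], List[str]],    # hypothetical: (query, N) -> URLs
    download: Callable[[str], Optional[str]],   # hypothetical: URL -> text, or None on failure
    top_n: int = 100,
) -> Dict[str, str]:
    """Steps 2-4: request the top N URLs, fetch them, return {url: document}."""
    query = compose_query(seeds)
    documents: Dict[str, str] = {}
    for url in search(query, top_n)[:top_n]:
        text = download(url)
        if text is not None:                     # skip pages that could not be fetched
            documents[url] = text
    return documents


# Usage with stand-in callables instead of a real search engine and crawler.
fake_web = {"http://a.example/": "<li>ford</li>", "http://b.example/": None}
docs = fetch_documents(["ford", "nissan"], lambda q, n: list(fake_web), fake_web.get)
print(docs)
```

Passing the search and download steps in as callables keeps the sketch runnable offline; the returned mapping is what would be handed to the Extractor.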
The Extractor
Learn wrappers from web documents and seeds on the fly
  Utilize semi-structured documents
  Wrappers defined at character level
    No tokenization required; thus language independent
    However, very specific; thus page-dependent
    Wrappers derived from document d are applied to d only
<li class="toyota"><a href="http://www.geisauto.com/">
<li class="honda"><a href="http://www.curryauto.com/">
<li class="acura"><a href="http://www.curryauto.com/">
<li class="toyota"><a href="http://www.curryauto.com/">
<li class="nissan"><a href="http://www.curryauto.com/">
<li class="ford"><a href="http://www.curryauto.com/">
Extractor E1 finds maximally-long contexts that bracket all instances of every seed
It seems to be working… but what if I add one more instance of “toyota”?
It seems to be working too… but how about a more complex example?
<li class="ford"><a href="http://www.curryauto.com/"> <img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a> <ul><li class="last"><a href="http://www.curryauto.com/"> <span class="dName">Curry Ford</span>...</li></ul> </li>
<li class="honda"><a href="http://www.curryauto.com/"> <img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a> <ul><li><a href="http://www.curryhonda-ga.com/"> <span class="dName">Curry Honda Atlanta</span>...</li> <li><a href="http://www.curryhondamass.com/"> <span class="dName">Curry Honda</span>...</li> <li class="last"><a href="http://www.curryhondany.com/"> <span class="dName">Curry Honda Yorktown</span>...</li></ul> </li>
<li class="acura"><a href="http://www.curryauto.com/"> <img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a> <ul><li class="last"><a href="http://www.curryacura.com/"> <span class="dName">Curry Acura</span>...</li></ul> </li>
<li class="nissan"><a href="http://www.curryauto.com/"> <img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a> <ul><li class="last"><a href="http://www.geisauto.com/"> <span class="dName">Curry Nissan</span>...</li></ul> </li>
<li class="toyota"><a href="http://www.curryauto.com/"> <img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a> <ul><li class="last"><a href="http://www.geisauto.com/toyota/"> <span class="dName">Curry Toyota</span>...</li></ul> </li>
I am a noisy entity mention
Me too!
Can you find common contexts that bracket all instances of every seed?
I guess not!
Let’s try out Extractor E2 and see if it works…
Extractor E2 finds maximally-long contexts that bracket at least one instance of every seed
Hooray! It seems like Extractor E2 works! But how do we get rid of those noisy entity mentions?
Extractor: Summary
A wrapper consists of a pair of left (L) and right (R) context strings
  All strings between (but not containing) L and R are extracted
  Referred to as “candidate entity mentions”
We compared two versions of wrapper: maximally-long contextual strings that bracket…
1. all instances of every seed (Extractor E1)
2. at least one instance of every seed (Extractor E2)
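The difference between the two extractors can be illustrated with a small character-level sketch. It is a brute-force approximation under stated assumptions, not the authors' algorithm: all function names are hypothetical, E2 is approximated by trying every choice of one occurrence per seed (exponential in the worst case, and some of the wrappers it returns may be subsumed by longer ones), and wrappers with an empty left or right context are discarded.

```python
from itertools import product
from os.path import commonprefix
from typing import List, Optional, Sequence, Set, Tuple

Span = Tuple[int, int]      # (start, end) offsets of one seed occurrence
Wrapper = Tuple[str, str]   # (left context L, right context R)


def occurrences(doc: str, seed: str) -> List[Span]:
    """Return the (start, end) span of every occurrence of seed in doc."""
    spans, start = [], doc.find(seed)
    while start != -1:
        spans.append((start, start + len(seed)))
        start = doc.find(seed, start + 1)
    return spans


def shared_context(doc: str, spans: Sequence[Span]) -> Wrapper:
    """Longest left and right strings shared by all the given spans."""
    lefts = [doc[:start][::-1] for start, _ in spans]   # reversed: common suffix -> prefix
    rights = [doc[end:] for _, end in spans]
    return commonprefix(lefts)[::-1], commonprefix(rights)


def wrapper_e1(doc: str, seeds: Sequence[str]) -> Optional[Wrapper]:
    """E1: contexts that bracket ALL instances of every seed."""
    per_seed = [occurrences(doc, seed) for seed in seeds]
    if not all(per_seed):
        return None                                      # some seed is absent from doc
    left, right = shared_context(doc, [s for spans in per_seed for s in spans])
    return (left, right) if left and right else None


def wrappers_e2(doc: str, seeds: Sequence[str]) -> Set[Wrapper]:
    """E2 (brute force): contexts that bracket at least one instance of every seed."""
    per_seed = [occurrences(doc, seed) for seed in seeds]
    found: Set[Wrapper] = set()
    for combo in product(*per_seed):                     # one occurrence per seed
        left, right = shared_context(doc, combo)
        if left and right:
            found.add((left, right))
    return found


def extract(doc: str, wrapper: Wrapper) -> List[str]:
    """Return strings between (but not containing) L and R."""
    left, right = wrapper
    mentions, pos = [], doc.find(left)
    while pos != -1:
        start = pos + len(left)
        end = doc.find(right, start)
        if end == -1:
            break
        mention = doc[start:end]
        if mention and left not in mention:
            mentions.append(mention)
        pos = doc.find(left, pos + 1)
    return mentions


doc = "<b>ford</b><b>honda</b><b>nissan</b><i>ford</i>"
seeds = ["ford", "nissan"]
print(wrapper_e1(doc, seeds))
for wrapper in sorted(wrappers_e2(doc, seeds)):
    print(wrapper, extract(doc, wrapper))
```

In this toy document the stray `<i>ford</i>` forces E1 down to the very short wrapper `('>', '</')`, while E2 can ignore that occurrence and also finds `('<b>', '</b><')`, which extracts `ford`, `honda`, and `nissan`.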
The Ranker
Rank candidate entity mentions based on “similarity” to seeds
  Noisy mentions should be ranked lower
We compare two methods for ranking
1. Extracted Frequency (EF)
   # of times an entity mention is extracted
2. Random Graph Walk (GW)
   Probability of an “entity mention” node being reached in a graph (explained in next slide)
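Extracted Frequency is simple enough to sketch in a few lines. The function name and the input shape (one list of mentions per wrapper) are hypothetical, and lowercasing and trimming mentions before counting is an assumption rather than something the slides specify.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_extracted_frequency(
    extractions: Iterable[Iterable[str]],
) -> List[Tuple[str, int]]:
    """Rank mentions by how many times they were extracted, most frequent first."""
    counts = Counter(
        mention.strip().lower()          # normalization is an assumption
        for mentions in extractions      # one list of mentions per wrapper
        for mention in mentions
    )
    return counts.most_common()


ranked = rank_by_extracted_frequency(
    [["honda", "acura"], ["Honda", "bmw pittsburgh"], ["honda", "acura"]]
)
print(ranked)
```

A noisy mention such as “bmw pittsburgh” tends to be extracted by fewer wrappers, so it falls toward the bottom of the list.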
Building a Graph
A graph consists of a fixed set of…
  Node Types: {seeds, document, wrapper, mention}
  Labeled Directed Edges: {find, derive, extract}
    Each edge asserts that a binary relation r holds
    Each edge has an inverse relation $r^{-1}$ (graph is cyclic)
[Figure: example graph. Nodes: the seeds “ford”, “nissan”, “toyota”; the documents curryauto.com and northpointcars.com; Wrappers #1 to #4; and the mentions “acura” (34.6%), “honda” (26.1%), “chevrolet” (22.5%), “bmw pittsburgh” (8.4%), “volvo chicago” (8.4%). Edges are labeled find, derive, and extract.]
Minkov et al. Contextual Search and Name Disambiguation in Email using Graphs. SIGIR 2006
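One way to represent such a graph is a nested adjacency map in which every edge is stored together with its inverse. The sketch below is illustrative only: `build_graph` and its argument shapes are hypothetical, and the reading that seeds find documents, documents derive wrappers, and wrappers extract mentions is inferred from the edge labels in the figure.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Node = Tuple[str, Hashable]                 # (node type, identifier)
Graph = Dict[Node, Dict[str, List[Node]]]   # node -> relation -> target nodes


def build_graph(
    seeds: Sequence[str],
    wrappers_by_document: Dict[str, List[str]],
    mentions_by_wrapper: Dict[str, List[str]],
) -> Graph:
    graph: Graph = defaultdict(lambda: defaultdict(list))

    def add_edge(x: Node, relation: str, y: Node) -> None:
        graph[x][relation].append(y)
        graph[y][relation + "^-1"].append(x)   # inverse edge makes the graph cyclic

    seed_node: Node = ("seeds", tuple(seeds))
    for document, wrappers in wrappers_by_document.items():
        add_edge(seed_node, "find", ("document", document))
        for wrapper in wrappers:
            add_edge(("document", document), "derive", ("wrapper", wrapper))
            for mention in mentions_by_wrapper.get(wrapper, []):
                add_edge(("wrapper", wrapper), "extract", ("mention", mention))
    return graph


graph = build_graph(
    ["ford", "nissan", "toyota"],
    {"curryauto.com": ["wrapper #1", "wrapper #2"], "northpointcars.com": ["wrapper #3"]},
    {"wrapper #1": ["honda", "acura"], "wrapper #2": ["acura"], "wrapper #3": ["chevrolet"]},
)
print(dict(graph[("mention", "acura")]))
```

Which wrapper belongs to which document in the usage example is made up for illustration; the figure residue does not preserve that detail.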
Random Graph Walk
Legend
  Node: x, y, z (e.g., “curryauto.com”, “wrapper #1”, “honda”, “acura”, ...)
  Edge Relation: r (find, find$^{-1}$, derive, derive$^{-1}$, extract, extract$^{-1}$)
  An edge from x to y with relation r: $x \xrightarrow{r} y$
  Stop Probability: λ, the probability of staying at a node (0.5)
The walk is defined by
  Probability of picking an edge relation r given a source node x
  Probability of picking a target node y given an edge relation r and source node x
  Probability of reaching any node z from x (recursive computation of probability), the sum of
    Probability of staying at node x
    Probability of continuing to node z from x
  where $I(x = z) = 1$ if $x = z$, and $0$ otherwise
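The slide's equations did not survive extraction, so the following is a sketch under assumptions rather than the authors' formulation. One recursion consistent with the annotations is $Q(z \mid x) = \lambda \, I(x = z) + (1 - \lambda) \sum_{r} P(r \mid x) \sum_{y} P(y \mid r, x) \, Q(z \mid y)$, where the symbols $Q$ and $P$ are introduced here for illustration. The code assumes that $P(r \mid x)$ is uniform over the relations leaving x and $P(y \mid r, x)$ is uniform over the targets of r, and it truncates the walk after a fixed number of steps. The function name and graph layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List

Graph = Dict[Hashable, Dict[str, List[Hashable]]]   # node -> relation -> targets


def random_walk(
    graph: Graph, source: Hashable, stop_prob: float = 0.5, max_steps: int = 10
) -> Dict[Hashable, float]:
    """Approximate the probability of the walk stopping at each node."""
    reached: Dict[Hashable, float] = defaultdict(float)
    current = {source: 1.0}                  # probability mass still walking
    for _ in range(max_steps):
        moved: Dict[Hashable, float] = defaultdict(float)
        for x, mass in current.items():
            reached[x] += stop_prob * mass   # the walker stays at x
            relations = graph.get(x, {})
            if not relations:                # dead end: remaining mass stays too
                reached[x] += (1 - stop_prob) * mass
                continue
            for targets in relations.values():
                # uniform choice of relation, then uniform choice of target (assumption)
                share = (1 - stop_prob) * mass / len(relations) / len(targets)
                for y in targets:
                    moved[y] += share
        current = moved
    return dict(reached)


graph = {
    "seeds": {"find": ["d1"]},
    "d1": {"find^-1": ["seeds"], "derive": ["w1", "w2"]},
    "w1": {"derive^-1": ["d1"], "extract": ["honda", "bmw"]},
    "w2": {"derive^-1": ["d1"], "extract": ["honda"]},
    "honda": {"extract^-1": ["w1", "w2"]},
    "bmw": {"extract^-1": ["w1"]},
}
scores = random_walk(graph, "seeds")
mentions = sorted(["honda", "bmw"], key=scores.get, reverse=True)
print(mentions)
```

Propagating probability mass step by step avoids unbounded recursion on a cyclic graph; the mass not yet stopped after `max_steps` steps is simply dropped, so the scores sum to slightly less than 1.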
Evaluation Datasets
Evaluation Method
Mean Average Precision
  Commonly used for evaluating ranked lists in IR
  Contains recall and precision-oriented aspects
  Sensitive to the entire ranking
  Mean of average precisions for each ranked list
Evaluation Procedure (per dataset)
1. Randomly select three true entities and use their first listed mentions as seeds
2. Expand the three seeds obtained from step 1
3. Repeat steps 1 and 2 five times
4. Compute MAP for the five ranked lists
Average precision of a ranked list
  $$\mathrm{AvgPrec}(L) = \sum_{r=1}^{|L|} \frac{\mathrm{Prec}(r) \times \mathrm{NewEntity}(r)}{\#\,\text{True Entities}}$$
  where L = ranked list of extracted mentions, r = rank
  Prec(r) = precision at rank r
  NewEntity(r) = 1 if (a) and (b) are true, 0 otherwise
    (a) Extracted mention at r matches any true mention
    (b) There exists no other extracted mention at rank less than r that is of the same entity as the one at r
  # True Entities = total number of true entities in this dataset
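A minimal sketch of this metric, using the definitions of Prec(r), NewEntity(r), and # True Entities given above. The function names and the `mention_to_entity` lookup are hypothetical, and counting every mention that matches a true mention as correct when computing Prec(r) is an assumption; the slide does not say how duplicate mentions of one entity affect precision.

```python
from typing import Dict, Sequence


def average_precision(
    ranked_mentions: Sequence[str],
    mention_to_entity: Dict[str, str],   # true mention -> the entity it names
    num_true_entities: int,
) -> float:
    seen_entities = set()
    matches = 0          # mentions so far that match any true mention
    total = 0.0
    for rank, mention in enumerate(ranked_mentions, start=1):
        entity = mention_to_entity.get(mention)
        if entity is None:
            continue                      # condition (a) fails
        matches += 1
        if entity not in seen_entities:   # condition (b): first mention of this entity
            seen_entities.add(entity)
            total += matches / rank       # Prec(r) * NewEntity(r)
    return total / num_true_entities


def mean_average_precision(
    ranked_lists: Sequence[Sequence[str]],
    mention_to_entity: Dict[str, str],
    num_true_entities: int,
) -> float:
    """Mean of the average precisions, e.g. over the five trials of a dataset."""
    scores = [
        average_precision(ranked, mention_to_entity, num_true_entities)
        for ranked in ranked_lists
    ]
    return sum(scores) / len(scores)


truth = {"honda": "Honda", "honda motor": "Honda", "acura": "Acura"}
runs = [["honda", "junk", "honda motor", "acura"], ["acura"]]
print(mean_average_precision(runs, truth, num_true_entities=2))
```

In the first list the second Honda mention adds nothing to the sum because its entity was already found at rank 1, which is the behavior condition (b) describes.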
Experimental Results
Legend
[Extractor] + [Ranker] + [Top N URLs]
Extractor = { E1: Extractor E1, E2: Extractor E2 }
Ranker = { EF: Extracted Frequency, GW: Graph Walk }
N = { 100, 200, 300 }
Overall MAP vs. Various Methods (three bar charts; x-axis: Methods, y-axis: MAP (%))
  G.Sets: 14.59%
  G.Sets (Eng): 43.76%
  E1+EF+100: 82.39%
  E2+EF+100: 87.61%
  E2+GW+100: 93.13%
  E2+GW+200: 94.03%
  E2+GW+300: 94.18%
Conclusion & Future Work
Conclusion
  Unsupervised approach for expanding sets of named entities
    Domain and language independent
  SEAL performs better than Google Sets
    Higher Mean Average Precision on our datasets
    Handle not only English, but also Chinese and Japanese
Future Work
  Learn from graphs to re-rank extracted mentions
  Bootstrap named entities by using extracted mentions in previous expansion as seeds
  Identify possible class names for expanded sets
    e.g., car makers, constellations, presidents…
Top three mentions are the seeds
Try it out at http://rcwang.com/seal