LINDEN: Linking Named Entities with Knowledge Base via Semantic Knowledge
Wei Shen†, Jianyong Wang†, Ping Luo‡, Min Wang‡
†Tsinghua University, Beijing, China  ‡HP Labs China, Beijing, China
WWW 2012
Presented by Tom Chao Zhou, July 17, 2012
Outline
- Motivation
- Problem Definition
- Previous Methods
- LINDEN Framework
- Experiments
- Conclusion
Motivation
- Many large-scale knowledge bases have emerged: DBpedia, YAGO, Freebase, etc. (e.g., www.freebase.com)
Motivation (Cont'd)
- As the world evolves, new facts come into existence and are digitally expressed on the Web
- Goals: maintaining and growing the existing knowledge bases; integrating the extracted facts with the knowledge base
- Challenges:
  - Name variations: "National Basketball Association" vs. "NBA"; "New York City" vs. "Big Apple"
  - Entity ambiguity: "Michael Jordan" may refer to the NBA player, the Berkeley professor, ...
Problem Definition
Entity linking task:
- Input: a textual named entity mention m, already recognized in the unstructured text
- Output: the corresponding real-world entity e in the knowledge base
- If no matching entity e for mention m exists in the knowledge base, return NIL for m
Entity linking task example (source: From Information to Knowledge: Harvesting Entities and Relationships from Web Sources, PODS'10):

"German Chancellor Angela Merkel and her husband Joachim Sauer went to Ulm, Germany."

(In the figure, one mention has no matching entity and is linked to NIL.)

Figure 1: An example of YAGO
Previous Methods
Essential step of entity linking:
- Define a similarity measure between the text around the entity mention and the document associated with the entity
Bag-of-words model:
- Represent the context as a term vector
- Measure the co-occurrence statistics of terms
- Cannot capture semantic knowledge
Example text: "Michael Jordan wins an NBA championship." The bag-of-words model cannot work well here!
LINDEN Framework
- Candidate Entity Generation
  - For each named entity mention m, retrieve the set of candidate entities Em
- Named Entity Disambiguation
  - For each candidate entity e ∈ Em, define a scoring measure and rank Em
- Unlinkable Mention Prediction
  - For the entity etop with the highest score in Em, validate whether etop is the target entity for mention m
Candidate Entity Generation
- Intuitively, the candidates in Em should have the surface form of m as one of their names
- We build a dictionary that contains a vast amount of information about the surface forms of entities: name variations, abbreviations, confusable names, spelling variations, nicknames, etc.
- The dictionary leverages four structures of Wikipedia: entity pages, redirect pages, disambiguation pages, and hyperlinks in Wikipedia articles
Candidate Entity Generation (Cont'd)
- For each mention m, search it in the field of surface forms
- If a hit is found, add all target entities of that surface form to the set of candidate entities Em

Table 1: An example of the dictionary
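The lookup step above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the dictionary entries are toy examples borrowed from the slide's running example.

```python
# Hypothetical surface-form dictionary: maps a surface form to the entities
# it may refer to, as would be mined from Wikipedia entity pages, redirect
# pages, disambiguation pages, and anchor texts.
surface_form_dict = {
    "NBA": ["National Basketball Association", "Nepal Basketball Association"],
    "Michael Jordan": ["Michael J. Jordan", "Michael I. Jordan"],
}

def generate_candidates(mention):
    """Return the candidate entity set E_m for a mention m.

    If the surface form is not in the dictionary, E_m is empty and the
    mention will later be predicted as unlinkable (NIL).
    """
    return set(surface_form_dict.get(mention, []))
```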
Named Entity Disambiguation
Goal: rank the candidate entities according to their scores.
Four features are defined:
- Feature 1: Link probability, based on the count information in the dictionary
- Semantic-network-based features:
  - Feature 2: Semantic associativity, based on the Wikipedia hyperlink structure
  - Feature 3: Semantic similarity, derived from the taxonomy of YAGO
  - Feature 4: Global coherence, the document-level topical coherence among entities
Link Probability
Feature 1: link probability LP(e|m) for candidate entity e:

LP(e|m) = count_m(e) / Σ_{e' ∈ Em} count_m(e')

where count_m(e) is the number of links which point to entity e and have the surface form m.

Table 1: An example of the dictionary (the LP column shows values such as 0.81 and 0.05)
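A minimal sketch of this feature, assuming we already have the per-surface-form link counts from the dictionary; the count values below are illustrative, not from Table 1.

```python
def link_probability(counts, entity):
    """LP(e|m) = count_m(e) / sum of count_m(e') over all candidates e'.

    counts: dict mapping candidate entity -> count_m(entity) for one
    surface form m (i.e., one row group of the dictionary).
    """
    total = sum(counts.values())
    return counts[entity] / total if total else 0.0

# Toy counts for the surface form "Michael Jordan" (illustrative numbers).
counts = {
    "Michael J. Jordan": 810,
    "Michael I. Jordan": 50,
    "Michael Jordan (mycologist)": 140,
}
```

Because the counts are normalized by the total, the LP values of all candidates for a surface form sum to 1.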
Semantic Network Construction
- Recognize all the Wikipedia concepts Γd in the document d, using the open-source toolkit Wikipedia-Miner¹
- Example: "The Chicago Bulls' player Michael Jordan won his first NBA championship in 1991."
  - Set of entity mentions: {Michael Jordan, NBA}
  - Candidate entities: Michael Jordan → {Michael J. Jordan, Michael I. Jordan}; NBA → {National Basketball Association, Nepal Basketball Association}
  - Γd: {NBA All-Star Game, David Joel Stern, Charlotte Bobcats, Chicago Bulls}
- The network is built from the hyperlink structure of Wikipedia articles and the taxonomy of concepts in YAGO

¹http://wikipedia-miner.sourceforge.net/index.htm

Figure 2: An example of the constructed semantic network
Semantic Associativity
Feature 2: semantic associativity SA(e) for each candidate entity e.

Figure 2: An example of the constructed semantic network
Semantic Associativity (Cont'd)
Given two Wikipedia concepts e1 and e2, the Wikipedia Link-based Measure (WLM) [1] defines the semantic associativity between them as

SA(e1, e2) = 1 − (log(max(|E1|, |E2|)) − log(|E1 ∩ E2|)) / (log(|W|) − log(min(|E1|, |E2|)))

where E1 and E2 are the sets of Wikipedia concepts that hyperlink to e1 and e2 respectively, and W is the set of all concepts in Wikipedia.

[1] D. Milne and I. H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of WIKIAI, 2008.
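The WLM computation above can be sketched directly from the in-link sets; this is a simplified illustration that clamps negative relatedness to zero, which is a common convention but an assumption here.

```python
import math

def wlm(E1, E2, W_size):
    """Wikipedia Link-based Measure (Milne & Witten, 2008).

    E1, E2: sets of concepts that hyperlink to e1 and e2.
    W_size: |W|, the total number of concepts in Wikipedia.
    Returns a relatedness score in [0, 1]; 0 if the in-link sets are disjoint.
    """
    overlap = len(E1 & E2)
    if overlap == 0:
        return 0.0
    a, b = len(E1), len(E2)
    sr = 1 - (math.log(max(a, b)) - math.log(overlap)) / \
             (math.log(W_size) - math.log(min(a, b)))
    return max(0.0, sr)  # clamp: very unrelated pairs can go negative
```

Two concepts with identical in-link sets get a score of 1.0, and the score decreases as the overlap shrinks relative to the set sizes.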
Semantic Similarity
Feature 3: semantic similarity SS(e) for each candidate entity e, computed over Θk, the set of the k context concepts in Γd which have the highest semantic similarity with entity e (k = 2 in the example).

Figure 2: An example of the constructed semantic network
Semantic Similarity (Cont'd)
Given two Wikipedia concepts e1 and e2, assume the sets of their super-classes are Φe1 and Φe2. For each class C1 in the set Φe1, assign a target class ε(C1) in the other set Φe2 as

ε(C1) = argmax_{C2 ∈ Φe2} sim(C1, C2)

where sim(C1, C2) is the semantic similarity between two classes C1 and C2.

To compute sim(C1, C2), adopt the information-theoretic approach introduced in [2]:

sim(C1, C2) = 2 log P(C0) / (log P(C1) + log P(C2))

where C0 is the lowest common ancestor node for class nodes C1 and C2 in the hierarchy, and P(C) is the probability that a randomly selected object belongs to the subtree rooted at C in the taxonomy.

[2] D. Lin. An information-theoretic definition of similarity. In Proceedings of ICML, pages 296–304, 1998.
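Lin's class similarity can be sketched on a toy taxonomy; the class names, probabilities, and lowest-common-ancestor argument below are illustrative assumptions, not values from YAGO.

```python
import math

# Hypothetical subtree probabilities P(C) for a tiny taxonomy fragment:
# "entity" is the root, "person" subsumes "athlete" and "scientist".
P = {"entity": 1.0, "person": 0.4, "athlete": 0.1, "scientist": 0.05}

def lin_similarity(c1, c2, lca):
    """Lin (1998): sim(C1, C2) = 2*log P(C0) / (log P(C1) + log P(C2)),
    where C0 (here passed in as `lca`) is the lowest common ancestor of
    C1 and C2 in the taxonomy."""
    return 2 * math.log(P[lca]) / (math.log(P[c1]) + math.log(P[c2]))
```

A class is maximally similar to itself (score 1.0), and two classes whose only common ancestor is the root (P = 1, log P = 0) get a score of 0.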
Semantic Similarity (Cont'd)
- Calculate the semantic similarity from one set of classes Φe1 to another set of classes Φe2 as the average of sim(C1, ε(C1)) over all C1 ∈ Φe1
- Define the semantic similarity between Wikipedia concepts e1 and e2 by combining the similarities computed in both directions
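The set-to-set step can be sketched as follows. Averaging the two directions symmetrically is an assumption about how the directional scores are combined; `sim` is passed in as a class-similarity function (e.g. Lin's measure).

```python
def set_similarity(phi1, phi2, sim):
    """Directional similarity from class set phi1 to class set phi2.

    Each class C1 in phi1 is matched to its best target
    eps(C1) = argmax over C2 in phi2 of sim(C1, C2),
    and the matched scores are averaged over phi1.
    """
    return sum(max(sim(c1, c2) for c2 in phi2) for c1 in phi1) / len(phi1)

def concept_similarity(phi1, phi2, sim):
    """Similarity between two concepts via their super-class sets,
    averaging the two directional scores (an assumed symmetrization)."""
    return (set_similarity(phi1, phi2, sim) +
            set_similarity(phi2, phi1, sim)) / 2
```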
Global Coherence
Feature 4: global coherence GC(e) for each candidate entity e, measured as the average semantic associativity of candidate entity e to the mapping entities of the other mentions:

GC(e) = (1 / |M \ {m}|) Σ_{m' ∈ M \ {m}} SA(e, e_m')

where e_m' is the mapping entity of mention m'. Since the true mapping entities are unknown in advance, the most likely assigned entity is substituted for the mapping entity in Formula 9; the most likely assigned entity e'_m' for mention m' is defined as the candidate entity which has the maximum link probability in Em'.
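As a minimal sketch, assuming the other mentions' most likely entities have already been picked by link probability, the averaging itself is:

```python
def global_coherence(e, other_assignments, sa):
    """GC(e): average semantic associativity between candidate entity e and
    the entities assigned to the other mentions in the document.

    other_assignments: list of most likely entities e'_m' for mentions m' != m.
    sa: a semantic-associativity function (e.g. the WLM-based measure).
    """
    if not other_assignments:
        return 0.0  # a single-mention document has no coherence signal
    return sum(sa(e, e2) for e2 in other_assignments) / len(other_assignments)
```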
Global Coherence (Cont'd)

Figure 2: An example of the constructed semantic network
Candidates Ranking
- Generate a feature vector Fm(e) for each e ∈ Em
- Calculate Score_m(e) = w · Fm(e) for each candidate e, where w is the weight vector which gives a different weight to each feature element in Fm(e)
- Rank the candidates and pick the top candidate as the predicted mapping entity for mention m
- To learn w, use a max-margin technique on the training data set: assume Score_m(e*) is larger than any other Score_m(e) with a margin, and minimize over w and the slack variables ξm ≥ 0 an objective of the form

  ||w||² + C Σ_m ξm  subject to  w · Fm(e*) ≥ w · Fm(e) + 1 − ξm  for all e ∈ Em \ {e*}
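The scoring and ranking step (not the max-margin learning of w) can be sketched as follows; the weight values and feature vectors are illustrative assumptions.

```python
def score(w, features):
    """Linear score: dot product of the weight vector w with F_m(e)."""
    return sum(wi * fi for wi, fi in zip(w, features))

def rank_candidates(w, feature_vectors):
    """Rank candidate entities by descending score.

    feature_vectors: dict entity -> [LP, SA, SS, GC] feature vector.
    Returns the entity names, best first; element 0 is the predicted etop.
    """
    return sorted(feature_vectors,
                  key=lambda e: score(w, feature_vectors[e]),
                  reverse=True)
```

In practice w would come from the max-margin training described above; with uniform weights the ranking simply favors the candidate with the largest feature sum.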
Unlinkable Mention Prediction
Predict mention m as an unlinkable mention (NIL) if:
- the size of Em generated in the Candidate Entity Generation module is equal to zero, or
- Score_m(etop) is smaller than the learned threshold τ
Experiment Setup
Data sets:
- CZ data set: newswire data used by Cucerzan [3]
- TAC-KBP2009 data set: used in the Knowledge Base Population (KBP) track at the Text Analysis Conference (TAC) 2009
Parameter learning: 10-fold cross-validation

[3] S. Cucerzan. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of EMNLP-CoNLL, pages 708–716, 2007.
Results over the CZ data set
Results on the TAC-KBP2009 data set
Conclusion
LINDEN:
- A novel framework to link named entities in text with YAGO
- Leverages the rich semantic knowledge derived from Wikipedia and the taxonomy of YAGO
- Significantly outperforms the state-of-the-art methods in terms of accuracy
Thanks! Q&A