LINDEN: Linking Named Entities with Knowledge Base via Semantic Knowledge
Wei Shen†, Jianyong Wang†, Ping Luo‡, Min Wang‡
†Tsinghua University, Beijing, China  ‡HP Labs China, Beijing, China
WWW 2012
Presented by Tom Chao Zhou, July 17, 2012
Outline
- Motivation
- Problem Definition
- Previous Methods
- LINDEN Framework
- Experiments
- Conclusion
Motivation
- Many large-scale knowledge bases have emerged: DBpedia, YAGO, Freebase, etc. (e.g., www.freebase.com)
Motivation (Cont'd)
- As the world evolves, new facts come into existence and are digitally expressed on the Web
- Goals: maintaining and growing the existing knowledge bases; integrating the extracted facts with the knowledge base
- Challenges:
  - Name variations: "National Basketball Association" vs. "NBA"; "New York City" vs. "Big Apple"
  - Entity ambiguity: "Michael Jordan" may refer to the NBA player, the Berkeley professor, ...
Problem Definition
Entity linking task:
- Input: a textual named entity mention m, already recognized in the unstructured text
- Output: the corresponding real-world entity e in the knowledge base
- If no matching entity e for mention m exists in the knowledge base, return NIL for m
Entity linking task example (source: From Information to Knowledge: Harvesting Entities and Relationships from Web Sources, PODS'10):

"German Chancellor Angela Merkel and her husband Joachim Sauer went to Ulm, Germany."

(In the figure, one mention has no matching entity and is linked to NIL.)

Figure 1: An example of YAGO
Previous Methods
Essential step of entity linking:
- Define a similarity measure between the text around the entity mention and the document associated with the entity
Bag-of-words model:
- Represent the context as a term vector
- Measure the co-occurrence statistics of terms
- Cannot capture semantic knowledge
Example text: "Michael Jordan wins an NBA championship." The bag-of-words model cannot work well here!
LINDEN Framework
- Candidate Entity Generation
  - For each named entity mention m, retrieve the set of candidate entities Em
- Named Entity Disambiguation
  - For each candidate entity e ∈ Em, define a scoring measure and rank Em
- Unlinkable Mention Prediction
  - For the entity etop with the highest score in Em, validate whether etop is the target entity for mention m
Candidate Entity Generation
- Intuitively, the candidates in Em should have the surface form of m as one of their names
- We build a dictionary that contains a vast amount of information about the surface forms of entities: name variations, abbreviations, confusable names, spelling variations, nicknames, etc.
- The dictionary leverages four structures of Wikipedia: entity pages, redirect pages, disambiguation pages, and hyperlinks in Wikipedia articles
Candidate Entity Generation (Cont'd)
- For each mention m, search it in the field of surface forms
- If a hit is found, add all target entities of that surface form to the set of candidate entities Em

Table 1: An example of the dictionary
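The lookup step above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the dictionary entries are toy examples borrowed from the slide's running example.

```python
# Hypothetical surface-form dictionary: maps a surface form to the entities
# it may refer to, as would be mined from Wikipedia entity pages, redirect
# pages, disambiguation pages, and anchor texts.
surface_form_dict = {
    "NBA": ["National Basketball Association", "Nepal Basketball Association"],
    "Michael Jordan": ["Michael J. Jordan", "Michael I. Jordan"],
}

def generate_candidates(mention):
    """Return the candidate entity set E_m for a mention m.

    If the surface form is not in the dictionary, E_m is empty and the
    mention will later be predicted as unlinkable (NIL).
    """
    return set(surface_form_dict.get(mention, []))
```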
Named Entity Disambiguation
Goal: rank the candidate entities according to their scores.
Four features are defined:
- Feature 1: Link probability, based on the count information in the dictionary
- Semantic-network-based features:
  - Feature 2: Semantic associativity, based on the Wikipedia hyperlink structure
  - Feature 3: Semantic similarity, derived from the taxonomy of YAGO
  - Feature 4: Global coherence, the document-level topical coherence among entities
Link Probability
Feature 1: link probability LP(e|m) for candidate entity e:

LP(e|m) = count_m(e) / Σ_{e' ∈ Em} count_m(e')

where count_m(e) is the number of links which point to entity e and have the surface form m.

Table 1: An example of the dictionary (the LP column shows values such as 0.81 and 0.05)
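A minimal sketch of this feature, assuming we already have the per-surface-form link counts from the dictionary; the count values below are illustrative, not from Table 1.

```python
def link_probability(counts, entity):
    """LP(e|m) = count_m(e) / sum of count_m(e') over all candidates e'.

    counts: dict mapping candidate entity -> count_m(entity) for one
    surface form m (i.e., one row group of the dictionary).
    """
    total = sum(counts.values())
    return counts[entity] / total if total else 0.0

# Toy counts for the surface form "Michael Jordan" (illustrative numbers).
counts = {
    "Michael J. Jordan": 810,
    "Michael I. Jordan": 50,
    "Michael Jordan (mycologist)": 140,
}
```

Because the counts are normalized by the total, the LP values of all candidates for a surface form sum to 1.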
Semantic Network Construction
- Recognize all the Wikipedia concepts Γd in the document d, using the open-source toolkit Wikipedia-Miner¹
- Example: "The Chicago Bulls' player Michael Jordan won his first NBA championship in 1991."
  - Set of entity mentions: {Michael Jordan, NBA}
  - Candidate entities: Michael Jordan → {Michael J. Jordan, Michael I. Jordan}; NBA → {National Basketball Association, Nepal Basketball Association}
  - Γd: {NBA All-Star Game, David Joel Stern, Charlotte Bobcats, Chicago Bulls}
- The network is built from the hyperlink structure of Wikipedia articles and the taxonomy of concepts in YAGO

¹http://wikipedia-miner.sourceforge.net/index.htm

Figure 2: An example of the constructed semantic network
Semantic Associativity
Feature 2: semantic associativity SA(e) for each candidate entity e.

Figure 2: An example of the constructed semantic network
Semantic Associativity (Cont'd)
Given two Wikipedia concepts e1 and e2, the Wikipedia Link-based Measure (WLM) [1] defines the semantic associativity between them as

SA(e1, e2) = 1 − (log(max(|E1|, |E2|)) − log(|E1 ∩ E2|)) / (log(|W|) − log(min(|E1|, |E2|)))

where E1 and E2 are the sets of Wikipedia concepts that hyperlink to e1 and e2 respectively, and W is the set of all concepts in Wikipedia.

[1] D. Milne and I. H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of WIKIAI, 2008.
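The WLM computation above can be sketched directly from the in-link sets; this is a simplified illustration that clamps negative relatedness to zero, which is a common convention but an assumption here.

```python
import math

def wlm(E1, E2, W_size):
    """Wikipedia Link-based Measure (Milne & Witten, 2008).

    E1, E2: sets of concepts that hyperlink to e1 and e2.
    W_size: |W|, the total number of concepts in Wikipedia.
    Returns a relatedness score in [0, 1]; 0 if the in-link sets are disjoint.
    """
    overlap = len(E1 & E2)
    if overlap == 0:
        return 0.0
    a, b = len(E1), len(E2)
    sr = 1 - (math.log(max(a, b)) - math.log(overlap)) / \
             (math.log(W_size) - math.log(min(a, b)))
    return max(0.0, sr)  # clamp: very unrelated pairs can go negative
```

Two concepts with identical in-link sets get a score of 1.0, and the score decreases as the overlap shrinks relative to the set sizes.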
Semantic Similarity
Feature 3: semantic similarity SS(e) for each candidate entity e, computed over Θk, the set of the k context concepts in Γd which have the highest semantic similarity with entity e (k = 2 in the example).

Figure 2: An example of the constructed semantic network
Semantic Similarity (Cont'd)
Given two Wikipedia concepts e1 and e2, assume the sets of their super-classes are Φe1 and Φe2. For each class C1 in the set Φe1, assign a target class ε(C1) in the other set Φe2 as

ε(C1) = argmax_{C2 ∈ Φe2} sim(C1, C2)

where sim(C1, C2) is the semantic similarity between two classes C1 and C2.

To compute sim(C1, C2), adopt the information-theoretic approach introduced in [2]:

sim(C1, C2) = 2 log P(C0) / (log P(C1) + log P(C2))

where C0 is the lowest common ancestor node for class nodes C1 and C2 in the hierarchy, and P(C) is the probability that a randomly selected object belongs to the subtree rooted at C in the taxonomy.

[2] D. Lin. An information-theoretic definition of similarity. In Proceedings of ICML, pages 296–304, 1998.
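Lin's class similarity can be sketched on a toy taxonomy; the class names, probabilities, and lowest-common-ancestor argument below are illustrative assumptions, not values from YAGO.

```python
import math

# Hypothetical subtree probabilities P(C) for a tiny taxonomy fragment:
# "entity" is the root, "person" subsumes "athlete" and "scientist".
P = {"entity": 1.0, "person": 0.4, "athlete": 0.1, "scientist": 0.05}

def lin_similarity(c1, c2, lca):
    """Lin (1998): sim(C1, C2) = 2*log P(C0) / (log P(C1) + log P(C2)),
    where C0 (here passed in as `lca`) is the lowest common ancestor of
    C1 and C2 in the taxonomy."""
    return 2 * math.log(P[lca]) / (math.log(P[c1]) + math.log(P[c2]))
```

A class is maximally similar to itself (score 1.0), and two classes whose only common ancestor is the root (P = 1, log P = 0) get a score of 0.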
Semantic Similarity (Cont'd)
- Calculate the semantic similarity from one set of classes Φe1 to another set of classes Φe2 as the average of sim(C1, ε(C1)) over all C1 ∈ Φe1
- Define the semantic similarity between Wikipedia concepts e1 and e2 by combining the similarities computed in both directions
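The set-to-set step can be sketched as follows. Averaging the two directions symmetrically is an assumption about how the directional scores are combined; `sim` is passed in as a class-similarity function (e.g. Lin's measure).

```python
def set_similarity(phi1, phi2, sim):
    """Directional similarity from class set phi1 to class set phi2.

    Each class C1 in phi1 is matched to its best target
    eps(C1) = argmax over C2 in phi2 of sim(C1, C2),
    and the matched scores are averaged over phi1.
    """
    return sum(max(sim(c1, c2) for c2 in phi2) for c1 in phi1) / len(phi1)

def concept_similarity(phi1, phi2, sim):
    """Similarity between two concepts via their super-class sets,
    averaging the two directional scores (an assumed symmetrization)."""
    return (set_similarity(phi1, phi2, sim) +
            set_similarity(phi2, phi1, sim)) / 2
```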
Global Coherence
Feature 4: global coherence GC(e) for each candidate entity e, measured as the average semantic associativity of candidate entity e to the mapping entities of the other mentions:

GC(e) = (1 / |M \ {m}|) Σ_{m' ∈ M \ {m}} SA(e, e_m')

where e_m' is the mapping entity of mention m'. Since the true mapping entities are unknown in advance, the most likely assigned entity is substituted for the mapping entity in Formula 9; the most likely assigned entity e'_m' for mention m' is defined as the candidate entity which has the maximum link probability in Em'.
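As a minimal sketch, assuming the other mentions' most likely entities have already been picked by link probability, the averaging itself is:

```python
def global_coherence(e, other_assignments, sa):
    """GC(e): average semantic associativity between candidate entity e and
    the entities assigned to the other mentions in the document.

    other_assignments: list of most likely entities e'_m' for mentions m' != m.
    sa: a semantic-associativity function (e.g. the WLM-based measure).
    """
    if not other_assignments:
        return 0.0  # a single-mention document has no coherence signal
    return sum(sa(e, e2) for e2 in other_assignments) / len(other_assignments)
```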
Global Coherence (Cont'd)

Figure 2: An example of the constructed semantic network
Candidates Ranking
- Generate a feature vector Fm(e) for each e ∈ Em
- Calculate Score_m(e) = w · Fm(e) for each candidate e, where w is the weight vector which gives a different weight to each feature element in Fm(e)
- Rank the candidates and pick the top candidate as the predicted mapping entity for mention m
- To learn w, use a max-margin technique on the training data set: assume Score_m(e*) is larger than any other Score_m(e) with a margin, and minimize over w and the slack variables ξm ≥ 0 an objective of the form

  ||w||² + C Σ_m ξm  subject to  w · Fm(e*) ≥ w · Fm(e) + 1 − ξm  for all e ∈ Em \ {e*}
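The scoring and ranking step (not the max-margin learning of w) can be sketched as follows; the weight values and feature vectors are illustrative assumptions.

```python
def score(w, features):
    """Linear score: dot product of the weight vector w with F_m(e)."""
    return sum(wi * fi for wi, fi in zip(w, features))

def rank_candidates(w, feature_vectors):
    """Rank candidate entities by descending score.

    feature_vectors: dict entity -> [LP, SA, SS, GC] feature vector.
    Returns the entity names, best first; element 0 is the predicted etop.
    """
    return sorted(feature_vectors,
                  key=lambda e: score(w, feature_vectors[e]),
                  reverse=True)
```

In practice w would come from the max-margin training described above; with uniform weights the ranking simply favors the candidate with the largest feature sum.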
Unlinkable Mention Prediction
Predict mention m as an unlinkable mention (NIL) if:
- the size of Em generated in the Candidate Entity Generation module is equal to zero, or
- Score_m(etop) is smaller than the learned threshold τ
Experiment Setup
Data sets:
- CZ data set: newswire data used by Cucerzan [3]
- TAC-KBP2009 data set: used in the Knowledge Base Population (KBP) track at the Text Analysis Conference (TAC) 2009
Parameter learning: 10-fold cross-validation

[3] S. Cucerzan. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of EMNLP-CoNLL, pages 708–716, 2007.
Results over the CZ data set
Results on the TAC-KBP2009 data set
Conclusion
LINDEN:
- A novel framework to link named entities in text with YAGO
- Leverages the rich semantic knowledge derived from Wikipedia and the taxonomy of YAGO
- Significantly outperforms the state-of-the-art methods in terms of accuracy
Thanks! Q&A