Name Disambiguation in Digital Libraries

The Pennsylvania State University

Dongwon Lee

dongwon@psu.edu

Dongwon Lee / NUS 2006

Outline

• Warm-Up
• Motivation & Problem Def.
• Disambiguation by Graphs
• Disambiguation by Groups
• Disambiguation by Googling
• Conclusion

Penn State University

Penn State University

State College, PA: out of nowhere, but close to everywhere

West: 2.5 hours to Pittsburgh

East: 4 hours to New York

South: 3 hours to Washington DC

North: 3 hours to Buffalo

Penn State University

• Founded in 1855
• 23 campuses throughout Pennsylvania
• Main campus at State College, PA
• 84,000 students, 20,800 faculty
• $1.2 billion endowment
• Mascot: “Nittany Lion”
• Penn State ≠ U. Penn

Two CompSci-related divisions:
• Dept. of Computer Science & Engineering (CSE)
• College of Info. Sciences & Technology (IST)

Penn State University

5 DL/DB Faculty
• CSE: Wang-Chien Lee
• IST: C. Lee Giles, Dongwon Lee, Prasenjit Mitra, James Wang

Active Collaboration

BLAST

Penn State University

In 2005, IST hired a faculty member from NUS: Dr. Heng Xu

In 2007, IST plans to hire 1-2 faculty in: Security Risk Analysis; Data Mining

Applications are encouraged: http://ist.psu.edu/ist/facultyrecruiting/

QUAGGA Project

Data Cleaning project @ Penn State: http://pike.psu.edu/quagga/

Goals:
• Scalable
• Semantic and context-aware
• DB-centric system-building

QUAGGA Project

This talk is mainly based on:
• “Group Linkage”, ICDE 2007
• “Are Your Citations Clean? New Challenges and Scenarios in Maintaining Digital Libraries”, ACM CACM 2007
• “Improving Grouped-Entity Resolution using Quasi-Cliques”, ICDM 2006
• “Googled Name Linkage”, Penn State TR, 2006
• “Search Engine Driven Author Name Disambiguation”, JCDL 2006

Slides for this talk are available at: http://pike.psu.edu => talk

Credits

Students
• Ergin Elmacioglu (Penn State, USA)
• Yee Tan Fan (NUS, Singapore)
• Byung-Won On (Penn State, USA)

Collaborators
• C. Lee Giles (Penn State, USA)
• Min-Yen Kan (NUS, Singapore)
• Jaewoo Kang (Korea U., Korea)
• Nick Koudas (U. Toronto, Canada)
• Prasenjit Mitra (Penn State, USA)
• Jian Pei (Simon Fraser U., Canada)
• Divesh Srivastava (AT&T Labs – Research, USA)
• Yi Zhang (UC Santa Cruz, USA)

Eg. ACM DL Portal

Jeffrey D. Ullman @ Stanford Univ.

Eg. DBLP

Eg. DBLP

1. U. Western Ontario
2. Fudan University
3. U. New South Wales
4. UNC, Chapel Hill

Eg. DBLP

Eg. WWW

Eg. People Names

Most common activities of Internet users
• ~30% of search engine queries include person names (R. Guha et al., WWW 2004)

Highly ambiguous
• Only 90,000 different names for 100 million people (U.S. Census Bureau)

Valid changes:
• Customs: Lee, Dongwon vs. Dongwon Lee vs. LEE Dongwon
• Marriage: Carol Dusseau vs. Carol Arpaci-Dusseau
• Misc.: Sean Engelson vs. Shlomo Argamon

Results: mixture of web pages or query results about different people with the same name

Eg. IMDB & Wikipedia

Eg. Product Names

• Automobile models: Honda Fit vs. Honda Jazz
• Companies: T-Fal vs. Tefal
• Electronic devices: Apple iPod Nano 4GB vs. 4GB iPod nano 4GB Apple iPhone vs. Canadian iPhone

Eg. Drug Names

Confusion due to look-alike or sound-alike drug names:
• Primaxin (antibiotic inject.) – Primacor (hypertension inject.)
• Amaryl – Amikin, Flomax – Volmax, Zantac – Xanax

44,000 – 98,000 fatalities each year from medical errors (Institute of Medicine Report, 1999)

Automatic identification of similar drug names has important implications

Name Disambiguation Problem

When names of entities (e.g., people, products, companies, drugs) are:
• Mixed → sort them out
• Split → link them out

Name Disambiguation Problem: the process of detecting and correcting ambiguous named entities that represent the same real-world object

Terminology

Entity: real-world object (e.g., person, product, drug, company)

We view an entity as having two main pieces of information:
• name: textual description of the entity
• contents: metadata or contents describing the entity

Landscape

Abundant research on related problems

Split names
• DB: approximate join, merge/purge, record linkage
• DL: citation matching
• AI: identity uncertainty
• LIS: name authority control

Mixed names
• DM: k-way clustering
• DL: author name disambiguation
• NLP: word sense disambiguation
• IR: query results grouping

Landscape

In a nutshell, existing approaches often do the following:
• For two entities, e1 and e2, capture their information in data structures, D(e1) and D(e2)
• Measure the distance or similarity between the data structures: dist(D(e1), D(e2)) = d
• Decide whether they match: if d < threshold, then e1 and e2 are matching entities

These work well for common applications

Our techniques do name disambiguation better when
• Entities have structures that we can exploit, or
• Entities lack useful information
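
The generic recipe above can be sketched in a few lines. This is only an illustration, not the talk's implementation: the `Entity` class, the choice of a token set for D(e), the Jaccard distance, the threshold of 0.5, and the content tokens are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity: a name plus a set of content tokens."""
    name: str
    contents: frozenset


def describe(e: Entity) -> frozenset:
    """D(e): here simply the entity's content tokens (an assumption)."""
    return e.contents


def dist(d1: frozenset, d2: frozenset) -> float:
    """Jaccard distance between two token sets (one possible choice)."""
    union = d1 | d2
    if not union:
        return 0.0
    return 1.0 - len(d1 & d2) / len(union)


def is_match(e1: Entity, e2: Entity, threshold: float = 0.5) -> bool:
    """Declare a match when dist(D(e1), D(e2)) falls below the threshold."""
    return dist(describe(e1), describe(e2)) < threshold


# Made-up content tokens for three names.
e1 = Entity("Jeffrey D. Ullman", frozenset({"aho", "database", "compilers"}))
e2 = Entity("J. Ullman", frozenset({"aho", "database", "compilers", "automata"}))
e3 = Entity("Shimon Ullman", frozenset({"vision", "perception"}))

print(is_match(e1, e2), is_match(e1, e3))
```

Everything interesting hides inside `describe` and `dist`; the techniques in the rest of the talk can be read as richer choices for those two functions.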

Key Idea

When entities have contents that can be captured as graphs, let’s exploit that structure

In DL, entities often have
• A set of co-authors to work with
• A set of venues to submit to
• A set of topics to work on

If we capture this information as graphs, it may yield better results than using a simple distance

Using Graphs

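[Figure: citations of three author names a, f, and c, each listed with its co-authors; c is a name variant of a]

Citation | Author a | Author f | Author c
c1       | a d e    | g k f    | c d e
c2       | e a k    | f k e    | x c t
c3       | a k g    | e d f    | g c
c4       | z a      | f q      | z c
c5       | a y      | t f      | c y
c6       | x a      | f r      | j c
c7       | a g      | p f      |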

Co-author sets:
• a = {d, e, k, g, z, y, x}
• f = {d, e, k, g, p, r, t, q}
• c = {d, e, x, g, z, y, j, t}

Jaccard(a, f) = 4/11 = 0.36

Jaccard(a, c) = 6/9 = 0.67 (c is a name variant of a)
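
A minimal sketch of the set-based Jaccard computation, using the three co-author sets listed on the slide. The function name `jaccard` and the dictionary layout are illustrative choices. Because the measure treats co-authors as an unordered set, it cannot see how the co-authors relate to one another, which is the limitation the graph-based approach targets.

```python
def jaccard(s1: set, s2: set) -> float:
    """|s1 & s2| / |s1 | s2|, defined as 0.0 for two empty sets."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


# Co-author sets as listed on the slide.
coauthors = {
    "a": {"d", "e", "k", "g", "z", "y", "x"},
    "f": {"d", "e", "k", "g", "p", "r", "t", "q"},
    "c": {"d", "e", "x", "g", "z", "y", "j", "t"},
}

for other in ("f", "c"):
    shared = coauthors["a"] & coauthors[other]
    total = coauthors["a"] | coauthors[other]
    score = jaccard(coauthors["a"], coauthors[other])
    print(f"Jaccard(a, {other}) = {len(shared)}/{len(total)} = {score:.2f}")
```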

False positive problem:

[Figure: Graph(a), Graph(f), and Graph(c), each drawn over the vertices V(·) of an author and its co-authors]

Our graph-based approach:
• Overcome the limitation of existing distance metrics
• Unearth the hidden relationships in contents
• Use Quasi-Clique to measure the strong relations

Using Graphs

Represent entity e1 as graph g1 using common tokens
• Author: co-authors
• Venue: common venues
• Title: common keywords

Superimpose the graph g1 onto base graph B1 to get a final graph representation G1
• Author: entire collaboration graph as B1
• Venue: entire venue similarity graph as B1
• Title: entire token co-occurrence graph as B1

Measure the similarity of two entities e1 and e2 w.r.t. G1 and G2
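
The first two steps can be sketched with plain dictionaries. The transcript does not spell out how superimposition works, so the version below is one plausible reading and should be treated as an assumption: the entity graph keeps its own vertices, and the base graph contributes any edge that connects two of them. The helper names `cooccurrence_graph` and `superimpose`, and the base-graph data, are made up for the example.

```python
from itertools import combinations


def cooccurrence_graph(records):
    """Build an undirected graph whose edges join tokens that share a record."""
    graph = {}
    for tokens in records:
        for t in tokens:
            graph.setdefault(t, set())
        for u, v in combinations(sorted(set(tokens)), 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph


def superimpose(entity_graph, base_graph):
    """Add to the entity graph every base-graph edge between its vertices.

    Assumption: the vertex set stays that of the entity graph, and the
    base graph only contributes edges among those vertices.
    """
    vertices = set(entity_graph)
    merged = {v: set(nbrs) for v, nbrs in entity_graph.items()}
    for v in vertices:
        merged[v] |= base_graph.get(v, set()) & vertices
    return merged


# Co-author lists for some citations of an author "a".
citations_of_a = [["a", "d", "e"], ["a", "e", "k"], ["a", "k", "g"], ["a", "z"]]
g1 = cooccurrence_graph(citations_of_a)

# Hypothetical base collaboration graph built from other authors' citations.
base = cooccurrence_graph([["d", "k"], ["g", "z", "q"]])

G1 = superimpose(g1, base)
print(sorted(G1["d"]))  # d gains the edge to k that only the base graph knows
```

The point of the base graph is visible in the last line: d and k never co-occur in a's own citations, but the wider collaboration graph reveals that they are related.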

Superimposition

Quasi-Clique

Graph G
• V(G): set of vertices
• E(G): set of edges

γ-quasi-complete graph (0 < γ ≤ 1)
• Every vertex in G has degree at least γ·(|V(G)| − 1)

For V(S) ⊆ V(G):
• G(S) is a γ-quasi-clique if V(S) induces a γ-quasi-complete graph
• G(S) is a clique if γ = 1

Use Quasi-Clique (QC) to measure contextual distances
• E.g., function QC(G(a), G(b), γ=0.3, S=3)

[Figure: example graphs G(a) and G(b), each drawn over the vertices V(·) of an author and its co-authors]
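
The quasi-clique definition translates directly into a degree check. The sketch below only tests whether a given vertex set satisfies the definition; it does not search for quasi-cliques, and it is not the QC(…) distance function mentioned on the slide, whose details the transcript does not give. The function name, the `gamma` parameter name, and the example graph are assumptions.

```python
def is_quasi_clique(graph, vertices, gamma):
    """Check whether `vertices` induce a gamma-quasi-complete subgraph.

    graph: dict mapping each vertex to the set of its neighbours.
    Every vertex of the induced subgraph must have degree at least
    gamma * (len(vertices) - 1), with 0 < gamma <= 1.
    """
    if not 0 < gamma <= 1:
        raise ValueError("gamma must be in (0, 1]")
    vertices = set(vertices)
    required = gamma * (len(vertices) - 1)
    for v in vertices:
        # Count only neighbours that lie inside the candidate vertex set.
        degree = len(graph.get(v, set()) & vertices)
        if degree < required:
            return False
    return True


# Hypothetical co-author graph: d, e, k form a triangle; g hangs off k.
graph = {
    "d": {"e", "k"},
    "e": {"d", "k"},
    "k": {"d", "e", "g"},
    "g": {"k"},
}

print(is_quasi_clique(graph, {"d", "e", "k"}, gamma=1.0))       # a clique
print(is_quasi_clique(graph, {"d", "e", "k", "g"}, gamma=0.3))  # loose bound
print(is_quasi_clique(graph, {"d", "e", "k", "g"}, gamma=0.5))  # g has degree 1
```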

Experimental Validation

• JC: Jaccard similarity
• JC+QC: JC + Quasi-Clique
• TI: TF/IDF Cosine similarity
• TI+QC: TI + Quasi-Clique
• IC: IntelliClean (venue hierarchy)
• IC+QC: IC + Quasi-Clique

Precision:

• k results are returned

• r of k are name variants

• precision = r / k

IMDB Synthetic Dataset

Key Idea

Graph is a rich data structure
• Can capture a wealth of information
• But expensive to manipulate

A better data structure than graphs: groups
• Applies when entities have a group of elements
• E.g., authors with citations, images with m x n grids

Popular Group Similarity

Jaccard

$$sim(g_1, g_2) = \frac{|g_1 \cap g_2|}{|g_1 \cup g_2|}$$

Bipartite Matching
• Cardinality
• Weighted

Clustering
• Single vs. Complete vs. Average Link

Intuition for better similarity

Two groups are similar if:
• There is high enough similarity between matching pairs of individual elements that constitute the two groups
• A large fraction of elements in the two groups form matching element pairs

Group similarity

Two groups of elements: g1 = {r11, r12, …, r1m1}, g2 = {r21, r22, …, r2m2}

The group measure BM is the normalized weight of the maximum bipartite matching M in the bipartite graph (N = g1 ∪ g2, E = g1 × g2):

$$BM_{sim,\rho}(g_1, g_2) = \frac{\sum_{(r_{1i}, r_{2j}) \in M} sim(r_{1i}, r_{2j})}{m_1 + m_2 - |M|}$$

such that $sim(r_{1i}, r_{2j}) \ge \rho$

Jaccard:

$$sim(g_1, g_2) = \frac{|g_1 \cap g_2|}{|g_1 \cup g_2|}$$
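
To make the BM measure concrete, here is a brute-force sketch for tiny groups. It tries every one-to-one assignment instead of running a proper matching algorithm, so it is exponential and only meant for illustration. The element-level measure `word_jaccard`, the threshold `rho = 0.3`, and the example titles are assumptions, and the element similarity is assumed to be symmetric.

```python
from itertools import permutations


def word_jaccard(a: str, b: str) -> float:
    """Element-level similarity: Jaccard over the words of two strings."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def bm_similarity(g1, g2, sim, rho):
    """Normalized weight of a maximum-weight bipartite matching.

    Only pairs with sim >= rho may be matched. The score is the total
    matched similarity divided by (|g1| + |g2| - |M|). Brute force over
    all assignments, so only suitable for tiny groups.
    """
    if len(g1) > len(g2):
        g1, g2 = g2, g1  # sim is assumed to be symmetric
    m1, m2 = len(g1), len(g2)
    best_weight, best_size = 0.0, 0
    for perm in permutations(range(m2), m1):
        weights = [sim(g1[i], g2[j]) for i, j in enumerate(perm)]
        kept = [w for w in weights if w >= rho]
        if sum(kept) > best_weight:
            best_weight, best_size = sum(kept), len(kept)
    return best_weight / (m1 + m2 - best_size) if m1 + m2 else 0.0


g1 = ["group linkage", "name disambiguation in digital libraries"]
g2 = ["group linkage", "author name disambiguation", "xml query"]

print(bm_similarity(g1, g2, word_jaccard, rho=0.3))
```

Here the best matching pairs the two identical titles (similarity 1) and the two disambiguation titles (similarity 1/3), giving (1 + 1/3) / (2 + 3 − 2).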

Challenges

Large number of groups to match
• O(NM)

BM uses maximum weight bipartite matching
• Bellman-Ford: O(V²E)
• Hungarian: O(V³)

Solution: Greedy matching

Bipartite matching computation is expensive because of the requirement:
• No node in the bipartite graph can have more than one edge incident on it

Let’s relax this constraint:
• For each element ei in g1, find an element ej in g2 with the highest element-level similarity → S1
• For each element ej in g2, find an element ei in g1 with the highest element-level similarity → S2

Upper/Lower Bounds

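$$UB_{sim,\rho}(g_1, g_2) = \frac{\sum_{(r_{1i}, r_{2j}) \in S_1 \cup S_2} sim(r_{1i}, r_{2j})}{m_1 + m_2 - |S_1 \cup S_2|}$$

$$LB_{sim,\rho}(g_1, g_2) = \frac{\sum_{(r_{1i}, r_{2j}) \in S_1 \cap S_2} sim(r_{1i}, r_{2j})}{m_1 + m_2 - |S_1 \cap S_2|}$$

$$BM_{sim,\rho}(g_1, g_2) = \frac{\sum_{(r_{1i}, r_{2j}) \in M} sim(r_{1i}, r_{2j})}{m_1 + m_2 - |M|}$$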
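
A sketch of the greedy bounds, under the reading that the upper bound sums over S1 ∪ S2 and the lower bound over S1 ∩ S2, and that best pairs below the element-level threshold are ignored (the threshold filter is an assumption). The function name `greedy_bounds` and the similarity table are made up for the example.

```python
def greedy_bounds(g1, g2, sim, rho):
    """Return (LB, UB) for the bipartite-matching group similarity."""
    m1, m2 = len(g1), len(g2)
    # Pairwise element similarities, computed once.
    w = [[sim(a, b) for b in g2] for a in g1]

    s1 = set()  # best partner in g2 for each element of g1
    for i in range(m1):
        if m2:
            j = max(range(m2), key=lambda c: w[i][c])
            if w[i][j] >= rho:
                s1.add((i, j))

    s2 = set()  # best partner in g1 for each element of g2
    for j in range(m2):
        if m1:
            i = max(range(m1), key=lambda r: w[r][j])
            if w[i][j] >= rho:
                s2.add((i, j))

    def score(pairs):
        denom = m1 + m2 - len(pairs)
        return sum(w[i][j] for i, j in pairs) / denom if denom else 0.0

    return score(s1 & s2), score(s1 | s2)


# Hypothetical element-level similarities; unlisted pairs score 0.
SIM = {("p1", "q1"): 1.0, ("p2", "q2"): 0.95, ("p2", "q3"): 0.9}


def table_sim(a, b):
    return SIM.get((a, b), 0.0)


lb, ub = greedy_bounds(["p1", "p2"], ["q1", "q2", "q3"], table_sim, rho=0.8)
print(lb, ub)
```

In this example p2 is the best partner of both q2 and q3, so S2 contains a pair that no one-to-one matching could keep alongside (p2, q2). That extra pair is what pushes the upper bound above the lower bound; each bound needs only one pass over the similarity matrix.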

Theorem & Algorithm

Goal: BM(g, gi) ≥ θ

Theorem:

$$LB_{sim,\rho}(g_1, g_2) \le BM_{sim,\rho}(g_1, g_2) \le UB_{sim,\rho}(g_1, g_2)$$

Algorithm:
• IF UB(g1,g2) < θ → BM(g1,g2) < θ → g1 ≠ g2
• ELSE IF LB(g1,g2) ≥ θ → BM(g1,g2) ≥ θ → g1 ≈ g2
• ELSE, compute BM(g1,g2) (this step is expensive)
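
The filtering logic can be sketched independently of how the three measures are computed. In the sketch below `ub`, `lb`, and `bm` are passed in as callables; the function name `groups_match`, the returned step labels, and the stub scores are all assumptions used to show that the expensive measure is called only for the undecided pair.

```python
def groups_match(g1, g2, theta, ub, lb, bm):
    """Decide whether BM(g1, g2) >= theta, calling bm only when needed.

    ub, lb, bm are callables taking (g1, g2); lb <= bm <= ub must hold.
    Returns (decision, step), where step names the rule that fired.
    """
    if ub(g1, g2) < theta:
        return False, "pruned by UB"
    if lb(g1, g2) >= theta:
        return True, "accepted by LB"
    return bm(g1, g2) >= theta, "computed BM"


# Hypothetical precomputed (LB, BM, UB) values for three candidate groups.
scores = {
    ("g", "h1"): (0.1, 0.2, 0.3),
    ("g", "h2"): (0.7, 0.8, 0.9),
    ("g", "h3"): (0.4, 0.65, 0.9),
}
bm_calls = []


def lb(a, b):
    return scores[(a, b)][0]


def bm(a, b):
    bm_calls.append((a, b))  # record every expensive call
    return scores[(a, b)][1]


def ub(a, b):
    return scores[(a, b)][2]


for candidate in ("h1", "h2", "h3"):
    print(candidate, groups_match("g", candidate, 0.6, ub, lb, bm))
print("expensive calls:", bm_calls)
```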

Experiment
• Left: 300 groups
• Right: 700,000 groups

Experiment
• Left: 100 groups
• Right: 700,000 groups

UB(10)|BM(k)

Key Idea

When entities have a wealth of information, we can exploit it by capturing it as either Graphs or Groups

But what can we do when entities do not have a wealth of information, or have only noisy information?

Ask people what they think

Hypothesis

Use the Web as the collective knowledge of people

Hypothesis:

If an entity e1 is a duplicate of another entity e2, and if e1 frequently appears together with information I on the Web, then e2 may appear frequently with I on the Web, too.

Eg. ACM DL Case

Search results from Google:
• “Jeffrey D. Ullman”: 384,000 pages; “Jeffrey D. Ullman” + “aho”: 174,000 pages (45%)
• “J. Ullman”: 124,000 pages; “J. Ullman” + “aho”: 41,000 pages (33%)
• “Shimon Ullman”: 27,300 pages; “Shimon Ullman” + “aho”: 66 pages (0%)
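
The percentages on this slide are the share of a name's pages that also mention “aho”. A short sketch reproduces them from the page counts quoted above; the function name `cooccurrence_ratio` is an illustrative choice.

```python
# (pages for the name alone, pages for the name together with "aho")
page_counts = {
    "Jeffrey D. Ullman": (384_000, 174_000),
    "J. Ullman": (124_000, 41_000),
    "Shimon Ullman": (27_300, 66),
}


def cooccurrence_ratio(name_hits: int, joint_hits: int) -> float:
    """Fraction of the name's pages that also contain the extra term."""
    return joint_hits / name_hits if name_hits else 0.0


for name, (alone, with_aho) in page_counts.items():
    print(f"{name}: {cooccurrence_ratio(alone, with_aho):.0%}")
```

The two spellings of the same person land at 45% and 33%, while the different person with the same surname lands near 0%, which is the signal the hypothesis relies on.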

Googled Name Linkage

[Figure: pipeline. A data piece is selected by f(x) from the contents (V1, V2, V3, …) of “John Doe” and “J. Doe”. Query I and Query II send each name to the Web, and the two sets of search results (e.g., 78 pages vs. 91 pages, with their URLs) are compared with dist(…).]

Step 1. Select representative data

What to select
• A single token: “aho”
• A key phrase: “stanford professor”
• A sentence or more?

How to select
• tf, tf*idf, latent topic models, …

How many to select
• 1, 2, … n

Where to select from?
• Contents of canonical entity, variant, both

Step 2. Acquire the collective knowledge

How to form the query?

Single information “I” (the most important data piece)
• “J. D. Ullman” AND “Aho”

Multiple information “I1”, “I2”, “I3”, … (the k most important data pieces)
• Conjunction or Disjunction or Hybrid
• “J. D. Ullman” AND “Aho” AND/OR “database” AND/OR “vldb”…

Formal evaluation of the effectiveness of such variations

Different heuristics based on
• Availability, discriminative power of the data content
• Popularity of the name, variants, other candidates
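
A small sketch of the conjunctive and disjunctive query forms. The helper `build_query`, its `mode` parameter, and the exact quoting and AND/OR keywords are assumptions about how a query string might be assembled; real search engines differ in the operators they accept.

```python
def build_query(name, pieces, mode="and"):
    """Compose a search query from a name and its k data pieces.

    mode="and" requires every piece; mode="or" requires any piece.
    """
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    quoted = [f'"{p}"' for p in pieces]
    if not quoted:
        return f'"{name}"'
    if mode == "and" or len(quoted) == 1:
        return " AND ".join([f'"{name}"'] + quoted)
    return f'"{name}" AND (' + " OR ".join(quoted) + ")"


print(build_query("J. D. Ullman", ["Aho"]))
print(build_query("J. D. Ullman", ["Aho", "database", "vldb"], mode="and"))
print(build_query("J. D. Ullman", ["Aho", "database", "vldb"], mode="or"))
```

The conjunctive form is more precise but may return few pages for rare names, while the disjunctive form trades precision for coverage; a hybrid would mix the two.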

Step 3. Interpret the collective knowledge

For entities ec, ei and information tc:

Page Count
• Jeffrey D. Ullman vs. J. Ullman = 1/(174,000 - 41,000)
• Jeffrey D. Ullman vs. Shimon Ullman = 1/(174,000 - 66)

URLs
• Jeffrey D. Ullman vs. J. Ullman: portal.acm.org, infolab.stanford.edu, en.wikipedia.org, theory.lcs.mit.edu = 4/16
• Jeffrey D. Ullman vs. Shimon Ullman: portal.acm.org = 1/19

Web Page Contents
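
The arithmetic on this slide suggests two simple measures: the inverse of the difference between page counts, and the overlap between the host names of the returned URLs (4/16 is consistent with 4 shared hosts among two top-10 lists). The sketch below follows that reading, which is an interpretation rather than a confirmed definition. The function names are illustrative, and the `example.org` hosts are placeholders for the non-shared results, which the transcript does not list.

```python
from urllib.parse import urlparse


def page_count_similarity(count_c: int, count_i: int) -> float:
    """1 / |count_c - count_i|; equal counts are treated as infinitely similar."""
    diff = abs(count_c - count_i)
    return float("inf") if diff == 0 else 1.0 / diff


def host_overlap(urls_c, urls_i) -> float:
    """Jaccard overlap between the host names of two URL lists."""
    hosts_c = {urlparse(u).netloc for u in urls_c}
    hosts_i = {urlparse(u).netloc for u in urls_i}
    union = hosts_c | hosts_i
    return len(hosts_c & hosts_i) / len(union) if union else 0.0


# Page counts of each name queried together with "aho".
print(page_count_similarity(174_000, 41_000) > page_count_similarity(174_000, 66))

shared = ["portal.acm.org", "infolab.stanford.edu",
          "en.wikipedia.org", "theory.lcs.mit.edu"]
# Placeholder hosts standing in for the six non-shared results of each name.
urls_c = [f"http://{h}/" for h in shared + [f"c{n}.example.org" for n in range(6)]]
urls_i = [f"http://{h}/" for h in shared + [f"i{n}.example.org" for n in range(6)]]
print(host_overlap(urls_c, urls_i))  # 4 shared hosts out of 16 distinct ones
```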

Step 3. Interpret the collective knowledge

Web Page Contents
• Use top-k returned Web pages for each entity
• Two alternatives for sim(ec, ei):
  • Group distance between two sets of top-k web pages
  • Represent each set by a single Virtual Document, then apply document comparison metrics on the Virtual Documents

[Table: heuristics for creating Virtual Documents]

Step 3. Interpret the collective knowledge

Web Page Contents
• sim(ec, ei) = doc_sim( vdoc(ec), vdoc(ei) )

[Table: document similarity metrics]
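
One way to realize sim(ec, ei) = doc_sim(vdoc(ec), vdoc(ei)) is to concatenate each entity's top-k pages into a single bag of words and compare the bags with cosine similarity. The transcript does not preserve which heuristics and metrics the talk used, so plain concatenation, raw term frequencies, and cosine are assumptions here, as are the function names and the page snippets.

```python
import math
import re
from collections import Counter


def virtual_document(pages):
    """Merge the text of the top-k pages into one term-frequency bag."""
    bag = Counter()
    for text in pages:
        bag.update(re.findall(r"[a-z0-9]+", text.lower()))
    return bag


def cosine(bag1, bag2):
    """Cosine similarity between two term-frequency bags."""
    dot = sum(bag1[t] * bag2[t] for t in bag1.keys() & bag2.keys())
    norm1 = math.sqrt(sum(v * v for v in bag1.values()))
    norm2 = math.sqrt(sum(v * v for v in bag2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def sim(pages_c, pages_i):
    """doc_sim over the two virtual documents."""
    return cosine(virtual_document(pages_c), virtual_document(pages_i))


# Made-up page snippets for three names.
pages_a = ["database systems compilers"]
pages_b = ["compilers and database theory"]
pages_c = ["visual perception"]

print(sim(pages_a, pages_b), sim(pages_a, pages_c))
```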

Results with URL and Host

ACM data set:
• 43 authors, 14.2 citations/author
• 21 candidates/block, 3.1 citations/candidate
• 1.8 name variants/block, 6.7 citations/variant

[Figure: bar chart of recall (0.0 to 0.7) by attribute (coauthors, venues) for jaccard, googling-1, googling-2, and googling host10; annotations: 18%, 55%]

Recall:
• k duplicates
• k results are returned
• r are correct name variants
• recall = r / k

metric          | token selection | comparison method
googling-1      | tf              | page count
googling-2      | tf*idf          | page count
googling host10 | tf*idf          | top 10 host names
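
The recall measure used in these results can be written out as a short function. The sketch assumes that candidates arrive already ranked by similarity to the canonical name and that k is the number of true variants in the block; the function name and the example ranking are made up.

```python
def recall_at_k(ranked_candidates, true_variants):
    """Return r / k: the share of the k true variants found in the top k."""
    k = len(true_variants)
    if k == 0:
        return 0.0
    top_k = ranked_candidates[:k]
    r = sum(1 for c in top_k if c in true_variants)
    return r / k


# Hypothetical ranking for the canonical name "Jeffrey D. Ullman".
ranked = ["J. Ullman", "Shimon Ullman", "J. D. Ullman", "S. Ullman"]
variants = {"J. Ullman", "J. D. Ullman"}

print(recall_at_k(ranked, variants))
```

With two true variants, only the top two results count; one of them is correct, so the recall is 1/2.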

Results with Web Pages

ACM data set:
• 43 authors, 14.2 citations/author
• 21 candidates/block, 3.1 citations/candidate
• 1.8 name variants/block, 6.7 citations/variant

[Figure: results chart; annotations: 29%, 84%]

Results with Web Pages (cont)

IMDB data set:
• 50 actors, 24 titles/entity
• 20 candidates/block, 24 titles/candidate
• 1 name variant/block, 23.5 titles/variant

[Figure: results chart; annotation: 193% improvement]

Scalability

Not scalable:
• A large number of Web accesses
• Network traffic, load on search engines and web sites

Solution: a local snapshot of the Web
• Stanford WebBase Project
• ~100 million web pages from >50,000 sites, including many .edu domains
• Downloaded half of the data and filtered it
• Local snapshot containing 3.5 million relevant pages

Conclusion

Name-related problems are common

Three disambiguation techniques
• By Graphs
• By Groups
• By Googling

These help when entities
• Have structures to exploit, or
• Lack useful information

Conclusion

More research needed
• Inputs from AI, NLP, DB, DL

Task #13: Web People Search Task

http://nlp.uned.es/weps/

Conclusion

Thank You !

http://pike.psu.edu/