Similarity in Wikipedia Articles (EDBT Summer School)

Badenes, Carlos (cbadenes); Garijo, Daniel (dgarijo); Priyatna, Freddy (fpriyatna); {*}@fi.upm.es
EDBT Summer School 2015


Page 1: Similarity in Wikipedia Articles (EDBT Summer School)


Page 2: Similarity in Wikipedia Articles (EDBT Summer School)

Problem

Similarity between Wikipedia Articles

Wikipedia Article: text, links, categories

Page 3: Similarity in Wikipedia Articles (EDBT Summer School)

Hypothesis

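A Wikipedia article has three components (text, links, categories), each with its own similarity measure (simTxt, simLinks, simCtg). The overall similarity is their weighted sum:

$$\mathrm{simWA}(R_1,R_2) = \alpha\cdot\mathrm{simTxt}(R_1,R_2) + \beta\cdot\mathrm{simLinks}(R_1,R_2) + \gamma\cdot\mathrm{simCtg}(R_1,R_2)$$

where $\alpha + \beta + \gamma = 1$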
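The weighted sum can be sketched in a few lines of Python. This is only an illustration: the function name `sim_wa` is hypothetical, and the default weights (0.6 for text, 0.2 each for links and categories) are the values used in the proof of concept, not a recommendation.

```python
def sim_wa(sim_txt, sim_links, sim_ctg, alpha=0.6, beta=0.2, gamma=0.2):
    """Combine the three component similarities into one score.

    alpha weights the text similarity, beta the link similarity and
    gamma the category similarity; the weights must sum to 1.
    """
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return alpha * sim_txt + beta * sim_links + gamma * sim_ctg


# Component scores taken from one article pair in the proof of concept.
score = sim_wa(sim_txt=0.980, sim_links=0.175, sim_ctg=0.217)
print(round(score, 3))  # 0.666
```

With these inputs the result matches the combined score 0.666 reported for that pair.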

Page 4: Similarity in Wikipedia Articles (EDBT Summer School)

Similarity based on Text

Latent Dirichlet Allocation (LDA) describes each article as a vector of topic weights over TOPIC_1, TOPIC_2, ..., TOPIC_n:

Ri: p = [0.5, 0.3, .., 0.7]
Rj: q = [0.2, 0.4, .., 0.9]
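The slide does not say how the topic vectors p and q of two articles Ri and Rj are compared. As a minimal sketch, the code below assumes cosine similarity; other measures (for example one based on Jensen-Shannon divergence) would fit equally well. The function name and the three-topic example vectors are hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two topic-weight vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of topics")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an all-zero vector shares no topic mass with anything
    return dot / (norm_p * norm_q)


# Hypothetical topic weights for two articles over three topics.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.4, 0.4]
print(cosine_similarity(p, q))
```

For non-negative topic weights the result lies between 0 (no shared topics) and 1 (identical topic profile).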

Page 5: Similarity in Wikipedia Articles (EDBT Summer School)

Similarity based on Categories


Articles with multiple common categories are likely to be similar

Noise filtering is necessary (e.g., “All articles lacking in-text citations”). See https://github.com/cbadenes/siminwikart-challenge4/blob/master/category/wikipedia_bad_categories.txt
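The slides do not give a formula for the category similarity. The sketch below assumes a Jaccard-style overlap computed after removing noisy maintenance categories. The one-entry blacklist stands in for the linked file, and the function name and example category names are hypothetical.

```python
# Stand-in for the blacklist file; only the example named above is included.
NOISY_CATEGORIES = {"All articles lacking in-text citations"}


def category_similarity(cats_a, cats_b, noisy=NOISY_CATEGORIES):
    """Jaccard overlap of two category sets after dropping noisy categories."""
    a = set(cats_a) - noisy
    b = set(cats_b) - noisy
    union = a | b
    if not union:
        return 0.0  # no informative categories on either side
    return len(a & b) / len(union)


# Hypothetical category sets for two articles.
article_a = {"Living people", "Spanish footballers",
             "All articles lacking in-text citations"}
article_b = {"Living people", "Formula One drivers",
             "All articles lacking in-text citations"}
print(category_similarity(article_a, article_b))
```

Without the filter the shared maintenance category would count as common ground and inflate the score from 1/3 to 1/2.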

Page 6: Similarity in Wikipedia Articles (EDBT Summer School)

Similarity based on Links

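$$\mathrm{Sim}(A,B) = \frac{|\mathrm{links}(A) \cap \mathrm{links}(B)|}{\left(|\mathrm{links}(A)| + |\mathrm{links}(B)|\right)/2}$$

Example: 2 shared links between articles with 5 and 3 links gives 2/((5+3)/2) = 0.5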

Articles with multiple common links are likely to be similar
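A minimal sketch of the link similarity, reading the worked example 2/((5+3)/2) as the number of shared links divided by the average number of links per article. The function name and the link identifiers are hypothetical.

```python
def link_similarity(links_a, links_b):
    """Shared links divided by the average link count of the two articles."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0  # neither article links to anything
    return len(a & b) / ((len(a) + len(b)) / 2)


# Two shared links; the articles have 5 and 3 links respectively.
links_a = {"l1", "l2", "l3", "l4", "l5"}
links_b = {"l1", "l2", "l6"}
print(link_similarity(links_a, links_b))  # 0.5
```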

Page 7: Similarity in Wikipedia Articles (EDBT Summer School)

Proof of Concept

Articles compared: Fernando Alonso, Lionel Messi, Iker Casillas, Princess Akiko

Weights: α = 0.6 (simTxt), β = 0.2 (simLinks), γ = 0.2 (simCtg)

Combined similarity per article pair (in slide order):

[1]0.062 [3]0.075
[1]0.666 [3]0.683
[1]0.058 [3]0.069
[1]0.043 [3]0.072
[1]0.019 [3]0.023
[1]0.068 [3]0.069

Component similarities per article pair (in slide order):

simTxt = 0.059, simLinks = 0.019, simCtg = [1]0.117 [3]0.181
simTxt = 0.065, simLinks = 0.0, simCtg = [1]0.095 [3]0.161
simTxt = 0.052, simLinks = 0.019, simCtg = [1]0.166 [3]0.172
simTxt = 0.980, simLinks = 0.175, simCtg = [1]0.217 [3]0.302
simTxt = 0.060, simLinks = 0.008, simCtg = [1]0.030 [3]0.172
simTxt = 0.069, simLinks = 0.004, simCtg = [1]0.080 [3]0.134

Page 8: Similarity in Wikipedia Articles (EDBT Summer School)

Comparison

Lionel Messi vs. Princess Akiko

simTxt = 0.060 -> <common words>
simLinks = 0.008 -> (England, Buenos_Aires, Chile, Madrid, Argentina)
simCtg = [1]0.030 -> living_person

Page 9: Similarity in Wikipedia Articles (EDBT Summer School)

Proposal

[Figure: "Graph based on Links" next to "Graph based on Similarities"; edge weights shown: 0.48, 0.61, 0.41, 0.29, 0.73, 0.81, 0.77, 0.53, 0.67, 0.33, 0.88]
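One possible reading of the proposal is a graph whose edges carry pairwise similarity scores instead of hyperlinks. The sketch below builds such a graph as a nested dictionary. The function name, the `threshold` parameter, the article labels A, B, C and their scores are all hypothetical and only serve as an illustration.

```python
from collections import defaultdict


def build_similarity_graph(pair_scores, threshold=0.0):
    """Build an undirected weighted graph from pairwise similarity scores.

    pair_scores maps (article_a, article_b) to a similarity score.
    Only pairs scoring strictly above the threshold become edges.
    """
    graph = defaultdict(dict)
    for (a, b), score in pair_scores.items():
        if score > threshold:
            graph[a][b] = score
            graph[b][a] = score
    return dict(graph)


# Placeholder scores for three articles.
scores = {("A", "B"): 0.48, ("A", "C"): 0.61, ("B", "C"): 0.29}
graph = build_similarity_graph(scores, threshold=0.3)
print(graph)
```

Raising the threshold keeps only the strongest relations, which is one way to keep such a graph sparse.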

Page 10: Similarity in Wikipedia Articles (EDBT Summer School)

Problem

Reliability of Wikipedia links (missing links)

Wikipedia Article: text, links, categories

Page 11: Similarity in Wikipedia Articles (EDBT Summer School)

Further Refinement


Similarities between categories (as topics) can define relations between articles

[Figure: "Graph based on Links" + "Graph based on Similarities"; edge weights shown: 0.48, 0.61, 0.41, 0.29, 0.73, 0.81, 0.77, 0.53, 0.67, 0.33, 0.88]

Subgraph Pattern Matching + Topic Model

Page 12: Similarity in Wikipedia Articles (EDBT Summer School)

Code


https://github.com/cbadenes/siminwikart-challenge4