PROBABILISTIC KNOWLEDGE GRAPH CONSTRUCTION: COMPOSITIONAL AND INCREMENTAL APPROACHES
Dongwoo Kim with Lexing Xie and Cheng Soon Ong, CIKM 2016

Transcript of "PROBABILISTIC KNOWLEDGE GRAPH CONSTRUCTION: COMPOSITIONAL AND INCREMENTAL APPROACHES"

Page 1 (source: cm.cecs.anu.edu.au/documents/kim_cikm16_slides.pdf)

PROBABILISTIC KNOWLEDGE GRAPH CONSTRUCTION: COMPOSITIONAL AND INCREMENTAL APPROACHES

Dongwoo Kim with Lexing Xie and Cheng Soon Ong, CIKM 2016

1

Page 2

WHAT IS A KNOWLEDGE GRAPH (KG)?

• KGs are widely used in various tasks such as search and Q&A

• Triple (a basic unit of KG):

- Consists of a source entity, a target entity, and a relation

- e.g. <DiCaprio, starred in, The Revenant>

2

Page 3

INCOMPLETE KNOWLEDGE GRAPH

• KGs are constructed from external sources

- Wikipedia, Web pages, …

• Many existing KGs are incomplete

- 93.8% of persons in Freebase have no place of birth (Min et al., 2013)

- Such missing facts are hard to obtain from external sources

• We need an alternative way to complete the KG

3

Page 4

IN THIS TALK, I WILL DISCUSS

• Knowledge graph completion task

- where the goal is to identify whether a triple is positive or negative

• Specifically, two approaches to knowledge base completion:

1. Predict unknown triples based on observed triples

2. Manually label unknown triples with human experts

• These two approaches are complementary

4

Page 5

1. UNKNOWN TRIPLE PREDICTION

• Unknown triples may be inferred from existing triples

- <DiCaprio, starred in, The Revenant> → <DiCaprio, job, actor>

• Statistical relational model

- Goal: predict unknown triples from known triples

- Find latent semantics of entities and relations

5

Page 6

TENSOR REPRESENTATION OF KG

• A KG can be represented as a 3-dimensional binary tensor

• or as a collection of matrices, each of which represents one relation

[Embedded figure: first page of Nickel et al. (2011), "A Three-Way Model for Collective Learning on Multi-Relational Data". Dyadic relational data is modelled as a three-way tensor X of size n × n × m (n entities, m relations), where an entry x_ijk = 1 denotes that the relation (i-th entity, k-th relation, j-th entity) exists, and 0 denotes a non-existing or unknown relation. Each frontal slice X_k is factorised as X_k ≈ A R_k A^⊤ for k = 1, …, m, with A an n × r matrix of latent entity representations and R_k an asymmetric r × r matrix modelling the interactions of the k-th relation.]

x_ijk = 1: i = DiCaprio, k = starred in, j = The Revenant
x_ij'k = 0: i = DiCaprio, k = starred in, j' = Star Trek

• Example: <DiCaprio, starred in, The Revenant> is a positive triple; <DiCaprio, starred in, Star Trek> is not
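The tensor view above can be sketched in a few lines of numpy (a toy illustration; the index assignments here are made up for the example, not taken from the slides):

```python
import numpy as np

# Hypothetical index assignments for the toy example
entities = {"DiCaprio": 0, "The Revenant": 1, "Star Trek": 2}
relations = {"starred in": 0}

n, m = len(entities), len(relations)
# x[i, j, k] = 1 iff (i-th entity, k-th relation, j-th entity) holds
X = np.zeros((n, n, m), dtype=np.int8)

# Observed triple: <DiCaprio, starred in, The Revenant>
X[entities["DiCaprio"], entities["The Revenant"], relations["starred in"]] = 1

# <DiCaprio, starred in, Star Trek> stays 0 (non-existing or unknown)
```

Unobserved entries default to 0, matching the convention that zero marks both negative and unknown triples.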

6

Page 7

BILINEAR TENSOR FACTORISATION (RESCAL) (NICKEL ET AL., 2011)

• Latent space model

- Represents an entity as a D-dimensional vector

- Represents a relation as a D × D matrix

• Performs better on knowledge graphs than general-purpose tensor factorisation methods (Dong et al., 2014)

7

Page 8

LATENT REPRESENTATION OF ENTITIES AND RELATIONS

e_D = [0.8, 0.1]^⊤ (Leonardo DiCaprio)
e_B = [0.1, 0.2]^⊤ (Yoshua Bengio)
e_O = [0.2, 0.8]^⊤ (Oscar)
R_RA = [[0.1, 0.9], [0.1, 0.1]] (Receive Award relation)

x_{B,RA,O} ≈ e_B^⊤ R_RA e_O for <Bengio, Receive, Oscar>
x_{D,RA,O} ≈ e_D^⊤ R_RA e_O for <DiCaprio, Receive, Oscar>

8

Page 9

LATENT REPRESENTATION OF ENTITIES AND RELATIONS

(same latent values: e_D = [0.8, 0.1]^⊤, e_B = [0.1, 0.2]^⊤, e_O = [0.2, 0.8]^⊤, R_RA = [[0.1, 0.9], [0.1, 0.1]])

e_D^⊤ R_RA e_O = 0.60 for <DiCaprio, Receive, Oscar>
e_B^⊤ R_RA e_O = 0.16 for <Bengio, Receive, Oscar>

→ DiCaprio is more likely to get the Oscar award!

9
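The bilinear score e_s^⊤ R e_t can be computed directly with the slide's toy numbers (a sketch; only the relative ordering of the two scores matters here, and the slide's printed values may reflect different rounding):

```python
import numpy as np

e_D = np.array([0.8, 0.1])        # Leonardo DiCaprio
e_B = np.array([0.1, 0.2])        # Yoshua Bengio
e_O = np.array([0.2, 0.8])        # Oscar
R_RA = np.array([[0.1, 0.9],
                 [0.1, 0.1]])     # Receive Award relation

def score(e_s, R, e_t):
    """Bilinear score of the triple <source, relation, target>."""
    return e_s @ R @ e_t

# DiCaprio scores higher than Bengio for receiving an Oscar
assert score(e_D, R_RA, e_O) > score(e_B, R_RA, e_O)
```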

Page 10

GOAL OF STATISTICAL RELATIONAL MODELS

[Figure: the same entities (Leonardo DiCaprio, Yoshua Bengio, Oscar) and the Receive Award relation, but with all latent values unknown: e_D = ?, e_B = ?, R_RA = ?, e_O = ?]

→ The goal of a statistical relational model is to infer the latent vectors & matrices from known triples

→ Predict new positive triples from the inferred latent space

10

Page 11

PROBABILISTIC BILINEAR (COMPOSITIONAL) MODEL

• We propose probabilistic RESCAL (PRESCAL)

• Place probability distributions over entity vectors and relation matrices

• Place a probability distribution over triples

• Measure how likely a triple is to be positive

e_i ∼ N_D(0, σ_e² I)    R_k ∼ N_{D²}(0, σ_R² I)    x_ijk ∼ N(e_i^⊤ R_k e_j, σ_x²)
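The generative process can be simulated in a few lines (a toy sketch; the dimensions and variance values below are arbitrary choices for illustration, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, D = 10, 3, 2                       # entities, relations, latent dimension
sigma_e, sigma_R, sigma_x = 1.0, 1.0, 0.1

# e_i ~ N_D(0, sigma_e^2 I): one D-dimensional vector per entity
E = rng.normal(0.0, sigma_e, size=(n, D))
# R_k ~ N_{D^2}(0, sigma_R^2 I): one D x D matrix per relation
R = rng.normal(0.0, sigma_R, size=(m, D, D))
# x_ijk ~ N(e_i^T R_k e_j, sigma_x^2): Gaussian noise around the bilinear score
mean = np.einsum("id,kde,je->ijk", E, R, E)
X = mean + rng.normal(0.0, sigma_x, size=(n, n, m))
```

The einsum computes e_i^⊤ R_k e_j for every (i, j, k) at once, giving the mean tensor of the likelihood.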

11

Page 12

[Figure: path Leo → (starred in) → Revenant → (directed by) → Alejandro; each edge is a triple, and the two-step path forms a compositional triple]

KNOWLEDGE COMPOSITION

• Another limitation of RESCAL: the model ignores path information in the KG during factorisation

• e.g. the factorisation ignores the compositional relation between Leonardo DiCaprio and Alejandro G. Iñárritu

12

Page 13

KG AUGMENTED WITH COMPOSITIONAL TRIPLES

• We propose a probabilistic compositional RESCAL

• Augment the data with additional compositional triples

• Explicitly include path information in the factorisation

[Figure: the original tensor augmented with extra slices of compositional triples, illustrated with the path Leo → (starred in) → Revenant → (directed by) → Alejandro]
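Augmenting a triple list with length-2 compositional triples can be sketched as follows (the helper and its output format, a pair of relations as a composite label, are illustrative assumptions, not the paper's exact construction):

```python
# Hypothetical sketch: derive length-2 compositional triples from a triple list.
def compose(triples):
    # Index triples by source entity for fast path lookup
    by_source = {}
    for s, r, t in triples:
        by_source.setdefault(s, []).append((r, t))
    # For each triple <s, r1, t>, follow any outgoing edge <t, r2, u>
    comp = []
    for s, r1, t in triples:
        for r2, u in by_source.get(t, []):
            comp.append((s, (r1, r2), u))
    return comp

triples = [("Leo", "starred in", "Revenant"),
           ("Revenant", "directed by", "Alejandro")]
```

Composing the slide's example yields <Leo, (starred in, directed by), Alejandro>; these derived triples become the extra slices of the augmented tensor.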

13

Page 14

TENSOR FACTORISATION ON REAL DATASETS

• Split each dataset into training / testing sets; measure predictive performance on the test set given the training set

• Examples:
- KINSHIP: <person A, father of, person B>
- UMLS: <disease E, affects, function F>
- NATION: <nation C, supports, nation D>

Datasets:

          # Relations   # Entities   # Positive Triples   Sparsity
KINSHIP   26            104          10,790               0.038
UMLS      49            135          6,752                0.008
NATION    56            14           2,024                0.184

14

Page 15

PREDICTION RESULT

• A higher AUC indicates better predictive performance

• The probabilistic models outperform the non-probabilistic model

• The compositional model performs best on the UMLS and NATION datasets

[Figure: AUC score vs. training set proportion on the NATION, UMLS, and KINSHIP datasets, comparing RESCAL, PRESCAL, and COMPOSITIONAL models.]

15

Page 16

2. COMPLETION WITH HUMAN EXPERTS

• Some domains may not have enough information to construct a KG (e.g. bioinformatics, medical data)

- Without enough known triples, RESCAL-style models do not work well

• Each unknown entry of the KG may be uncovered by domain experts

- Uncovering an entry consumes part of a limited budget

- There are too many triples to evaluate exhaustively, e.g. UMLS has 900,000 candidate triples

• Goal: choose which triple to evaluate next

- to maximise the number of positive triples found within the limited budget

16

Page 17

GREEDY INCREMENTAL COMPLETION WITH STATISTICAL MODEL

• Use the predictions of the bilinear model

• Greedy approach for incremental completion:

1. Infer the latent semantics of the KG given the observed triples

2. Query the unknown triple with the highest expected value under the current approximation:

argmax_{(i,j,k)} E[e_i^⊤ R_k e_j]

3. Update the latent semantics based on the queried result

4. Repeat 2-3 until the budget limit is reached
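The loop above can be sketched as follows (a toy simulation; `expected_score` is a hypothetical stand-in for posterior inference, and the expert is simulated by a ground-truth tensor):

```python
import numpy as np

def greedy_completion(X_true, X_obs, budget, expected_score):
    """Steps 1-4 above: repeatedly query the unknown triple with the
    highest expected value, observe its true label, accumulate gain."""
    known = ~np.isnan(X_obs)
    gain = 0
    for _ in range(budget):
        scores = expected_score(X_obs).astype(float)  # step 1: E[e_i^T R_k e_j] proxy
        scores[known] = -np.inf                       # never re-query known entries
        i, j, k = np.unravel_index(int(np.argmax(scores)), scores.shape)
        X_obs[i, j, k] = X_true[i, j, k]              # steps 2-3: query expert, observe
        known[i, j, k] = True
        gain += int(X_true[i, j, k])                  # cumulative gain: positives found
    return gain
```

With a real model, `expected_score` would come from the current PRESCAL posterior; the structure of the loop is the same.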

17

Page 18

EXPLOITATION-EXPLORATION

• The greedy approach always queries the maximum-expectation triple: it may not explore the latent space efficiently

• We employ the randomised probability matching (RPM) algorithm (Thompson, 1933; a.k.a. Thompson sampling)

- Optimally balances exploitation and exploration

- Instead of taking the triple maximising argmax_{(i,j,k)} E[e_i^⊤ R_k e_j], it accounts for the uncertainty of unknown triples

18

Page 19

T1 = <DiCaprio, starred in, The Revenant>

T2 = <Bengio, receive, Oscar>

T3 = <DiCaprio, profession, Actor>

RANDOMISED PROBABILITY MATCHING

E[T1] = 0.9, E[T2] = 1.5, E[T3] = 0.2

• Although T2 has the highest expectation, its variance is also high

• If we take this uncertainty into account, it might be better to choose T1

- Choose the triple to be labelled based on a random sample from each triple's distribution
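Randomised probability matching can be sketched with the slide's three triples, treating each expected value as the mean of a Gaussian posterior (the standard deviations below are made-up numbers for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Posterior mean and std for each candidate triple (stds are hypothetical)
triples = ["T1", "T2", "T3"]
mean = np.array([0.9, 1.5, 0.2])
std  = np.array([0.1, 2.0, 0.1])   # T2 has a high mean but also high variance

# Greedy: always pick the highest mean, i.e. T2 every time
greedy_choice = int(np.argmax(mean))

# Thompson sampling: draw one sample per triple, pick the argmax of the draws;
# the uncertain T2 is no longer chosen on every round
draws = rng.normal(mean, std)
ts_choice = int(np.argmax(draws))
```

Over many rounds, Thompson sampling splits its queries between T1 and T2 in proportion to how likely each is to be the best, which is exactly the exploitation-exploration balance described above.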

19

Page 20

EXPERIMENT: INCREMENTAL COMPLETION

• Comparison with two greedy algorithms (Kajino et al., 2015)

- AMDC-POP: greedy population model, always queries the highest-expectation triple

- AMDC-PRED: greedy prediction model, always queries the most ambiguous triple (expected value around 0.5)

• Performance metrics

- Cumulative gain: total number of positive triples found up to time t

- AUC: predictive performance of the model (as in the first task)

20

Page 21

COMPARISON WITH GREEDY APPROACHES

[Figure: cumulative gain and ROC-AUC score vs. number of queries on the NATION, KINSHIP, and UMLS datasets, comparing PNORMAL-TS with AMDC-POP and AMDC-PRED.]

21

Page 22

SUMMARY

• The knowledge graph completion task can be divided into unknown triple prediction and incremental completion

• We develop probabilistic RESCAL and incorporate compositional structure into the factorisation

• In the prediction task, compositional triples help to predict unknown triples

• In the construction task, randomised probability matching efficiently uncovers the latent semantics of the KG

22

Page 23

23

Page 24

INTUITION BEHIND BILINEAR MODEL

e_D = [0.8, 0.1]^⊤ (Leonardo DiCaprio), e_B = [0.1, 0.2]^⊤ (Yoshua Bengio): the first latent dimension acts as an indicator of a good actor

e_O = [0.2, 0.8]^⊤ (Oscar): the second latent dimension acts as an indicator of a prestigious award

R_RA = [[0.1, 0.9], [0.1, 0.1]] (Receive Award): the large entry captures the interaction between good actors and prestigious awards in this relation

→ The bilinear model infers latent semantics of entities and relations

24

Page 25

COMPARISON WITH COMPOSITIONAL MODELS

[Figure: cumulative gain and ROC-AUC score vs. number of queries on the NATION, KINSHIP, and UMLS datasets, comparing PRESCAL-TS with COMPOSITIONAL-TS.]

25

Page 26

POSTERIOR VARIANCE ANALYSIS

• The posterior variance of the compositional model shrinks faster than that of the non-compositional model

[Figure: posterior variance over queries on the KINSHIP and UMLS datasets.]

26