PROBABILISTIC KNOWLEDGE GRAPH CONSTRUCTION: COMPOSITIONAL AND INCREMENTAL APPROACHES
Dongwoo Kim, with Lexing Xie and Cheng Soon Ong. CIKM 2016
1
WHAT IS A KNOWLEDGE GRAPH (KG)?
• KGs are widely used in various tasks such as search and Q&A
• Triple (the basic unit of a KG):
- Consists of a source entity, a target entity, and a relation
- e.g. <DiCaprio, starred in, The Revenant>
2
INCOMPLETE KNOWLEDGE GRAPH
• KGs are constructed from external sources
- Wikipedia, web pages…
• Many existing KGs are incomplete
- 93.8% of persons in Freebase have no place of birth (Min et al., 2013)
- Missing facts are hard to obtain from external sources
• We need an alternative way to complete a KG
3
IN THIS TALK, I WILL DISCUSS
• The knowledge graph completion task
- where the goal is to identify whether an unknown triple is positive or negative
• Specifically, two approaches for knowledge graph completion:
1. Predict unknown triples based on observed triples
2. Label unknown triples with human experts
• These two approaches are complementary
4
1. UNKNOWN TRIPLE PREDICTION
• Unknown triples may be inferred from existing triples
- <DiCaprio, starred in, The Revenant> → <DiCaprio, job, actor>
• Statistical relational model
- Goal: predict unknown triples from known triples
- Find latent semantics of entities and relations
5
TENSOR REPRESENTATION OF KG
• A KG can be represented as a 3-dimensional binary tensor
• or as a collection of matrices, each of which represents one relation
[Excerpt from Nickel et al. (2011), "A Three-Way Model for Collective Learning on Multi-Relational Data": relational data is modelled as an n × n × m tensor X, where n is the number of entities and m the number of relations; an entry x_ijk = 1 denotes an existing relation (i-th entity, k-th predicate, j-th entity), while non-existing and unknown relations are set to zero. RESCAL factorises each frontal slice X_k as X_k ≈ A R_k A^⊤ for k = 1, …, m, where A is an n × r matrix holding the latent-component representations of the entities and R_k is an asymmetric r × r matrix modelling the interactions of the latent components for the k-th relation.]
• x_ijk = 1, where i = DiCaprio, j = The Revenant, k = starred in
• x_ij′k = 0, where i = DiCaprio, j′ = Star Trek, k = starred in
• Example: <DiCaprio, starred in, The Revenant> is positive; <DiCaprio, starred in, Star Trek> is not
6
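As a minimal sketch (not the authors' code), the binary-tensor representation above can be built with numpy; the entities, relation, and index layout are illustrative:

```python
# Sketch: a small KG as a binary 3-way tensor X[i, j, k],
# where i is the source entity, j the target entity, k the relation.
import numpy as np

entities = ["DiCaprio", "The Revenant", "Star Trek"]
relations = ["starred in"]
e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: k for k, r in enumerate(relations)}

# X[i, j, k] = 1 iff the triple <entity i, relation k, entity j> is known positive
X = np.zeros((len(entities), len(entities), len(relations)))
X[e_idx["DiCaprio"], e_idx["The Revenant"], r_idx["starred in"]] = 1.0
# Unknown and non-existing triples stay 0, e.g. <DiCaprio, starred in, Star Trek>
```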
BILINEAR TENSOR FACTORISATION (RESCAL) (Nickel et al., 2011)
• Latent space model
- Represents an entity as a D-dimensional vector
- Represents a relation as a D×D matrix
• Performs better on knowledge graphs than general-purpose tensor factorisation methods (Dong et al., 2014)
7
LATENT REPRESENTATION OF ENTITIES AND RELATIONS
e_D = [0.8, 0.1]^⊤ (Leonardo DiCaprio)
e_B = [0.1, 0.2]^⊤ (Yoshua Bengio)
e_O = [0.2, 0.8]^⊤ (Oscar)
R_RA = [0.1 0.9; 0.1 0.1] (Receive Award)
x_{B·RA·O} ≈ e_B^⊤ R_RA e_O   <Bengio, Receive, Oscar>
x_{D·RA·O} ≈ e_D^⊤ R_RA e_O   <DiCaprio, Receive, Oscar>
8
LATENT REPRESENTATION OF ENTITIES AND RELATIONS
e_D = [0.8, 0.1]^⊤ (Leonardo DiCaprio), e_B = [0.1, 0.2]^⊤ (Yoshua Bengio)
e_O = [0.2, 0.8]^⊤ (Oscar), R_RA = [0.1 0.9; 0.1 0.1] (Receive Award)
<DiCaprio, Receive, Oscar>: e_D^⊤ R_RA e_O = 0.60
<Bengio, Receive, Oscar>: e_B^⊤ R_RA e_O = 0.16
→ DiCaprio is more likely to get the Oscar award!
9
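The bilinear score on this slide can be reproduced in a few lines; the latent vectors and matrix are the numbers shown on the slide:

```python
# Sketch: scoring triples with the RESCAL bilinear form e_i^T R_k e_j,
# using the illustrative latent values from the slide.
import numpy as np

e_D = np.array([0.8, 0.1])     # Leonardo DiCaprio
e_B = np.array([0.1, 0.2])     # Yoshua Bengio
e_O = np.array([0.2, 0.8])     # Oscar
R_RA = np.array([[0.1, 0.9],
                 [0.1, 0.1]])  # Receive Award

score_D = e_D @ R_RA @ e_O     # score of <DiCaprio, Receive, Oscar>, ~0.60
score_B = e_B @ R_RA @ e_O     # score of <Bengio, Receive, Oscar>
# score_D > score_B: DiCaprio is more likely to receive the Oscar
```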
GOAL OF STATISTICAL RELATIONAL MODELS
→ The goal of a statistical relational model is to infer the latent vectors & matrices from known triples:
e_D = [?, ?]^⊤ (Leonardo DiCaprio), e_B = [?, ?]^⊤ (Yoshua Bengio)
e_O = [?, ?]^⊤ (Oscar), R_RA = [? ?; ? ?] (Receive Award)
→ Predict new positive triples from the inferred latent space
10
PROBABILISTIC BILINEAR (COMPOSITIONAL) MODEL
• We propose probabilistic RESCAL (PRESCAL)
• Place probability distributions over entity vectors and relation matrices
• Place a probability distribution over triples
• Measure how likely each triple is to be positive
e_i ∼ N_D(0, σ_e² I)
R_k ∼ N_{D²}(0, σ_R² I)
x_ijk ∼ N(e_i^⊤ R_k e_j, σ_x²)
11
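The generative story above can be sketched as follows; the dimensions and variance values are illustrative assumptions, not taken from the paper:

```python
# Sketch of PRESCAL's generative model: Gaussian priors on entity vectors
# and relation matrices, Gaussian likelihood on tensor entries.
import numpy as np

rng = np.random.default_rng(0)
n, m, D = 5, 2, 3                  # entities, relations, latent dimension (illustrative)
sigma_e = sigma_R = sigma_x = 0.1  # illustrative variances

E = rng.normal(0.0, sigma_e, size=(n, D))     # e_i ~ N_D(0, sigma_e^2 I)
R = rng.normal(0.0, sigma_R, size=(m, D, D))  # R_k ~ N_{D^2}(0, sigma_R^2 I)

# x_ijk ~ N(e_i^T R_k e_j, sigma_x^2)
means = np.einsum("id,kde,je->ijk", E, R, E)
X = means + rng.normal(0.0, sigma_x, size=(n, n, m))
```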
Leo —starred in→ Revenant —directed by→ Alejandro
(Triple: <Leo, starred in, Revenant>; compositional triple: <Leo, starred in / directed by, Alejandro>)
KNOWLEDGE COMPOSITION
• Another limitation of RESCAL: the model ignores path information in the KG during factorisation
• e.g. the factorisation ignores the compositional relation between Leonardo DiCaprio and Alejandro G. Iñárritu
12
KG AUGMENTED WITH COMPOSITIONAL TRIPLES
• We propose a probabilistic compositional RESCAL
• Augment the data with compositional relations
• Explicitly include path information in the factorisation
Leo —starred in→ Revenant —directed by→ Alejandro
[Figure: the original tensor augmented with additional slices for compositional triples]
13
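A hypothetical helper (`compose` is my name, not the paper's) illustrating how length-2 paths in the KG yield the compositional triples used to augment the tensor:

```python
# Sketch: join triples on a shared middle entity to form length-2
# compositional triples, e.g. <Leo, starred in, Revenant> and
# <Revenant, directed by, Alejandro> give
# <Leo, (starred in, directed by), Alejandro>.
from collections import defaultdict

def compose(triples):
    by_source = defaultdict(list)
    for s, r, t in triples:
        by_source[s].append((r, t))
    out = set()
    for s, r1, mid in triples:
        for r2, t in by_source[mid]:
            out.add((s, (r1, r2), t))
    return out

triples = [("Leo", "starred in", "Revenant"),
           ("Revenant", "directed by", "Alejandro")]
# compose(triples) contains ("Leo", ("starred in", "directed by"), "Alejandro")
```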
TENSOR FACTORISATION ON REAL DATASETS
• Split each dataset into training/testing sets; measure predictive performance on the test set given the training set
• Examples:
- KINSHIP: <person A, father of, person B>
- UMLS: <disease E, affects, function F>
- NATION: <nation C, supports, nation D>
Datasets:
Dataset   # Relations   # Entities   # Positive Triples   Sparsity
KINSHIP   26            104          10790                0.038
UMLS      49            135          6752                 0.008
NATION    56            14           2024                 0.184
14
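The sparsity column can be checked directly: sparsity = #positive triples / (n² · m), the fraction of all possible (i, j, k) entries that are positive:

```python
# Verify the sparsity column of the dataset table:
# sparsity = positives / (n * n * m).
datasets = {            # name: (relations m, entities n, positive triples)
    "KINSHIP": (26, 104, 10790),
    "UMLS":    (49, 135, 6752),
    "NATION":  (56, 14, 2024),
}
for name, (m, n, pos) in datasets.items():
    sparsity = pos / (n * n * m)
    print(f"{name}: {sparsity:.3f}")   # 0.038, 0.008, 0.184
```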
PREDICTION RESULT
• Higher AUC indicates better predictive performance
• Probabilistic models outperform the non-probabilistic model
• Compositional models perform best on the UMLS and NATION datasets
[Figure: AUC score vs. training set proportion/size on NATION, UMLS, and KINSHIP; curves for RESCAL, PRESCAL, and COMPOSITIONAL]
15
2. COMPLETION WITH HUMAN EXPERTS
• Some domains may not have enough information to construct a KG (e.g. bioinformatics, medical data)
- Without enough known triples, RESCAL-style models do not work well
• Each unknown entry of a KG may be uncovered by domain experts
- Uncovering an entry costs part of a limited budget
- There are too many triples to evaluate them all, e.g. UMLS has 900,000 possible triples
• Goal: choose which triple to evaluate next
- to maximise the number of positive triples found under a limited budget
16
GREEDY INCREMENTAL COMPLETION WITH STATISTICAL MODEL
• Use the predictions of the bilinear model
• Greedy approach to incremental completion:
1. Infer the latent semantics of the KG given the observed triples
2. Query the unknown triple with the highest expected value under the current approximation
3. Update the latent semantics based on the queried result
4. Repeat 2-3 until the budget limit is reached
argmax_{ijk} E[e_i^⊤ R_k e_j]
17
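The four steps above can be sketched as a loop; `infer` and `label` are hypothetical stand-ins for posterior inference over the latent factors and for the human expert's answer:

```python
# Sketch of the greedy incremental-completion loop.
import numpy as np

def greedy_completion(X_obs, mask_unknown, infer, label, budget):
    for _ in range(budget):
        E, R = infer(X_obs)                        # 1. infer latent semantics
        scores = np.einsum("id,kde,je->ijk", E, R, E)
        scores[~mask_unknown] = -np.inf            # restrict to unknown triples
        i, j, k = np.unravel_index(np.argmax(scores), scores.shape)
        X_obs[i, j, k] = label(i, j, k)            # 2. query the expert
        mask_unknown[i, j, k] = False              # 3. update observations
    return X_obs                                   # 4. repeat until budget spent
```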
EXPLOITATION-EXPLORATION
• The greedy approach always queries the maximum-expectation triple: it may not explore the latent space efficiently
• We employ the randomised probability matching (RPM) algorithm (Thompson, 1933; a.k.a. Thompson sampling)
- Optimal balance between exploitation and exploration
- Instead of taking the triple with the maximum expectation, consider the uncertainty of unknown triples
argmax_{ijk} E[e_i^⊤ R_k e_j]
18
T1 = <DiCaprio, starred in, The Revenant>
T2 = <Bengio, receive, Oscar>
T3 = <DiCaprio, profession, Actor>
RANDOMISED PROBABILITY MATCHING
• Although T2 has the highest expectation, its variance is also high
• Considering the uncertainty, it might be better to choose T1
- Choose the triple to be labelled based on a random sample from its distribution
E[T1] = 0.9, E[T2] = 1.5, E[T3] = 0.2
19
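Randomised probability matching over the three candidate triples can be sketched as below; the means are from the slide, but the standard deviations are illustrative assumptions:

```python
# Sketch of Thompson sampling: instead of picking the highest posterior
# mean, draw one sample per triple from its posterior and take the argmax.
import numpy as np

rng = np.random.default_rng(42)
means = np.array([0.9, 1.5, 0.2])   # E[T1], E[T2], E[T3] from the slide
stds  = np.array([0.1, 1.0, 0.1])   # illustrative: T2 is much more uncertain

samples = rng.normal(means, stds)   # one posterior draw per triple
chosen = int(np.argmax(samples))    # may pick T1 whenever T2's draw is low
```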
EXPERIMENT: INCREMENTAL COMPLETION
• Comparison with two greedy algorithms (Kajino et al., 2015)
- AMDC-POP: greedy population model, always queries the triple with the highest expected value
- AMDC-PRED: greedy prediction model, always queries the most ambiguous triple (expected value around 0.5)
• Performance metrics
- Cumulative gain: total number of positive triples found up to time t
- AUC: predictive performance of the model (as in the first task)
20
COMPARISON WITH GREEDY APPROACHES
NATION KINSHIP UMLS
21
[Figure: cumulative gain and ROC-AUC score vs. number of queries on each dataset; curves for PNORMAL-TS, AMDC-POP, and AMDC-PRED]
SUMMARY
• The knowledge graph completion task can be divided into unknown triple prediction and incremental completion
• We developed probabilistic RESCAL and incorporated compositional structure into the factorisation
• In the prediction task, compositional triples help to predict unknown triples
• In the construction task, randomised probability matching efficiently uncovers the latent semantics of the KG
22
23
INTUITION BEHIND BILINEAR MODEL
e_D = [0.8, 0.1]^⊤ (Leonardo DiCaprio), e_B = [0.1, 0.2]^⊤ (Yoshua Bengio): the first dimension acts as an indicator of a good actor
e_O = [0.2, 0.8]^⊤ (Oscar): the second dimension acts as an indicator of a prestigious award
R_RA = [0.1 0.9; 0.1 0.1] (Receive Award): the 0.9 entry captures the interaction between good actor and prestigious award in this relation
→ The bilinear model infers the latent semantics of entities and relations
24
COMPARISON WITH COMPOSITIONAL MODELS
NATION KINSHIP UMLS
[Figure: cumulative gain and ROC-AUC score vs. number of queries on each dataset; curves for PRESCAL-TS and COMPOSITIONAL-TS]
25
POSTERIOR VARIANCE ANALYSIS
• The variance of the compositional model shrinks faster than that of the non-compositional model
[Figure: posterior variance over queries on KINSHIP and UMLS]
26