Traversing Knowledge Graphs in Vector Space (full)

Kelvin Guu, John Miller, Percy Liang

This talk is about how to traverse knowledge graphs in vector space and the surprising benefits you get from doing so.

Description

Slides from EMNLP 2015. Includes presenter notes + a bonus section at the end that I didn't cover at EMNLP!

Transcript of Traversing Knowledge Graphs in Vector Space (full)

Page 1: Traversing Knowledge Graphs in Vector Space (full)

Traversing Knowledge Graphs in Vector Space

Kelvin Guu, John Miller, Percy Liang

This talk is about how to traverse knowledge graphs in vector space and the surprising benefits you get from doing so.

Page 2: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

So, what is great about knowledge graphs? One of their most powerful aspects is that they support compositional queries.

For example, we can take a natural question like this and convert it into a path query, which can then be executed on a knowledge graph such as Freebase via graph traversal.

You start with the entity Portugal, traverse to all the people located in Portugal, then traverse to all the languages they speak.

Page 3: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

What languages are spoken by people in Portugal?


Page 4: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

portugal/location/language

What languages are spoken by people in Portugal?


Page 5: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

portugal/location/language

What languages are spoken by people in Portugal?

portugal


Page 6: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

portugal/location/language

What languages are spoken by people in Portugal?

portugal

fernando_pessoa

jorge_sampaio


Page 7: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

french

portuguese

english

portugal/location/language

What languages are spoken by people in Portugal?

portugal

fernando_pessoa

jorge_sampaio


Page 8: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

french

portuguese

english

fernando_pessoa

jorge_sampaio

portugal

location

But one of their weaknesses is that they are often incomplete.

For example, we might be missing the fact that Fernando Pessoa was located in Portugal.

Here, each fact in the knowledge graph is simply a (subject, predicate, object) triple, depicted as a labeled edge in the knowledge graph. I will use the terms fact, edge and triple interchangeably for the rest of the talk.
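To make the data model concrete, here is a tiny illustrative sketch (toy facts from the running example, not the authors' code) of a knowledge graph stored as a set of such triples:

graph = {
    ("jorge_sampaio", "location", "portugal"),
    ("jorge_sampaio", "language", "portuguese"),
    ("fernando_pessoa", "language", "portuguese"),
}

# a "fact" / "edge" / "triple" is simply one element of this set
print(("jorge_sampaio", "location", "portugal") in graph)     # True
print(("fernando_pessoa", "location", "portugal") in graph)   # False: a true fact that is missing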

Page 9: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

french

portuguese

english

fernando_pessoa

jorge_sampaio

portugal

(fernando_pessoa, location, portugal)

location


Page 10: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

french

portuguese

english

fernando_pessoa

jorge_sampaio

portugal

(fernando_pessoa, location, portugal)

location

“fact” = “edge” = “triple”


Page 11: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

french

portuguese

english

fernando_pessoa

jorge_sampaio

portugal

location

93.8% of persons from Freebase have no place of birth, and 78.5% have no nationality [Min et al, 2013]

To grasp the magnitude of the problem, consider that in 2013, 93.8% of people in Freebase had no place of birth, and 78.5% had no nationality.

So, knowledge graphs are good for compositionality, but suffer from incompleteness.

Page 12: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

french

portuguese

english

fernando_pessoa

jorge_sampaio

portugal

location

93.8% of persons from Freebase have no place of birth, and 78.5% have no nationality [Min et al, 2013]

strength: compositionality


Page 13: Traversing Knowledge Graphs in Vector Space (full)

Knowledge graphs

french

portuguese

english

fernando_pessoa

jorge_sampaio

portugal

location

93.8% of persons from Freebase have no place of birth, and 78.5% have no nationality [Min et al, 2013]

weakness: incompleteness

strength: compositionality


Page 14: Traversing Knowledge Graphs in Vector Space (full)

Vector space models

fernando_pessoa

portugal

barack_obama

united_states

Many methods have been developed to infer missing facts. One interesting class of methods is to embed the entire knowledge graph in vector space.

Each entity in the knowledge graph is represented by a point in vector space, and the relationships that hold between entities are reflected in the spatial relationships between points.

By squeezing a large number of facts into a low-dimensional vector space, we force the model to compress knowledge in a way that enables it to predict new, previously unseen facts.

This is the general principle that compression can lead to generalization.
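As a rough sketch of what the low-dimensional part means in terms of parameters (the dimension, names, and random values below are made up for illustration; this is not yet any specific model from the talk):

import numpy as np

entities = ["portugal", "fernando_pessoa", "barack_obama", "united_states"]
relations = ["location", "language", "nationality"]
d = 10                                      # a deliberately small embedding dimension

rng = np.random.default_rng(0)
entity_vecs = {e: rng.normal(size=d) for e in entities}            # one point per entity
relation_params = {r: rng.normal(size=(d, d)) for r in relations}  # one operator per relation

# Every possible fact must now be explained by only
# len(entities) * d + len(relations) * d * d numbers -- the compression that
# forces the model to generalize to facts it has never seen.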

Page 15: Traversing Knowledge Graphs in Vector Space (full)

Vector space models

fernando_pessoa

portugal

barack_obama

united_states

location

location


Page 16: Traversing Knowledge Graphs in Vector Space (full)

Vector space models

fernando_pessoa

portugal

barack_obama

united_states

location

location

large # facts —> low-dimensional space


Page 17: Traversing Knowledge Graphs in Vector Space (full)

Vector space models

fernando_pessoa

portugal

barack_obama

united_states

location

location

compression —> generalization

large # facts —> low-dimensional space


Page 18: Traversing Knowledge Graphs in Vector Space (full)

Vector space models
• Tensor factorization [Nickel et al., 2011]
• Neural Tensor Network [Socher et al., 2013]
• TransE [Bordes et al., 2013]
• Universal Schema [Riedel et al., 2013]
• General framework + comparison [Yang et al., 2015]
• Compositional embedding of paths [Neelakantan et al., 2015]

There has been a significant amount of work on this topic, and I’ve just listed a few related papers here.

They all excel at handling incompleteness, but none of them directly address how to answer compositional queries, which was the original strength of knowledge graphs.

Page 19: Traversing Knowledge Graphs in Vector Space (full)

Vector space models
• Tensor factorization [Nickel et al., 2011]
• Neural Tensor Network [Socher et al., 2013]
• TransE [Bordes et al., 2013]
• Universal Schema [Riedel et al., 2013]
• General framework + comparison [Yang et al., 2015]
• Compositional embedding of paths [Neelakantan et al., 2015]

strength: handle incompleteness


Page 20: Traversing Knowledge Graphs in Vector Space (full)

Vector space models
• Tensor factorization [Nickel et al., 2011]
• Neural Tensor Network [Socher et al., 2013]
• TransE [Bordes et al., 2013]
• Universal Schema [Riedel et al., 2013]
• General framework + comparison [Yang et al., 2015]
• Compositional embedding of paths [Neelakantan et al., 2015]

weakness: no compositionality

strength: handle incompleteness


Page 21: Traversing Knowledge Graphs in Vector Space (full)

Graph databases Vector space models

Compositional queries

Handle incompleteness

So, we have seen two different ways to store knowledge, each with their own pros and cons.

Graph databases are very precise and can handle compositional queries, but are largely incomplete.

Vector space models can infer missing facts through compression, but so far it is not clear how they can support compositional queries.

Page 22: Traversing Knowledge Graphs in Vector Space (full)

Graph databases Vector space models

Compositional queries

Handle incompleteness

This talk

This talk is about making it possible for vector space models to also handle compositional queries, despite the inherent errors introduced by compression.

And surprisingly, we show that when we extend vector space models to handle compositional queries, they also improve at their original purpose of inferring missing facts.

Page 23: Traversing Knowledge Graphs in Vector Space (full)

Graph databases Vector space models

Compositional queries

Handle incompleteness

This talk


Page 24: Traversing Knowledge Graphs in Vector Space (full)

Graph databases Vector space models

Compositional queries

Handle incompleteness

This talk

PART I

PART II

So, roughly speaking, this talk will have two parts, one for each contribution.

Page 25: Traversing Knowledge Graphs in Vector Space (full)

Outline

Here’s what will happen in a little more detail.

Page 26: Traversing Knowledge Graphs in Vector Space (full)

Outline

PART I


Page 27: Traversing Knowledge Graphs in Vector Space (full)

Outline

PART I
• Interpret many existing vector space models as each implementing a traversal operator.


Page 28: Traversing Knowledge Graphs in Vector Space (full)

Outline

PART I
• Interpret many existing vector space models as each implementing a traversal operator.


Page 29: Traversing Knowledge Graphs in Vector Space (full)

Outline

PART I
• Interpret many existing vector space models as each implementing a traversal operator.
• Propose how all these models can be generalized to answer path queries.


Page 30: Traversing Knowledge Graphs in Vector Space (full)

Outline

PART I
• Interpret many existing vector space models as each implementing a traversal operator.
• Propose how all these models can be generalized to answer path queries.


Page 31: Traversing Knowledge Graphs in Vector Space (full)

Outline

PART I
• Interpret many existing vector space models as each implementing a traversal operator.
• Propose how all these models can be generalized to answer path queries.

PART II


Page 32: Traversing Knowledge Graphs in Vector Space (full)

Outline

PART I
• Interpret many existing vector space models as each implementing a traversal operator.
• Propose how all these models can be generalized to answer path queries.

PART II
• Show that when we train these models to answer path queries, they actually become better at predicting missing facts.


Page 33: Traversing Knowledge Graphs in Vector Space (full)

Outline

PART I
• Interpret many existing vector space models as each implementing a traversal operator.
• Propose how all these models can be generalized to answer path queries.

PART II
• Show that when we train these models to answer path queries, they actually become better at predicting missing facts.


Page 34: Traversing Knowledge Graphs in Vector Space (full)

Outline

PART I
• Interpret many existing vector space models as each implementing a traversal operator.
• Propose how all these models can be generalized to answer path queries.

PART II
• Show that when we train these models to answer path queries, they actually become better at predicting missing facts.
• Explain why this new path-based training procedure helps.


Page 35: Traversing Knowledge Graphs in Vector Space (full)

Path queries

So, let's start with Part I. We will focus on how to answer a particular kind of compositional query: path queries. Our earlier example actually already illustrated a path query.

Here it is again on the bottom line. A path query is processed recursively.

Page 36: Traversing Knowledge Graphs in Vector Space (full)

Path queries

portugal/location/language

What languages are spoken by people in Portugal?


Page 37: Traversing Knowledge Graphs in Vector Space (full)

Path queries

{portugal}

portugal

We start with a single token, denoting a set containing just one entity, portugal.

Page 38: Traversing Knowledge Graphs in Vector Space (full)

Path queries

{portugal}

portugal / location

{fernando_pessoa, jorge_sampaio,

… vasco_da_gama}

At every stage, we compute a set of entities, which is all we need to compute the next set.

Page 39: Traversing Knowledge Graphs in Vector Space (full)

Path queries

{portugal}

portugal / location / language

{fernando_pessoa, jorge_sampaio,

… vasco_da_gama}

{portuguese, spanish,

… english}

This final set is the answer to the path query.
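Here is a minimal sketch of that recursive, set-based execution on a toy graph. The entities and the helper are illustrative only, and it assumes edges are also indexed in the reverse direction so that location can be followed from portugal to the people located there:

from collections import defaultdict

triples = [
    ("fernando_pessoa", "location", "portugal"),
    ("jorge_sampaio", "location", "portugal"),
    ("fernando_pessoa", "language", "portuguese"),
    ("jorge_sampaio", "language", "portuguese"),
    ("jorge_sampaio", "language", "english"),
]

edges = defaultdict(set)          # edges[(entity, relation)] -> neighboring entities
for s, r, o in triples:
    edges[(s, r)].add(o)
    edges[(o, r)].add(s)          # index the reverse direction too (an assumption here)

def answer(path_query):
    """Execute e.g. 'portugal/location/language' by set-to-set traversal."""
    anchor, *relations = path_query.split("/")
    current = {anchor}
    for r in relations:           # at every stage, the current set is all we need
        nxt = set()
        for e in current:
            nxt |= edges[(e, r)]
        current = nxt
    return current

print(answer("portugal/location/language"))   # {'portuguese', 'english'}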

Page 40: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

{portugal}

portugal / location / language

{fernando_pessoa, jorge_sampaio,

… vasco_da_gama}

{portuguese, spanish,

… english}

You can imagine that each of these sets is represented as a sparse vector. We can then traverse from one set to another by multiplying the set vector by a relation's adjacency matrix. The result is another vector whose non-zero entries represent the new set.

Now, the point of this is to connect graph traversal with vector space models. Since vector space models rely on limiting their dimensionality to achieve generalization, let's imagine that we can take these sparse adjacency matrices and somehow compress them into dense matrices.

In the resulting model, we can still identify these guys as set vectors.

Page 41: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

{portugal}

portugal / location / language

{fernando_pessoa, jorge_sampaio,

… vasco_da_gama}

{portuguese, spanish,

… english}

sparse vectors


Page 42: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

{portugal}

portugal / location / language

{fernando_pessoa, jorge_sampaio,

… vasco_da_gama}

{portuguese, spanish,

… english}

sparse vectors


Page 43: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

{portugal}

portugal / location / language

{fernando_pessoa, jorge_sampaio,

… vasco_da_gama}

{portuguese, spanish,

… english}

sparse vectors x ->


Page 44: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

{portugal}

portugal / location / language

{fernando_pessoa, jorge_sampaio,

… vasco_da_gama}

{portuguese, spanish,

… english}

sparse vectors x -> x ->


Page 45: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

{portugal}

portugal / location / language

{fernando_pessoa, jorge_sampaio,

… vasco_da_gama}

{portuguese, spanish,

… english}

sparse vectors x -> x ->

dense, low-dim vectors

x -> x ->


Page 46: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

{portugal}

portugal / location / language

{fernando_pessoa, jorge_sampaio,

… vasco_da_gama}

{portuguese, spanish,

… english}

sparse vectors x -> x ->

dense, low-dim vectors

x -> x ->

set vectors!


Page 47: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

{portugal}

portugal / location / language

{fernando_pessoa, jorge_sampaio,

… vasco_da_gama}

{portuguese, spanish,

… english}

sparse vectors x -> x ->

dense, low-dim vectors

x -> x ->

traversal operators

And we can identify these guys as traversal operators.
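To make the sparse picture concrete, here is a small sketch (toy entity list, illustrative only) in which a set is an indicator vector and each relation is a 0/1 adjacency matrix, so one traversal step is one vector-matrix product:

import numpy as np

entities = ["portugal", "fernando_pessoa", "jorge_sampaio", "portuguese", "english"]
idx = {e: i for i, e in enumerate(entities)}
n = len(entities)

A_location = np.zeros((n, n))   # A_location[i, j] = 1: "location" leads from entity i to entity j
A_location[idx["portugal"], idx["fernando_pessoa"]] = 1
A_location[idx["portugal"], idx["jorge_sampaio"]] = 1

A_language = np.zeros((n, n))
A_language[idx["fernando_pessoa"], idx["portuguese"]] = 1
A_language[idx["jorge_sampaio"], idx["portuguese"]] = 1
A_language[idx["jorge_sampaio"], idx["english"]] = 1

v = np.zeros(n)
v[idx["portugal"]] = 1                          # the set {portugal} as a sparse vector
v = v @ A_location                              # people located in Portugal
v = v @ A_language                              # languages they speak
print([e for e in entities if v[idx[e]] > 0])   # ['portuguese', 'english']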

Page 48: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

{portugal} {fernando_pessoa, jorge_sampaio,

… vasco_da_gama}

{portuguese, spanish,

… english}

sparse vectors x -> x ->

dense, low-dim vectors

x -> x ->

membership operator: dot product

portugal / location / language

x_english

x_english

Now suppose we want to check whether English is in the final set. In the sparse vector setup, you can do this by dot-producting the set vector with a one-hot vector representing English. If the resulting score is non-zero, English is in the set.

In the new dense setup, you can still analogously dot product with a dense vector representing English. If the score is high, we say that English is in the set. If the score is low, then it is not.
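And the dense analogue, again only a sketch: the parameters below are random stand-ins for the learned entity vectors and "compressed adjacency matrices", and the membership check is the dot product just described (the scores only become meaningful after training):

import numpy as np

d = 20
rng = np.random.default_rng(0)
x = {e: rng.normal(size=d) for e in ["portugal", "english", "german"]}  # entity vectors
W = {r: rng.normal(size=(d, d)) for r in ["location", "language"]}      # traversal operators

set_vec = x["portugal"] @ W["location"] @ W["language"]   # dense "set vector" for the query

score_english = set_vec @ x["english"]   # should come out high once trained (english is an answer)
score_german = set_vec @ x["german"]     # should come out low (german is not)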

Page 49: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

q = s / r1 / r2 / … / rk

Now let’s switch over to looking at what we just covered, in more abstract form.

We had a path query q. It started with an anchor entity s, followed by a sequence of relations r1 through rk.

To compute the answer set, we started with x_s and multiplied by a sequence of “compressed adjacency matrices”.

To check whether an entity t is in the answer set, we dot product the answer vector with x_t to get a membership score.

So now the question is: where do we get these dense entity vectors x from and these “compressed adjacency matrices” W?
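Written as code, the scoring function just described looks roughly like this (a sketch; entity_vecs and relation_mats are hypothetical dicts holding the x vectors and W matrices, which are learned, as explained shortly):

def score(q, t, entity_vecs, relation_mats):
    """score(q, t) = x_s^T W_r1 W_r2 ... W_rk x_t for the path query q = s/r1/.../rk.

    entity_vecs and relation_mats map names to numpy arrays of shape (d,) and (d, d).
    """
    s, *rels = q.split("/")
    v = entity_vecs[s]
    for r in rels:                     # apply one compressed adjacency matrix per hop
        v = v @ relation_mats[r]
    return float(v @ entity_vecs[t])   # membership score for candidate answer t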

Page 50: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

q = s / r1 / r2 / … / rk

x_s^T W_r1 W_r2 ... W_rk


Page 51: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

q = s / r1 / r2 / … / rk

x_s^T W_r1 W_r2 ... W_rk

is t an answer to q?


Page 52: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

q = s / r1 / r2 / … / rk

x_s^T W_r1 W_r2 ... W_rk

is t an answer to q?

score(q, t) = x_s^T W_r1 W_r2 ... W_rk x_t


Page 53: Traversing Knowledge Graphs in Vector Space (full)

Path queries in vector space

q = s / r1 / r2 / … / rk

x_s^T W_r1 W_r2 ... W_rk

is t an answer to q?

score(q, t) = x_s^T W_r1 W_r2 ... W_rk x_t


Page 54: Traversing Knowledge Graphs in Vector Space (full)

Training

• Training examples:
• Margin:
• Objective:
• Algorithm:

The answer is that we learn all of them. Suppose we have a bunch of query-answer pairs to train on (q is the query, t is one answer). We can define a margin to be the difference between the score of a correct answer and the score of an incorrect answer. We can then write down a typical objective to maximize that margin. Optimization of that objective can be achieved in many ways, but we just choose SGD, or one of its variants.
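As a rough sketch of that objective for one training example (simplified: the set of incorrect candidates N(q), the parameterization, and the optimizer's gradient steps are all assumed given):

def example_loss(x_s, Ws, x_t, x_negatives):
    """Hinge loss sum over t' of [1 - (score(q, t) - score(q, t'))]_+ for one (q, t).

    x_s, x_t: numpy vectors for the anchor and the correct answer;
    Ws: relation matrices along the query path; x_negatives: vectors for t' in N(q).
    """
    v = x_s
    for W in Ws:
        v = v @ W                                    # traverse the query path
    positive = v @ x_t                               # score of the correct answer
    return sum(max(0.0, 1.0 - (positive - v @ x_neg))
               for x_neg in x_negatives)

# Summing this over all training examples gives the objective, which is then
# minimized with SGD (or a variant) with respect to the x's and W's.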

So, I’ve just finished fully describing one possible model for answering path queries. But I said earlier that I would show how you can generalize many existing vector space models to answer path queries.

Let’s do that next.

Page 55: Traversing Knowledge Graphs in Vector Space (full)

Training

• Training examples: (q, t)
• Margin:
• Objective:
• Algorithm:


Page 56: Traversing Knowledge Graphs in Vector Space (full)

Training

• Training examples: (q, t)
• Margin: margin(q, t, t') = score(q, t) - score(q, t')
• Objective:
• Algorithm:


Page 57: Traversing Knowledge Graphs in Vector Space (full)

Training

• Training examples: (q, t)
• Margin: margin(q, t, t') = score(q, t) - score(q, t')
• Objective: sum_{i=1..n} sum_{t' in N(q_i)} [1 - margin(q_i, t_i, t')]_+
• Algorithm:


Page 58: Traversing Knowledge Graphs in Vector Space (full)

Training

• Training examples: (q, t)
• Margin: margin(q, t, t') = score(q, t) - score(q, t')
• Objective: sum_{i=1..n} sum_{t' in N(q_i)} [1 - margin(q_i, t_i, t')]_+
• Algorithm: SGD


Page 59: Traversing Knowledge Graphs in Vector Space (full)

Training

• Training examples: (q, t)
• Margin: margin(q, t, t') = score(q, t) - score(q, t')
• Objective: sum_{i=1..n} sum_{t' in N(q_i)} [1 - margin(q_i, t_i, t')]_+
• Algorithm: SGD

Next: generalize existing models


Page 60: Traversing Knowledge Graphs in Vector Space (full)

Existing vector space models
• Tensor factorization [Nickel et al., 2011]
• Neural Tensor Network [Socher et al., 2013]
• TransE [Bordes et al., 2013]
• Universal Schema [Riedel et al., 2013]
• General framework + comparison [Yang et al., 2015]

The existing models that we will generalize all have the following form…

Page 61: Traversing Knowledge Graphs in Vector Space (full)

Existing vector space models

They all define a scoring function on (subject, predicate, object) triples. The score should be high if the triple is true, and low otherwise.

You train the model to discriminate between true and false triples, and then at test time you predict missing facts by classifying unseen triples.

The main connection I want to make is that scoring triples is just a special case of answering path queries of length 1.

It’s easy to see this pictorially.

Page 62: Traversing Knowledge Graphs in Vector Space (full)

Existing vector space models

score (subject, predicate, object)


Page 63: Traversing Knowledge Graphs in Vector Space (full)

Existing vector space models

score (subject, predicate, object)

score (obama, nationality, united states) = HIGH

score (obama, nationality, germany) = LOW


Page 64: Traversing Knowledge Graphs in Vector Space (full)

Existing vector space models

score (subject, predicate, object)

score (obama, nationality, united states) = HIGH

score (obama, nationality, germany) = LOW

score (obama, nationality, france)?


Page 65: Traversing Knowledge Graphs in Vector Space (full)

Existing vector space models

score (subject, predicate, object)

score (obama, nationality, united states) = HIGH

score (obama, nationality, germany) = LOW

score (obama, nationality, france)?

special case of answering path queries


Page 66: Traversing Knowledge Graphs in Vector Space (full)

Existing vector space models

r

s

t

Predict whether the triple (s, r, t) is true

Existing vector space models predict whether the triple (s, r, t) is true

Page 67: Traversing Knowledge Graphs in Vector Space (full)

New path query models

r1, r2, r3, r4

Predict whether a path exists between s and t

s

t

The new path query models we will propose predict whether a path exists between s and t.

We’re going from a single hop to multiple hops.

Page 68: Traversing Knowledge Graphs in Vector Space (full)

New path query models

r1, r2, r3, r4

Predict whether a path exists between s and t

s

t

single-hop —> multi-hop


Page 69: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

With all this in mind, we can generalize many existing models.

The bilinear model proposed by Nickel et al. has a triple scoring function that looks like this. This turns out to be the length-1 special case of the model we proposed 5 slides ago.

You can see that we identify W as a traversal operator and repeatedly apply it.

The TransE model proposed by Bordes et al. is another triple scoring function. Again, we can identify translation by the vector w_r as the traversal operator and repeatedly apply it to answer path queries.

More generally, any triple-scoring function that decomposes into a traversal operator and a membership operator can be generalized to answer path queries.

So, now we can generalize any of these existing vector space models to answer path queries. But perhaps more importantly, this generalization gives us a new way to train existing models.
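As a compact sketch of the two concrete rows of this table (toy numpy versions, not the authors' code): matrix multiplication is the traversal operator for the bilinear model, and vector addition is the traversal operator for TransE; with a single relation each reduces to its original single-edge form.

import numpy as np

def bilinear_path_score(x_s, Ws, x_t):
    """x_s^T W_r1 W_r2 ... W_rk x_t"""
    v = x_s
    for W in Ws:            # traversal operator: multiply by the relation matrix
        v = v @ W
    return float(v @ x_t)   # membership operator: dot product

def transe_path_score(x_s, ws, x_t):
    """-||x_s + w_r1 + ... + w_rk - x_t||_2^2"""
    v = x_s + np.sum(ws, axis=0)           # traversal operator: add the relation vectors
    return float(-np.sum((v - x_t) ** 2))  # membership operator: negative squared distance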

Page 70: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012)


Page 71: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t


Page 72: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t


Page 73: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t


Page 74: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t


Page 75: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t

TransE (Bordes+, 2013)


Page 76: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t

TransE (Bordes+, 2013): -||x_s + w_r - x_t||_2^2


Page 77: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t

TransE (Bordes+, 2013): -||x_s + w_r - x_t||_2^2 | -||x_s + w_r1 + ... + w_rk - x_t||_2^2


Page 78: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t

TransE (Bordes+, 2013): -||x_s + w_r - x_t||_2^2 | -||x_s + w_r1 + ... + w_rk - x_t||_2^2


Page 79: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t

TransE (Bordes+, 2013): -||x_s + w_r - x_t||_2^2 | -||x_s + w_r1 + ... + w_rk - x_t||_2^2

More generally (??, 2015)


Page 80: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t

TransE (Bordes+, 2013): -||x_s + w_r - x_t||_2^2 | -||x_s + w_r1 + ... + w_rk - x_t||_2^2

More generally (??, 2015): M(T_r(x_s), x_t)


Page 81: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t

TransE (Bordes+, 2013): -||x_s + w_r - x_t||_2^2 | -||x_s + w_r1 + ... + w_rk - x_t||_2^2

More generally (??, 2015): M(T_r(x_s), x_t) | M(T_rk(... T_r1(x_s)), x_t)


Page 82: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t

TransE (Bordes+, 2013): -||x_s + w_r - x_t||_2^2 | -||x_s + w_r1 + ... + w_rk - x_t||_2^2

More generally (??, 2015): M(T_r(x_s), x_t) | M(T_rk(... T_r1(x_s)), x_t)


Page 83: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t

TransE (Bordes+, 2013): -||x_s + w_r - x_t||_2^2 | -||x_s + w_r1 + ... + w_rk - x_t||_2^2

More generally (??, 2015): M(T_r(x_s), x_t) | M(T_rk(... T_r1(x_s)), x_t)

answer path queries


Page 84: Traversing Knowledge Graphs in Vector Space (full)

Generalize existing models

Original model (single edge) | New model (path)

Bilinear (Nickel+, 2012): x_s^T W_r x_t | x_s^T W_r1 W_r2 ... W_rk x_t

TransE (Bordes+, 2013): -||x_s + w_r - x_t||_2^2 | -||x_s + w_r1 + ... + w_rk - x_t||_2^2

More generally (??, 2015): M(T_r(x_s), x_t) | M(T_rk(... T_r1(x_s)), x_t)

answer path queries

a new way to train existing models


Page 85: Traversing Knowledge Graphs in Vector Space (full)

Old: single-edge training

[Diagram: a single edge s --r--> t]

Predict whether the triple (s, r, t) is true

In the old way of training these models, you trained them to score single edges correctly. We will call this single-edge training.

Page 86: Traversing Knowledge Graphs in Vector Space (full)

New: path training

[Diagram: a path s --r1--> · --r2--> · --r3--> · --r4--> t]

Predict whether a path exists between s and t

We can now instead train these models to score full paths, for which edges are one special case. We will call this path training.

Page 87: Traversing Knowledge Graphs in Vector Space (full)

Training

• Training examples: (q, t)
• Margin: margin(q, t, t') = score(q, t) - score(q, t')
• Objective: Σ_{i=1..n} Σ_{t' ∈ N(q_i)} [1 - margin(q_i, t_i, t')]_+
• Algorithm: SGD

The training objective is exactly what we discussed before.
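As a concrete illustration, here is a minimal sketch of one term of that objective: a margin-based hinge loss summed over incorrect answers, which is then minimized with SGD. The `score_fn` argument and the toy scores below are placeholders for whichever compositional model and negative-sampling scheme is being trained.

```python
def path_query_loss(score_fn, query, correct_t, negative_ts):
    """Hinge loss for one training example (q, t):
    sum over t' in N(q) of [1 - (score(q, t) - score(q, t'))]_+ ."""
    loss = 0.0
    for t_neg in negative_ts:
        margin = score_fn(query, correct_t) - score_fn(query, t_neg)
        loss += max(0.0, 1.0 - margin)
    return loss

# Toy usage with a stand-in scoring function (hypothetical scores).
toy_scores = {("portugal/location/language", "portuguese"): 2.0,
              ("portugal/location/language", "klingon"): 1.5,
              ("portugal/location/language", "javascript"): 0.2}
score_fn = lambda q, t: toy_scores[(q, t)]
print(path_query_loss(score_fn, "portugal/location/language",
                      "portuguese", ["klingon", "javascript"]))  # 0.5
```

In practice this loss is differentiated with respect to the entity embeddings and traversal operators, and the parameters are updated by stochastic gradient descent.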

Page 88: Traversing Knowledge Graphs in Vector Space (full)

Outline

PART I
• Interpret many existing vector space models as each implementing a traversal operator.
• Propose how all these models can be generalized to answer path queries.

PART II
• Show that when we train these models to answer path queries, they actually become better at predicting missing facts.
• Explain why this new path-based training procedure helps.

We’re now on to the second part of the talk, where we will see the surprising benefits of path training, and why it is better than single-edge training.

Page 89: Traversing Knowledge Graphs in Vector Space (full)

Experiments

[2×2 grid: TRAINING (PATH, SINGLE-EDGE) × TASK (PATH, SINGLE-EDGE)]

Here is a simple way to think about the experimental results you are about to see.

As I said before, we will compare path training versus single-edge training. We will compare them on two tasks.
• One is the path querying task, where we evaluate a model’s ability to answer path queries, just like the running example we’ve been using throughout this talk.
• The second is the single-edge prediction task. This measures a model’s ability to correctly classify triples as true or false. This is the original task that all of the previously mentioned vector space models were designed for.

Page 90: Traversing Knowledge Graphs in Vector Space (full)

Experiments

[2×2 grid: TRAINING (PATH, SINGLE-EDGE) × TASK (PATH: portugal/location/language?, SINGLE-EDGE)]

Here is a simple way to think about the experimental results you are about to see.

As I said before, we will compare path training versus single-edge training. We will compare them on two tasks.
• One is the path querying task, where we evaluate a model’s ability to answer path queries, just like the running example we’ve been using throughout this talk.
• The second is the single-edge prediction task. This measures a model’s ability to correctly classify triples as true or false. This is the original task that all of the previously mentioned vector space models were designed for.

Page 91: Traversing Knowledge Graphs in Vector Space (full)

Experiments

[2×2 grid: TRAINING (PATH, SINGLE-EDGE) × TASK (PATH: portugal/location/language?, SINGLE-EDGE: (pessoa, language, english)?)]

Here is a simple way to think about the experimental results you are about to see.

As I said before, we will compare path training versus single-edge training. We will compare them on two tasks.
• One is the path querying task, where we evaluate a model’s ability to answer path queries, just like the running example we’ve been using throughout this talk.
• The second is the single-edge prediction task. This measures a model’s ability to correctly classify triples as true or false. This is the original task that all of the previously mentioned vector space models were designed for.

Page 92: Traversing Knowledge Graphs in Vector Space (full)

Experiments

[2×2 grid: TRAINING (PATH, PATH-1) × TASK (PATH: portugal/location/language?, PATH-1: (pessoa, language, english)?)]

But remember that the single-edge prediction task is just equivalent to answering path queries of length 1.

Page 93: Traversing Knowledge Graphs in Vector Space (full)

Train and test distributions match

[2×2 grid: TRAINING (PATH, SINGLE-EDGE) × TASK (PATH: portugal/location/language?, SINGLE-EDGE: (pessoa, language, english)?); cells where train and test match are checkmarked]

Note that in checkmarked cells, the training example distribution will match the test example distribution, whereas it will not in the off-diagonal cells.

Page 94: Traversing Knowledge Graphs in Vector Space (full)

Datasets

[Chen et al, 2013]     WordNet     Freebase
Entities               39,000      75,000
Relations              11          13
Train edges            113,000     316,000
Test edges             11,000      24,000

We work with the knowledge base completion datasets released by Chen and Socher et al in 2013.

WordNet is a knowledge graph where each node is a word, and the edges indicate important relationships between words

Freebase, as you have already seen, contains common facts about important entities.Below are some examples of facts found in the two knowledge graphs.

For both knowledge graphs, roughly 90% of the facts were used for training and 10% were held out to create a test set.

Page 95: Traversing Knowledge Graphs in Vector Space (full)

Datasets

[Chen et al, 2013]     WordNet     Freebase
Entities               39,000      75,000
Relations              11          13
Train edges            113,000     316,000
Test edges             11,000      24,000

• (laugh, has_instance, giggle)

We work with the knowledge base completion datasets released by Chen and Socher et al in 2013.

WordNet is a knowledge graph where each node is a word, and the edges indicate important relationships between words

Freebase, as you have already seen, contains common facts about important entities.Below are some examples of facts found in the two knowledge graphs.

For both knowledge graphs, roughly 90% of the facts were used for training and 10% were held out to create a test set.

Page 96: Traversing Knowledge Graphs in Vector Space (full)

Datasets

[Chen et al, 2013]     WordNet     Freebase
Entities               39,000      75,000
Relations              11          13
Train edges            113,000     316,000
Test edges             11,000      24,000

• (laugh, has_instance, giggle)
• (snort, derivationally_related, snorter)

We work with the knowledge base completion datasets released by Chen and Socher et al in 2013.

WordNet is a knowledge graph where each node is a word, and the edges indicate important relationships between words

Freebase, as you have already seen, contains common facts about important entities.Below are some examples of facts found in the two knowledge graphs.

For both knowledge graphs, roughly 90% of the facts were used for training and 10% were held out to create a test set.

Page 97: Traversing Knowledge Graphs in Vector Space (full)

Datasets

[Chen et al, 2013]     WordNet     Freebase
Entities               39,000      75,000
Relations              11          13
Train edges            113,000     316,000
Test edges             11,000      24,000

• (laugh, has_instance, giggle)
• (snort, derivationally_related, snorter)
• (abraham_lincoln, place_of_birth, kentucky)

We work with the knowledge base completion datasets released by Chen and Socher et al in 2013.

WordNet is a knowledge graph where each node is a word, and the edges indicate important relationships between words

Freebase, as you have already seen, contains common facts about important entities.Below are some examples of facts found in the two knowledge graphs.

For both knowledge graphs, roughly 90% of the facts were used for training and 10% were held out to create a test set.

Page 98: Traversing Knowledge Graphs in Vector Space (full)

Path training: way more data

                        WordNet      Freebase
Single-edge training    113,000      316,000
Path training           2,000,000    6,000,000
Single-edge test        11,000       24,000
Path test               45,000       110,000

An important factor to note is that if a knowledge graph has several hundred thousand edges, it can contain over 100 times as many paths formed from those edges.

We sampled just a small subset of those paths to train and test on. But that small subset is still 20 times more data than the original edges.

Page 99: Traversing Knowledge Graphs in Vector Space (full)

Path sampling

[Diagram: a random walk s --r1--> · --r2--> · --r3--> · --r4--> t]

1. Start at a random node in the knowledge graph.
2. Traverse a random sequence of relations.
3. The final destination is marked as one answer to the query.

Here is how we sampled paths.
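The sampling procedure above is simple enough to sketch directly. The adjacency-list format of `graph` and the maximum path length are assumptions for illustration, not the paper's exact configuration.

```python
import random

def sample_path_query(graph, max_len=5):
    """Sample one path-query example by a random walk over the knowledge graph.
    `graph` maps each node to a list of (relation, neighbor) pairs (hypothetical format)."""
    s = random.choice(list(graph.keys()))              # 1. start at a random node
    relations, node = [], s
    for _ in range(random.randint(1, max_len)):        # 2. traverse a random sequence of relations
        if not graph[node]:                            #    stop early at a dead end
            break
        r, node = random.choice(graph[node])
        relations.append(r)
    return s, relations, node                          # 3. the final destination is one answer

# Toy graph
graph = {
    "tad_lincoln": [("parent", "abraham_lincoln")],
    "abraham_lincoln": [("parent", "thomas_lincoln"), ("place_of_birth", "kentucky")],
    "thomas_lincoln": [],
    "kentucky": [],
}
print(sample_path_query(graph))  # e.g. ('tad_lincoln', ['parent', 'parent'], 'thomas_lincoln')
```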

Page 100: Traversing Knowledge Graphs in Vector Space (full)

Several models that we generalize

To better establish the effect of path training, we applied it to three different vector space models. All of them were previously reported to achieve state-of-the-art results under different setups.

I’ve listed the traversal operators that each model corresponds to.

Page 101: Traversing Knowledge Graphs in Vector Space (full)

Evaluation metric

portugal / location / language

Our evaluation metric is quite simple. Given a path query, we will score all possible answers. This results in a ranking over all answers.

We define the quantile to be the percentage of negatives ranked after the correct answer. In this case, we would score 75% because three quarters of the wrong answers are ranked after the right one.

So, without further ado, here are the results…

Page 102: Traversing Knowledge Graphs in Vector Space (full)

Evaluation metric

portugal / location / language

1. klingon  2. portuguese  3. latin  4. javascript  5. chinese

Our evaluation metric is quite simple. Given a path query, we will score all possible answers. This results in a ranking over all answers.

We define the quantile to be the percentage of negatives ranked after the correct answer. In this case, we would score 75% because three quarters of the wrong answers are ranked after the right one.

So, without further ado, here are the results…

Page 103: Traversing Knowledge Graphs in Vector Space (full)

Evaluation metric

portugal / location / language

Quantile: percentage of negatives ranked after correct answer (100 is best, 0 is worst)

1. klingon  2. portuguese  3. latin  4. javascript  5. chinese

Our evaluation metric is quite simple. Given a path query, we will score all possible answers. This results in a ranking over all answers.

We define the quantile to be the percentage of negatives ranked after the correct answer. In this case, we would score 75% because three quarters of the wrong answers are ranked after the right one.

So, without further ado, here are the results…

Page 104: Traversing Knowledge Graphs in Vector Space (full)

Evaluation metric

portugal / location / language

Quantile: percentage of negatives ranked after correct answer (100 is best, 0 is worst)

1. klingon  2. portuguese  3. latin  4. javascript  5. chinese  (quantile: 75%)

Our evaluation metric is quite simple. Given a path query, we will score all possible answers. This results in a ranking over all answers.

We define the quantile to be the percentage of negatives ranked after the correct answer. In this case, we would score 75% because three quarters of the wrong answers are ranked after the right one.
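For clarity, here is a minimal sketch of the metric. The candidate scores are made up for the example; only the definition of the quantile is taken from the slide.

```python
def quantile(scores, correct, candidates):
    """Percentage of incorrect candidates ranked strictly below the correct answer
    (100 is best, 0 is worst)."""
    negatives = [c for c in candidates if c != correct]
    if not negatives:
        return 100.0
    below = sum(1 for c in negatives if scores[c] < scores[correct])
    return 100.0 * below / len(negatives)

# The slide's example: portuguese is ranked 2nd of 5, so 3 of the 4 negatives fall below it.
scores = {"klingon": 0.9, "portuguese": 0.8, "latin": 0.5, "javascript": 0.3, "chinese": 0.1}
print(quantile(scores, "portuguese", list(scores)))  # 75.0
```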

So, without further ado, here are the results…

Page 105: Traversing Knowledge Graphs in Vector Space (full)

Experiments

[2×2 grid: TRAINING (PATH, SINGLE-EDGE) × TASK (PATH: portugal/location/language?, SINGLE-EDGE: (pessoa, language, english)?)]

We’ll start with results on the path task

Page 106: Traversing Knowledge Graphs in Vector Space (full)

Results: path querying

[Bar chart: mean quantile (scale 50-100) on Freebase for the Bilinear model, comparing Edge vs. Path training]

Looking first at the Bilinear model’s performance on Freebase, we see that path training performs significantly better on the path task.

In terms of average quantile, we’re seeing a boost of well over 20 points.

Page 107: Traversing Knowledge Graphs in Vector Space (full)

Results: path querying

[Bar charts: mean quantile (scale 50-100) on Freebase and WordNet for Bilinear, Bilinear-Diag, and TransE, each comparing Edge vs. Path training]

Expanding our view to both datasets and all three models, we see that this trend holds up across the board.

Path training definitely helps all these models improve at the path query task.

Page 108: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help?

So, why does it help?

The first answer is not hard. Our train and test distributions match.

But I think there’s still a second question, which is, why should single-edge training be so bad at path queries? After all, if we know about all the individual edges, shouldn’t we know about the paths that they form?

Page 109: Traversing Knowledge Graphs in Vector Space (full)

• Train and test distributions match: If you test on path queries, you should train on path queries, not single edges.

Why does path training help?

So, why does it help?

The first answer is not hard. Our train and test distributions match.

But I think there’s still a second question, which is, why should single-edge training be so bad at path queries? After all, if we know about all the individual edges, shouldn’t we know about the paths that they form?

Page 110: Traversing Knowledge Graphs in Vector Space (full)

• Train and test distributions match: If you test on path queries, you should train on path queries, not single edges.

• Why should single-edge training be so bad at path queries?

Why does path training help?

So, why does it help?

The first answer is not hard. Our train and test distributions match.

But I think there’s still a second question, which is, why should single-edge training be so bad at path queries? After all, if we know about all the individual edges, shouldn’t we know about the paths that they form?

Page 111: Traversing Knowledge Graphs in Vector Space (full)

• Train and test distributions match: If you test on path queries, you should train on path queries, not single edges.

• Why should single-edge training be so bad at path queries?

• Path query training cuts down on cascading errors

Why does path training help?

So, why does it help?

The first answer is not hard. Our train and test distributions match.

But I think there’s still a second question, which is, why should single-edge training be so bad at path queries? After all, if we know about all the individual edges, shouldn’t we know about the paths that they form?

Page 112: Traversing Knowledge Graphs in Vector Space (full)

Cascading errors

Who is Tad Lincoln’s grandparent?

We hypothesize that the problem is due to cascading errors, which arise as a side-effect of how vector space models try to compress knowledge into low-dimensional space.

Here is a cartoon depiction of what we suspect is going on. Suppose that you want to know who Tad Lincoln’s grandparent is. And suppose that the parent traversal operation is represented by a translation to the right, as in this figure.

The single-edge training objective forces the vector for abraham_lincoln to be close to where it should be, denoted by the red circle. But because we are compressing facts into a low-dimensional space, and because we’re using a ranking objective, abraham lincoln never makes it to exactly where he should be. This results in some error, shown by the blue noise cloud.

After another step of traversal from abe lincoln to thomas_lincoln, you’ve accumulated more error.

As you traverse, the correct answer drifts farther and farther away from where you expect it to be.
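To make the cartoon a bit more concrete, here is a toy simulation of the hypothesized effect (not an experiment from the paper): with a TransE-style translation operator, a small per-edge embedding error compounds over multiple hops. The dimensionality, number of hops, and noise scale are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hops, per_edge_error = 50, 4, 0.1

w = rng.normal(size=dim)      # traversal operator: translate by w (e.g. the "parent" relation)
x = rng.normal(size=dim)      # embedding of the starting entity

entity = x.copy()
for k in range(1, hops + 1):
    # Single-edge training only makes each consecutive pair *approximately* satisfy
    # entity_k ~ entity_{k-1} + w; model that residual as a small random error.
    entity = entity + w + rng.normal(scale=per_edge_error, size=dim)
    predicted = x + k * w     # where the path query lands after k traversal steps
    drift = np.linalg.norm(predicted - entity)
    print(f"hops={k}: distance between prediction and true entity = {drift:.3f}")
```

The printed distance grows with the number of hops, which is exactly the drift the slide depicts; path training penalizes this end-to-end gap directly.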

Page 113: Traversing Knowledge Graphs in Vector Space (full)

Cascading errors

Who is Tad Lincoln’s grandparent?

We hypothesize that the problem is due to cascading errors, which arise as a side-effect of how vector space models try to compress knowledge into low-dimensional space.

Here is a cartoon depiction of what we suspect is going on. Suppose that you want to know who Tad Lincoln’s grandparent is. And suppose that the parent traversal operation is represented by a translation to the right, as in this figure.

The single-edge training objective forces the vector for abraham_lincoln to be close to where it should be, denoted by the red circle. But because we are compressing facts into a low-dimensional space, and because we’re using a ranking objective, abraham lincoln never makes it to exactly where he should be. This results in some error, shown by the blue noise cloud.

After another step of traversal from abe lincoln to thomas_lincoln, you’ve accumulated more error.

As you traverse, the correct answer drifts farther and farther away from where you expect it to be.

Page 114: Traversing Knowledge Graphs in Vector Space (full)

Cascading errors

Who is Tad Lincoln’s grandparent?

path training

In contrast, the path training objective is directly sensitive to the gap between where thomas_lincoln is, and where he should be. This error is quite large and easier for the model to detect, compared to many small errors.

Page 115: Traversing Knowledge Graphs in Vector Space (full)

Example

We can actually check this hypothesis by looking at how much drift is happening in a path-trained model versus a single-edge trained model.

We can do this because each prefix of a path query is itself a path query. So, we can just look at the quality of the results at each stage.

Empirically, we indeed found that the path-trained model drifts less.

Page 116: Traversing Knowledge Graphs in Vector Space (full)

[2×2 grid: TRAINING (PATH, SINGLE-EDGE) × TASK (PATH: portugal/location/language?, SINGLE-EDGE: (pessoa, language, english)?)]

Okay, so we finished looking at the path task and saw that path training was better.

But you might suspect that it doesn’t fare so well on the single-edge task, since the training and test distributions no longer match.

We certainly had serious doubts about this ourselves, because multi-task training is notoriously tricky to get right.

Page 117: Traversing Knowledge Graphs in Vector Space (full)

Results: single-edge task

Surprisingly, path training actually does better on the single-edge task as well, across both datasets and all models.

The one exception is TransE on Freebase, where path training doesn’t hurt, but doesn’t help much.

Page 118: Traversing Knowledge Graphs in Vector Space (full)

Results: single-edge task

[Bar charts: mean quantile (scale 50-100) on Freebase and WordNet for Bilinear, Bilinear-Diag, and TransE on the single-edge task, each comparing Edge vs. Path training]

Surprisingly, path training actually does better on the single-edge task as well, across both datasets and all models.

The one exception is TransE on Freebase, where path training doesn’t hurt, but doesn’t help much.

Page 119: Traversing Knowledge Graphs in Vector Space (full)

[2×2 grid: TRAINING (PATH, SINGLE-EDGE) × TASK (PATH: portugal/location/language?, SINGLE-EDGE: (pessoa, language, english)?)]

Now, we definitely want an explanation for why path training does better even when the train and test distributions don’t match.

Page 120: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help on the single-edge task?

So, we’ve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.

This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.

To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.

Then, this greatly increases our prior belief that perhaps C is A’s child as well.

This inference can be written as a simple Horn clause. We see that the body of the Horn clause is actually a path from A to C.

Page 121: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help on the single-edge task?

• Paths in the graph can help infer other edges.

So, we’ve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.

This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.

To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.

Then, this greatly increases our prior belief that perhaps C is A’s child as well.

This inference can be written as a simple Horn clause. We see that the body of the Horn clause is actually a path from A to C.

Page 122: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help on the single-edge task?

• Paths in the graph can help infer other edges.
• Lao et al, 2010 (Path Ranking Algorithm)

So, we’ve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.

This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.

To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.

Then, this greatly increases our prior belief that perhaps C is A’s child as well.

This inference can be written as a simple Horn clause. We see that the body of the Horn clause is actually a path from A to C.

Page 123: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help on the single-edge task?

• Paths in the graph can help infer other edges.
• Lao et al, 2010 (Path Ranking Algorithm)
• Nickel et al, 2014

So, we’ve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.

This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.

To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.

Then, this greatly increases our prior belief that perhaps C is A’s child as well.

This inference can be written as a simple Horn clause. We see that the body of the Horn clause is actually a path from A to C.

Page 124: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help on the single-edge task?

• Paths in the graph can help infer other edges.
• Lao et al, 2010 (Path Ranking Algorithm)
• Nickel et al, 2014
• Neelakantan et al, 2015

So, we’ve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.

This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.

To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.

Then, this greatly increases our prior belief that perhaps C is A’s child as well.

This inference can be written as a simple Horn clause. We see that the body of the Horn clause is actually a path from A to C.

Page 125: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help on the single-edge task?

• Paths in the graph can help infer other edges.
• Lao et al, 2010 (Path Ranking Algorithm)
• Nickel et al, 2014
• Neelakantan et al, 2015

[Diagram: A --spouse--> B --child--> C]

So, we’ve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.

This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.

To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.

Then, this greatly increases our prior belief that perhaps C is A’s child as well.

This inference can be written as a simple Horn clause. We see that the body of the Horn clause is actually a path from A to C.

Page 126: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help on the single-edge task?

• Paths in the graph can help infer other edges.
• Lao et al, 2010 (Path Ranking Algorithm)
• Nickel et al, 2014
• Neelakantan et al, 2015

[Diagram: A --spouse--> B --child--> C; dashed edge A --child?--> C]

So, we’ve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.

This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.

To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.

Then, this greatly increases our prior belief that perhaps C is A’s child as well.

This inference can be written as a simple Horn clause. We see that the body of the Horn clause is actually a path from A to C.

Page 127: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help on the single-edge task?

• Paths in the graph can help infer other edges.
• Lao et al, 2010 (Path Ranking Algorithm)
• Nickel et al, 2014
• Neelakantan et al, 2015

[Diagram: A --spouse--> B --child--> C; dashed edge A --child?--> C]

spouse(A,B) ∧ child(B,C) ⇒ child(A,C)

So, we’ve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.

This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.

To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.

Then, this greatly increases our prior belief that perhaps C is A’s child as well.

This inference can be written as a simple Horn clause. We see that the body of the Horn clause is actually a path from A to C.
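As a toy illustration of that Horn clause (purely illustrative; the models in this talk capture this implicitly in vector space rather than symbolically), a length-2 path through spouse and child suggests a missing child edge:

```python
def propose_child_edges(triples):
    """Propose (A, child, C) whenever spouse(A, B) and child(B, C) hold."""
    proposals = set()
    for (a, r1, b) in triples:
        if r1 != "spouse":
            continue
        for (b2, r2, c) in triples:
            if b2 == b and r2 == "child":
                proposals.add((a, "child", c))
    return proposals - triples          # only edges not already in the graph

triples = {("a", "spouse", "b"), ("b", "child", "c")}
print(propose_child_edges(triples))     # {('a', 'child', 'c')}
```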

Page 128: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help knowledge base completion?

The next point is that if you cannot even model these paths correctly, then you have no chance at using them to infer missing edges.

If you can’t assert the body of a Horn clause, you can’t infer the head of the Horn clause.

And we know from earlier results that the single-edge model is not very good at modeling paths.

Page 129: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help knowledge base completion?

• If you cannot model paths correctly, you cannot use them to infer edges.

The next point is that if you cannot even model these paths correctly, then you have no chance at using them to infer missing edges.

If you can’t assert the body of a Horn clause, you can’t infer the head of the Horn clause.

And we know from earlier results that the single-edge model is not very good at modeling paths.

Page 130: Traversing Knowledge Graphs in Vector Space (full)

Why does path training help knowledge base completion?

• If you cannot model paths correctly, you cannot use them to infer edges.

[2×2 grid: TRAINING (PATH, SINGLE-EDGE) × TASK (PATH, SINGLE-EDGE)]

The next point is that if you cannot even model these paths correctly, then you have no chance at using them to infer missing edges.

If you can’t assert the body of a Horn clause, you can’t infer the head of the Horn clause.

And we know from earlier results that the single-edge model is not very good at modeling paths.

Page 131: Traversing Knowledge Graphs in Vector Space (full)

Path training helps learn Horn clauses

Based on this hunch, we did an experiment to assess whether path training was really helping models learn Horn clauses.

We grouped a lot of Horn clauses into 3 categories: high precision, low precision, and zero coverage.

Then, for each Horn clause, we measured how similar the traversal operation representing the body of the Horn clause was to the head of the Horn clause.

When the head and body are very similar, then the model is in some sense implementing the Horn clause.

Page 132: Traversing Knowledge Graphs in Vector Space (full)

Path training helps learn Horn clauses

High-precision:  parent(A,B) ∧ place_of_birth(B,C) ⇒ nationality(A,C)

Based on this hunch, we did an experiment to assess whether path training was really helping models learn Horn clauses.

We grouped a lot of Horn clauses into 3 categories: high precision, low precision, and zero coverage.

Then, for each Horn clause, we measured how similar the traversal operation representing the body of the Horn clause was to the head of the Horn clause.

When the head and body are very similar, then the model is in some sense implementing the Horn clause.

Page 133: Traversing Knowledge Graphs in Vector Space (full)

Path training helps learn Horn clauses

High-precision:  parent(A,B) ∧ place_of_birth(B,C) ⇒ nationality(A,C)
Low-precision:   location(A,B) ∧ borders(B,C) ⇒ nationality(A,C)

Based on this hunch, we did an experiment to assess whether path training was really helping models learn Horn clauses.

We grouped a lot of Horn clauses into 3 categories: high precision, low precision, and zero coverage.

Then, for each Horn clause, we measured how similar the traversal operation representing the body of the Horn clause was to the head of the Horn clause.

When the head and body are very similar, then the model is in some sense implementing the Horn clause.

Page 134: Traversing Knowledge Graphs in Vector Space (full)

Path training helps learn Horn clauses

High-precision:  parent(A,B) ∧ place_of_birth(B,C) ⇒ nationality(A,C)
Low-precision:   location(A,B) ∧ borders(B,C) ⇒ nationality(A,C)
Zero-coverage:   child(A,B) ∧ gender(B,C) ⇒ nationality(A,C)

Based on this hunch, we did an experiment to assess whether path training was really helping models learn Horn clauses.

We grouped a lot of Horn clauses into 3 categories: high precision, low precision, and zero coverage.

Then, for each Horn clause, we measured how similar the traversal operation representing the body of the Horn clause was to the head of the Horn clause.

When the head and body are very similar, then the model is in some sense implementing the Horn clause.

Page 135: Traversing Knowledge Graphs in Vector Space (full)

Path training helps learn Horn clauses

High-precision:  parent(A,B) ∧ place_of_birth(B,C) ⇒ nationality(A,C)
Low-precision:   location(A,B) ∧ borders(B,C) ⇒ nationality(A,C)
Zero-coverage:   child(A,B) ∧ gender(B,C) ⇒ nationality(A,C)

W_parent · W_place_of_birth ≈ W_nationality ?

Based on this hunch, we did an experiment to assess whether path training was really helping models learn Horn clauses.

We grouped a lot of Horn clauses into 3 categories: high precision, low precision, and zero coverage.

Then, for each Horn clause, we measured how similar the traversal operation representing the body of the Horn clause was to the head of the Horn clause.

When the head and body are very similar, then the model is in some sense implementing the Horn clause.
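Here is a minimal sketch of that check for the bilinear model, where the body of a Horn clause corresponds to a product of relation matrices and the head to a single relation matrix. The random matrices and the Frobenius distance are illustrative assumptions; the slide does not commit to this exact similarity measure.

```python
import numpy as np

def operator_distance(relation_mats, body_relations, head_relation):
    """Distance between the composed traversal operator for a clause body,
    e.g. W_parent @ W_place_of_birth, and the operator for its head, e.g. W_nationality."""
    d = next(iter(relation_mats.values())).shape[0]
    composed = np.eye(d)
    for r in body_relations:
        composed = composed @ relation_mats[r]
    return float(np.linalg.norm(composed - relation_mats[head_relation]))

d = 10
relation_mats = {r: np.random.randn(d, d)
                 for r in ["parent", "place_of_birth", "nationality"]}
print(operator_distance(relation_mats, ["parent", "place_of_birth"], "nationality"))
```

Under the hypothesis above, a path-trained model should show a smaller distance for high-precision clauses than a single-edge-trained model does.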

Page 136: Traversing Knowledge Graphs in Vector Space (full)

Path training helps learn Horn clauses

We found that when a Horn clause has high precision, path training pulls the body and head of the Horn clause closer together.

When it is low precision, path training does not care too much.

And when a Horn clause has zero coverage, path training has almost no effect in most cases.

So this definitely provides some interesting evidence in favor of our hypothesis, but we certainly think this could be investigated more carefully.

Page 137: Traversing Knowledge Graphs in Vector Space (full)

[Table: Graph databases vs. Vector space models, compared on supporting compositional queries and handling incompleteness]

To recap…

So, to recap! We’ve demonstrated how to take existing vector space models and generalize them to handle path queries.

In the process, we also discovered that path training leads to stronger performance on predicting missing facts.

We have some theories on why this is, but there is still more to investigate.

Page 138: Traversing Knowledge Graphs in Vector Space (full)

Connections and speculative thoughts

Since we still have a little time, I’d like to end with some connections to other work, and speculative thoughts.

Page 139: Traversing Knowledge Graphs in Vector Space (full)

RNNs

The first connection is to Recurrent Neural Networks.

Repeated application of traversal operators can be thought of as implementing a recurrent neural network.

There’s been a lot of work on RNNs, and we think this helps build a connection between RNNs and knowledge bases. Neelakantan et al also previously explored using RNNs for knowledge base completion and saw great results.

Page 140: Traversing Knowledge Graphs in Vector Space (full)

RNNs

• Repeated application of traversal operators can be thought of as implementing a recurrent neural network.

The first connection is to Recurrent Neural Networks.

Repeated application of traversal operators can be thought of as implementing a recurrent neural network.

There’s been a lot of work on RNNs, and we think this helps build a connection between RNNs and knowledge bases. Neelakantan et al also previously explored using RNNs for knowledge base completion and saw great results.

Page 141: Traversing Knowledge Graphs in Vector Space (full)

RNNs

• Repeated application of traversal operators can be thought of as implementing a recurrent neural network.

T_{r_k} ∘ ... ∘ T_{r_2} ∘ T_{r_1}(x_s)

The first connection is to Recurrent Neural Networks.

Repeated application of traversal operators can be thought of as implementing a recurrent neural network.

There’s been a lot of work on RNNs, and we think this helps build a connection between RNNs and knowledge bases. Neelakantan et al also previously explored using RNNs for knowledge base completion and saw great results.

Page 142: Traversing Knowledge Graphs in Vector Space (full)

RNNs

• Repeated application of traversal operators can be thought of as implementing a recurrent neural network.

T_{r_k} ∘ ... ∘ T_{r_2} ∘ T_{r_1}(x_s)

• Initial hidden state: x_s

• Inputs: r_1, r_2, …, r_k

The first connection is to Recurrent Neural Networks.

Repeated application of traversal operators can be thought of as implementing a recurrent neural network.

There’s been a lot of work on RNNs, and we think this helps build a connection between RNNs and knowledge bases. Neelakantan et al also previously explored using RNNs for knowledge base completion and saw great results.

Page 143: Traversing Knowledge Graphs in Vector Space (full)

RNNs

• Repeated application of traversal operators can be thought of as implementing a recurrent neural network.

T_{r_k} ∘ ... ∘ T_{r_2} ∘ T_{r_1}(x_s)

• Initial hidden state: x_s

• Inputs: r_1, r_2, …, r_k

[Neelakantan et al, 2015], [Graves et al, 2013]

The first connection is to Recurrent Neural Networks.

Repeated application of traversal operators can be thought of as implementing a recurrent neural network.

There’s been a lot of work on RNNs, and we think this helps build a connection between RNNs and knowledge bases. Neelakantan et al also previously explored using RNNs for knowledge base completion and saw great results.
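A minimal sketch of this correspondence, with a hypothetical parameterization: relation-specific affine maps plus a nonlinearity play the role of the traversal operators, the starting entity vector is the initial hidden state, and the relation sequence is the input sequence.

```python
import numpy as np

def traverse_as_rnn(x_s, relations, W, b, phi=np.tanh):
    """View T_rk(...(T_r1(x_s))) as an RNN:
    h_0 = x_s, and at step i the input is relation r_i with update
    h_i = phi(W[r_i] @ h_{i-1} + b[r_i])   (one possible choice of traversal operator)."""
    h = x_s
    for r in relations:
        h = phi(W[r] @ h + b[r])
    return h

d = 8
rng = np.random.default_rng(0)
W = {r: rng.normal(size=(d, d)) for r in ["location", "language"]}
b = {r: rng.normal(size=d) for r in ["location", "language"]}
print(traverse_as_rnn(rng.normal(size=d), ["location", "language"], W, b))
```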

Page 144: Traversing Knowledge Graphs in Vector Space (full)

Matrix factorization

[Figure: the adjacency matrix factorized as E × W × E^T]

[Nickel et al, 2013], [Bordes et al, 2011]

One angle we did not cover much at all is the interpretation of knowledge graph embedding as low rank tensor or matrix factorization.

In this view, we are taking an adjacency matrix and factorizing it into 3 parts.

A matrix of entity embeddings E, the compressed adjacency matrix W, and E again (transposed).

Page 145: Traversing Knowledge Graphs in Vector Space (full)

Matrix factorization

[Figure: a power of the adjacency matrix (e.g. cubed) factorized as E × W^3 × E^T]

With path training, it is as if we raised the adjacency matrix to a few powers and are now performing low-rank matrix factorization on the higher-degree adjacency matrix.
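A small numeric illustration of this view (a sketch only, restricted to a single relation): powers of the adjacency matrix count paths, and path training asks the same low-rank factors to explain those powers as well.

```python
import numpy as np

# Toy single-relation graph over 5 entities: a chain 0 -> 1 -> 2 -> 3 -> 4.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = 1.0

# The k-th power of the adjacency matrix counts length-k paths between entities.
A3 = np.linalg.matrix_power(A, 3)
print(A3)   # A3[0, 3] == 1: exactly one length-3 path from entity 0 to entity 3

# Single-edge training of the bilinear model resembles fitting A ≈ E @ W @ E.T;
# path training additionally encourages E @ matrix_power(W, k) @ E.T ≈ A^k.
```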

Page 146: Traversing Knowledge Graphs in Vector Space (full)

Graph-like data

• All kinds of datasets are graphs

open domain relations, e.g. kelvin --finished_giving--> his_talk --was_considered--> okay

[Riedel et al, 2013], [Fader et al, 2011], [Banko et al 2007]

The final connection I want to make is to other graph-like data. There are many data sources that have graph-like structure.

For example, open-domain relations also form a knowledge graph, and it would be interesting to see if path training can be helpful there too.

Page 147: Traversing Knowledge Graphs in Vector Space (full)

Graph-like data

• All kinds of datasets are graphs

textual entailment: nodes are sentences (“the boy happily ate ice cream”, “the child enjoyed dessert”, “the child hated dessert”) and edges are relations such as entailment and negation

[Bowman et al, 2015], [Dagan et al, 2009]

Textual entailment is another domain that has graph-like structure. Here, each node in the graph is actually a sentence, and each edge is an entailment relation.

Since we can embed sentences into vector space, path training might serve as a useful form of regularization on the sentence embedding function.

Page 148: Traversing Knowledge Graphs in Vector Space (full)

Graph-like data

• All kinds of datasets are graphs

word co-occurrences, e.g. a weighted graph over the words king, queen, and throne with co-occurrence probabilities (0.1, 0.4, ...) as edge weights

[Levy, 2014], [Pennington, 2014], [Mikolov, 2013]

Finally, word co-occurrence probabilities form a very dense graph structure, and many word embedding models have been linked to matrix factorization, so it would be interesting to see if path training could be helpful there as well.

Page 149: Traversing Knowledge Graphs in Vector Space (full)

Thank you!