Introduction to Categorical Compositional Distributional Semantics
Lecture 4: Using Density Matrices for Ambiguity and Entailment

Dimitri Kartsaklis¹   Martha Lewis²

¹Department of Theoretical and Applied Linguistics, University of Cambridge
²Department of Computer Science, University of Oxford

ESSLLI 2017
Toulouse, France
D. Kartsaklis, M. Lewis Introduction to CCDS - Lecture 4: Density Matrices 1/47
What we’ve seen so far
We’ve described compositional and distributional models, and looked at elementary category theory.

We’ve combined these together to give the core of the CCDS model.

We’ve described some of the tasks that CCDS can be used for.

We’ve seen how adding extra structure in the form of Frobenius algebras allows us to model functional words such as prepositions, relative pronouns and so on.
This talk in a nutshell
Inspired by categorical quantum mechanics, we show how the CCDS model can be extended.

We show how lexical ambiguity can be modelled, and how using ambiguous words in a sentence can disambiguate them.

We discuss lexical entailment, and use an order on density matrices to give a type of entailment.

We show how the notion of lexical entailment we use interacts well with compositionality.
Outline
1 Composition and lexical ambiguity
2 Open system quantum semantics
3 Textual entailment
4 Graded hyponymy
5 From theory to practice
Ambiguity in word spaces
Compositional distributional models of meaning are mainly based on ambiguous semantic spaces:
[2-D scatter plot* of word vectors: the senses of organ form two distinct clusters, a medical cluster (donor, transplant, transplantation, liver, kidney, lung: organ (medicine)) and a music cluster (accompaniment, bass, orchestra, hymn, recital, violin, concert: organ (music)), with the single vector for organ lying between them.

*real vectors projected onto a 2-dimensional space using MDS]
Lexical ambiguity and composition
Is this the best we can do?
→organ = →organ_music + →organ_med

Then why not have vectors like this:

→guitar + →kidney

...or even this?

→book + →banana

Kartsaklis and Sadrzadeh (EMNLP 2013, ACL 2014):

Using a step of prior disambiguation on the word vectors/tensors before composition improves the quality of the composed vectors.
Homonymy and polysemy (1/2)
We distinguish between two types of lexical ambiguity:
In cases of homonymy (organ, bank, vessel etc.), due to some historical accident the same word is used to describe two (or more) completely unrelated concepts.

Polysemy relates to subtle deviations between the different senses of the same word.

Example:

The distinction between the financial sense and the river sense of bank is a case of homonymy;

Within the financial sense, a distinction between the abstract concept of bank as an institution and the concrete building is a case of polysemy.
Homonymy and polysemy (2/2)
Example #1: “I went to the bank to open a savings account”
The word bank is used with its financial sense
The sayer refers to both of the polysemous meanings of bank_fin (institution and building) at the same time

Example #2: “I went to the bank”

The word bank is probably used with the financial sense in mind (because most of the time this is the case)

However, a small possibility that the sayer has actually visited a river bank still exists

Main point:

Polysemy: relatively coherent and self-contained concepts
Homonymy: lack of specification
Setting our goals
The problem:
How can we formalize the explicit treatment of lexical ambiguity in the categorical model of Coecke et al.?
We seek a model that will allow us:
1 to express homonymous words as probabilistic mixings of their individual meanings;

2 to retain the ambiguity until the presence of sufficient context that will eventually resolve it at composition time;

3 to achieve all the above in the multi-linear setting imposed by the vector space semantics of our original model.
Open system quantum semantics
A little quantum theory
Quantum mechanics and distributional models of meaning are both based on vector space semantics

The state of a quantum system is represented by a vector in a Hilbert space H. Fixing a basis for H:

|ψ⟩ = c_1|k_1⟩ + c_2|k_2⟩ + ... + c_n|k_n⟩

we take |ψ⟩ to be a quantum superposition of the basis states {|k_i⟩}_i.

i.e. the quantum system co-exists in all basis states in parallel, with strengths denoted by the corresponding weights

Such a state is called a pure state.
Word vectors as quantum states
We take words to be quantum systems, and word vectors to be specific states of these systems:

|w⟩ = c_1|k_1⟩ + c_2|k_2⟩ + ... + c_n|k_n⟩

Each element of the ONB {|k_i⟩}_i is essentially an atomic symbol:

|cat⟩ = 12|milk′⟩ + 8|cute′⟩ + ... + 0|bank′⟩

In other words, a word vector is a probability distribution over atomic symbols

|w⟩ is a pure state: when word w is seen alone, it is like co-occurring with all the basis words, with strengths denoted by the various coefficients.
Encoding homonymy with mixed states
Ideally, every disjoint meaning of a homonymous word must be represented by a distinct pure state:

|bank_fin⟩ = a_1|k_1⟩ + a_2|k_2⟩ + ... + a_n|k_n⟩
|bank_riv⟩ = b_1|k_1⟩ + b_2|k_2⟩ + ... + b_n|k_n⟩

{a_i}_i ≠ {b_i}_i, since the financial sense and the river sense are expected to be seen in drastically different contexts

So we have two distinct states describing the same system

We cannot be certain under which state our system may be found; we only know that the former state is more probable than the latter

In other words, the system is better described by a probabilistic mixture of pure states, i.e. a mixed state.
Density operators
Mathematically, a mixed state is represented by a density operator:

ρ(w) = Σ_i p_i |s_i⟩⟨s_i|

For example:

ρ(bank) = 0.80|bank_fin⟩⟨bank_fin| + 0.20|bank_riv⟩⟨bank_riv|
A density operator is a probability distribution over vectors.
Properties of a density operator ρ:

Positive semi-definite: ⟨v|ρ|v⟩ ≥ 0 ∀v ∈ H
Of trace one: Tr(ρ) = 1
Self-adjoint: ρ = ρ†
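The three defining properties can be checked numerically. A minimal sketch in NumPy, where the two sense vectors for bank are illustrative assumptions rather than real distributional data:

```python
import numpy as np

# Toy 3-d pure states for the two senses of "bank" (illustrative assumptions)
fin = np.array([0.9, 0.4, 0.1]); fin /= np.linalg.norm(fin)
riv = np.array([0.1, 0.3, 0.95]); riv /= np.linalg.norm(riv)

# Mixed state rho(bank) = 0.8 |fin><fin| + 0.2 |riv><riv|
rho = 0.8 * np.outer(fin, fin) + 0.2 * np.outer(riv, riv)

print(np.isclose(np.trace(rho), 1.0))               # trace one
print(np.allclose(rho, rho.T))                      # self-adjoint (real case)
print(np.all(np.linalg.eigvalsh(rho) >= -1e-12))    # positive semi-definite
```

The trace-one property follows because each |sᵢ⟩⟨sᵢ| built from a unit vector has trace one and the mixing weights sum to one.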
Quantum measurements
Density operators interact with observables to produce quantum measurements.

Assuming a system in state |ψ⟩ and an observable A with eigen-decomposition A = Σ_i e_i |e_i⟩⟨e_i|:

⟨A⟩_ψ = ⟨ψ|A|ψ⟩ = Σ_i |⟨e_i|ψ⟩|² e_i

Born rule: |⟨e_i|ψ⟩|² is the probability of observing e_i as the result of the measurement.

For a density operator ρ = Σ_j p_j |ψ_j⟩⟨ψ_j| and the same observable A, we have:

⟨A⟩_ρ = Σ_j p_j ⟨ψ_j|A|ψ_j⟩ = Σ_j Σ_i p_j e_i |⟨e_i|ψ_j⟩|² = Tr(ρA)
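The identity ⟨A⟩_ρ = Tr(ρA) can be verified numerically: by linearity of the trace, the trace rule must agree with the mixture of pure-state expectations. A small sketch (the observable and states are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
A = (M + M.T) / 2                        # a toy self-adjoint observable (assumption)

psi1 = np.array([1.0, 0.0, 0.0])
psi2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)
p1, p2 = 0.7, 0.3
rho = p1 * np.outer(psi1, psi1) + p2 * np.outer(psi2, psi2)

# <A>_rho computed two ways: the trace rule, and the weighted pure expectations
exp_trace = np.trace(rho @ A)
exp_mix = p1 * (psi1 @ A @ psi1) + p2 * (psi2 @ A @ psi2)
print(np.isclose(exp_trace, exp_mix))    # the two computations agree
```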
Complete positivity: The CPM construction
In order to apply the new formulation to the categorical model of Coecke et al. we need:

to replace word vectors with density operators

to replace linear maps with completely positive linear maps, i.e. maps that send positive operators to positive operators while respecting the monoidal structure.

Selinger (2007):

Any dagger compact closed category is associated with a category in which the objects are the objects of the original category, but the maps are completely positive maps.
Categorical model of meaning: Reprise
The passage from a grammar to distributional meaning is defined according to the following composition:

Preg --F--> FHilb --L--> CPM(FHilb)

The meaning of a sentence w_1 w_2 ... w_n with grammatical derivation α becomes:

L(F(α))(ρ(w_1) ⊗_CPM ρ(w_2) ⊗_CPM ... ⊗_CPM ρ(w_n))

Composition takes this form:

Subject-intransitive verb: ρ_IN = Tr_N(ρ(v) ∘ (ρ(s) ⊗ 1_S))
Adjective-noun: ρ_AN = Tr_N(ρ(adj) ∘ (1_N ⊗ ρ(n)))
Subj-trans. verb-Obj: ρ_TS = Tr_N,N(ρ(v) ∘ (ρ(s) ⊗ 1_S ⊗ ρ(o)))
Using Frobenius algebras
Compact closed categories are simple structures: there is a tensor product and ε and η maps

Kartsaklis et al. (COLING 2012), Sadrzadeh et al. (MoL 2013): Take advantage of the fact that every vector space with a fixed basis has a Frobenius structure over it

...i.e. there exist maps for copying and deleting the basis:

Δ :: |i⟩ ↦ |i⟩ ⊗ |i⟩    μ :: |i⟩ ⊗ |i⟩ ↦ |i⟩

Advantages:

Provides a way to build tensors for relational words, and simplifies the calculations

Introduces a second form of composition (in the case of vector spaces, element-wise vector multiplication)

Useful in modelling various linguistic phenomena, e.g. intonation (Kartsaklis and Sadrzadeh, MoL 2015)
Frobenius algebras and density operators
Complexity is reduced: tensors of order n become tensors of order n − 2

Composition becomes the Hadamard product of density operators:

ρ_CS = ρ(s) ⊙ Tr(ρ(v) ∘ (1_N ⊗ ρ(o)))
ρ_CO = Tr(ρ(v) ∘ (ρ(s) ⊗ 1_N)) ⊙ ρ(o)

Piedeleu et al. (2015): The new formulation also allows for non-commutative versions of Frobenius algebras:

ρ_CS = ρ(s) ∘ Tr(ρ(v) ∘ (1_N ⊗ ρ(o)))
ρ_CO = Tr(ρ(v) ∘ (ρ(s) ⊗ 1_N)) ∘ ρ(o)
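The copy-subject composition ρ_CS can be sketched numerically: partial-trace the verb operator against the object, then take the Hadamard product with the subject. The dimension and the random operators below are assumptions for illustration only:

```python
import numpy as np

d = 2  # toy dimension for the noun space N (an assumption)

def random_density(dim, seed):
    """A random density matrix: positive semi-definite with trace one."""
    M = np.random.default_rng(seed).normal(size=(dim, dim))
    rho = M @ M.T
    return rho / np.trace(rho)

rho_s = random_density(d, 1)        # subject
rho_o = random_density(d, 2)        # object
rho_v = random_density(d * d, 3)    # transitive verb: an operator on N ⊗ N

# Tr over the object leg: Tr_2( rho_v (1_N ⊗ rho_o) )
X = (rho_v @ np.kron(np.eye(d), rho_o)).reshape(d, d, d, d)
verb_obj = X.trace(axis1=1, axis2=3)

rho_CS = rho_s * verb_obj           # Hadamard product with the subject
```

Note that Tr_2(ρ_v(1 ⊗ ρ_o)) = Tr_2((1 ⊗ √ρ_o) ρ_v (1 ⊗ √ρ_o)), since the partial trace is cyclic with respect to operators acting only on the traced leg, so verb_obj is again a positive operator.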
Textual entailment
Distributional Inclusion Hypothesis and distributional semantics

In distributional semantics terms, the DIH says that whenever u entails v, the contexts of u are included in the contexts of v.
Example: Imagine a toy corpus {‘a boy runs’, ‘a person runs’, ‘a person sleeps’}; boy entails person since:

{run} ⊆ {run, sleep}

Intuition: A person can do everything that can be done by a boy, and more.

For two words u and v we say:

u ⊢ v whenever F(→u) ⊆ F(→v)
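The toy example can be spelled out directly (a minimal sketch; the corpus is the one on the slide, and the feature set of a noun is taken to be the verbs it occurs with):

```python
# The toy corpus from the slide
corpus = ['a boy runs', 'a person runs', 'a person sleeps']

def F(word):
    """Context (feature) set: the verbs co-occurring with the word."""
    return {s.split()[-1] for s in corpus if word in s.split()}

def entails(u, v):
    # u ⊢ v whenever F(u) ⊆ F(v)
    return F(u) <= F(v)

print(entails('boy', 'person'))   # True: {runs} ⊆ {runs, sleeps}
print(entails('person', 'boy'))   # False: the inclusion fails in reverse
```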
Extending feature inclusion to CDMs
In the presence of a compositional operator, feature inclusion adheres to set-theoretic properties.

For element-wise composition, we have:

F(→v_1 + ... + →v_n) = F(→v_1) ∪ ... ∪ F(→v_n)
F(→v_1 ⊙ ... ⊙ →v_n) = F(→v_1) ∩ ... ∩ F(→v_n)

It is also the case that:

F(max(→v_1, ..., →v_n)) = F(→v_1) ∪ ... ∪ F(→v_n)
F(min(→v_1, ..., →v_n)) = F(→v_1) ∩ ... ∩ F(→v_n)
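These four identities can be checked directly on toy vectors. A sketch, assuming non-negative count vectors (so that addition cannot cancel a feature):

```python
import numpy as np

def F(v, eps=1e-12):
    """Feature set of a vector: the indices of its non-zero coordinates."""
    return {i for i, x in enumerate(v) if abs(x) > eps}

# Toy non-negative count vectors (illustrative assumptions)
v1 = np.array([2.0, 0.0, 1.0, 0.0])
v2 = np.array([1.0, 3.0, 1.0, 0.0])

print(F(v1 + v2) == F(v1) | F(v2))              # addition: union
print(F(v1 * v2) == F(v1) & F(v2))              # Hadamard: intersection
print(F(np.maximum(v1, v2)) == F(v1) | F(v2))   # max: union
print(F(np.minimum(v1, v2)) == F(v1) & F(v2))   # min: intersection
```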
Generic feature inclusion for tensor-based composition
In the generic case we have:

⎛ w_11 ... w_1n ⎞   ⎛ v_1 ⎞
⎜ w_21 ... w_2n ⎟ × ⎜ ... ⎟
⎜  ...      ... ⎟   ⎝ v_n ⎠
⎝ w_m1 ... w_mn ⎠

Viewing the matrix as a list of column vectors (→w_1, →w_2, ..., →w_n), the above becomes:

v_1 →w_1 + v_2 →w_2 + ... + v_n →w_n

Feature set of generic tensor-based composition:

F(w × →v) = ⋃_{v_i ≠ 0} F(→w_i) = ⋃_i F(→w_i)|_{F(v_i)}
Feature inclusion for relational tensor-based model
Grefenstette and Sadrzadeh (2011), extensional approach:

Relational words:

→adj = Σ_i →Noun_i    →verb_itr = Σ_i →Sbj_i    verb_tr = Σ_i →Sbj_i ⊗ →Obj_i

Compound representations:

→an = →adj ⊙ →noun    →sv = →verb_itr ⊙ →subj    svo = verb_tr ⊙ (→subj ⊗ →obj)

Feature inclusion behaviour:

F(→sv) = ⋃_i F(→Sbj_i) ∩ F(→s)    F(→vo) = ⋃_i F(→Obj_i) ∩ F(→o)

F(svo) = ⋃_i F(→Sbj_i) × F(→Obj_i) ∩ F(→s) × F(→o)
Feature inclusion for projective tensor-based models
Kartsaklis and Sadrzadeh (2016), projective approach:

Relational words:

v_itv := Σ_i →Sbj_i ⊗ →Sbj_i    v_vp := Σ_i →Obj_i ⊗ →Obj_i

v_trv := Σ_i →Sbj_i ⊗ ((→Sbj_i + →Obj_i)/2) ⊗ →Obj_i

Sentence vector after composition:

→sv = →s^T × v_itv = Σ_i ⟨→Sbj_i|→s⟩ →Sbj_i

Feature inclusion behaviour:

F(→sv) = ⋃_i F(→Sbj_i)|_{F(⟨→Sbj_i|→s⟩)}
Preservation of entailment at the sentence level
In element-wise composition and tensor-based composition, entailment extends from word level to sentence level:

For two sentences s_1 = u_1 ... u_n and s_2 = v_1 ... v_n for which u_i ⊢ v_i, it is always the case that s_1 ⊢ s_2.

E.g. consider two intransitive sentences for which F(→subj_1) ⊆ F(→subj_2) and F(→verb_1) ⊆ F(→verb_2); we have:

F(→subj_1) ∩ F(→verb_1) ⊆ F(→subj_2) and F(→subj_1) ∩ F(→verb_1) ⊆ F(→verb_2)

and consequently:

F(→subj_1) ∩ F(→verb_1) ⊆ F(→subj_2) ∩ F(→verb_2)
(Proof for the tensor case: Balkır, Kartsaklis, Sadrzadeh (2016))
Graded hyponymy
Entailment and hyponymy
We use density matrices, or more generally, positive operators to represent words.

Positive operators A, B have the Löwner ordering: A ⊑ B ⇐⇒ B − A is positive.

We use this ordering to represent entailment, and introduce a graded version, useful for linguistic phenomena.

We will show that graded entailment lifts compositionally to sentence level.
Words as positive operators
A positive operator P is self-adjoint and ∀v ∈ V. ⟨v|P|v⟩ ≥ 0

Density matrices are convex combinations of projectors: ρ = Σ_i p_i |v_i⟩⟨v_i|, s.t. Σ_i p_i = 1

We can view concepts as collections of items, with p_i indicating relative frequency.

For example:

⟦pet⟧ = p_d|dog⟩⟨dog| + p_c|cat⟩⟨cat| + p_t|tarantula⟩⟨tarantula| + ...

where ∀i. p_i ≥ 0 and Σ_i p_i = 1

There are various choices for normalisation of the density matrices...
A transitive sentence in CPM(FHilb)
Suppose we have finite-dimensional real Hilbert spaces N and S, and vectors:

|Annie⟩, |Betty⟩, |Clara⟩, |beer⟩, |wine⟩ ∈ N
|likes⟩, |appreciates⟩ ∈ N ⊗ S ⊗ N

Then, we can set:

⟦the sisters⟧ = (1/3)(|Annie⟩⟨Annie| + |Betty⟩⟨Betty| + |Clara⟩⟨Clara|)
⟦drinks⟧ = (1/2)(|beer⟩⟨beer| + |wine⟩⟨wine|)
⟦enjoy⟧ = (1/2)(|like⟩⟨like| + |appreciate⟩⟨appreciate|)
A transitive sentence in CPM(FHilb)
Then s = The sisters enjoy drinks is given by:

⟦s⟧ = (ε ⊗ 1 ⊗ ε)(⟦the sisters⟧ ⊗ ⟦enjoy⟧ ⊗ ⟦drinks⟧)

[String diagram for “The sisters enjoy drinks”, shown before and after the ε-reductions]
Entailment for Positive Operators
We order positive operators using the Löwner order and interpret this as entailment:

A ⊑ B ⇐⇒ B − A is positive

As the Löwner order restricts to the usual ordering on projection operators, we can embed quantum logic [Birkhoff and Von Neumann, 1936] within the poset of projection operators, providing a direct link to existing theory.
So how do we do graded hyponymy?
Recall that positive operators A, B have the Löwner ordering A ⊑ B ⇐⇒ B − A is positive.

We say that A is a hyponym of B if A ⊑ B

We say that A is a k-hyponym of B for a given value of k in the range (0, 1], and write A ⊑_k B, if:

B − kA is positive

We are interested in the maximum such k.

Theorem

For positive self-adjoint matrices A, B such that supp(A) ⊆ supp(B), the maximum k such that B − kA ≥ 0 is given by 1/λ, where λ is the maximum eigenvalue of B⁺A (B⁺ the Moore-Penrose pseudoinverse of B).
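The theorem can be verified numerically. A sketch, where A is a pure state contained in the mixture B (the vectors and weights are illustrative assumptions):

```python
import numpy as np

def max_k(A, B):
    """Largest k with B - kA positive, assuming supp(A) ⊆ supp(B):
    k = 1 / lambda_max(B⁺A), with B⁺ the Moore-Penrose pseudoinverse."""
    lam = np.max(np.linalg.eigvals(np.linalg.pinv(B) @ A).real)
    return 1.0 / lam

# Toy example: A a pure state, B a mixture whose support contains A's
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
A = np.outer(a, a)
B = 0.25 * np.outer(a, a) + 0.75 * np.outer(b, b)

k = max_k(A, B)
print(np.isclose(k, 0.25))                               # k matches the mixing weight of a
print(np.all(np.linalg.eigvalsh(B - k * A) >= -1e-10))   # B - kA sits on the PSD boundary
```

Here B⁺A = diag(4, 0), so λ = 4 and k = 1/4; B − (1/4)A = diag(0, 0.75), which is positive with a zero eigenvalue, confirming that k is maximal.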
Properties of k-hyponymy
Reflexivity: k-hyponymy is reflexive for k = 1.

Symmetry: k-hyponymy is anti-symmetric for k = 1, but neither symmetric nor anti-symmetric for k ∈ (0, 1).

Transitivity: k-hyponymy satisfies a version of transitivity. Suppose A ⊑_k B and B ⊑_l C. Then A ⊑_kl C.

Continuity: For A ⊑_k B, when there is a small perturbation to A, there is a correspondingly small decrease in the value of k. The perturbation must lie in the support of B, but can introduce off-diagonal elements.
k-hyponymy interacts well with compositionality
We would like our notion of entailment to work at the sentence level.

Since sentences are represented as positive operators, we can compare them directly.

If sentences have similar structure, we can also give a lower bound on the entailment strength between sentences, based on the entailment strengths between the words in the sentences.

Example

Suppose ⟦dog⟧ ⊑_k ⟦pet⟧ and ⟦park⟧ ⊑_l ⟦field⟧. Then

⟦My dog runs in the park⟧ ⊑_? ⟦My pet runs in the field⟧
k-hyponymy interacts well with compositionality
Theorem

Let Φ = A_1 ... A_n and Ψ = B_1 ... B_n be two positive phrases of the same length and grammatical structure ϕ. Let their corresponding density matrices be denoted by ⟦A_1⟧, ..., ⟦A_n⟧ and ⟦B_1⟧, ..., ⟦B_n⟧ respectively. Suppose that ⟦A_i⟧ ⊑_{k_i} ⟦B_i⟧ for i ∈ {1, ..., n} and some k_i ∈ (0, 1]. Then:

ϕ(Φ) ⊑_{k_1 ··· k_n} ϕ(Ψ),

so k_1 ··· k_n provides a lower bound on the extent to which ϕ(Φ) entails ϕ(Ψ)
Example
Suppose we have pure states ⟦nibble⟧, ⟦scoff⟧, ⟦cake⟧, ⟦chocolate⟧. The more general eat and sweets are given by:

⟦eat⟧ = (1/2)(⟦nibble⟧ + ⟦scoff⟧),    ⟦sweets⟧ = (1/2)(⟦cake⟧ + ⟦chocolate⟧)

Then

⟦scoff⟧ ⊑_{1/2} ⟦eat⟧,    ⟦cake⟧ ⊑_{1/2} ⟦sweets⟧

We consider the sentences:

⟦s_1⟧ = ϕ(⟦Mary⟧ ⊗ ⟦scoffs⟧ ⊗ ⟦cake⟧)
⟦s_2⟧ = ϕ(⟦Mary⟧ ⊗ ⟦eats⟧ ⊗ ⟦sweets⟧)

We will show that ⟦s_1⟧ ⊑_{kl} ⟦s_2⟧ where kl = 1/2 × 1/2 = 1/4
Example (cont’d)
Expanding ⟦s_2⟧ we obtain:

⟦s_2⟧ = ϕ(⟦Mary⟧ ⊗ (1/2)(⟦nibbles⟧ + ⟦scoffs⟧) ⊗ (1/2)(⟦cake⟧ + ⟦choc⟧))
      = (1/4)⟦s_1⟧ + (1/4)(ϕ(⟦Mary⟧ ⊗ ⟦scoffs⟧ ⊗ ⟦choc⟧)
                          + ϕ(⟦Mary⟧ ⊗ ⟦nibbles⟧ ⊗ ⟦cake⟧)
                          + ϕ(⟦Mary⟧ ⊗ ⟦nibbles⟧ ⊗ ⟦choc⟧))

Therefore:

⟦s_2⟧ − (1/4)⟦s_1⟧ = (1/4)(ϕ(⟦Mary⟧ ⊗ ⟦scoffs⟧ ⊗ ⟦choc⟧)
                          + ϕ(⟦Mary⟧ ⊗ ⟦nibbles⟧ ⊗ ⟦cake⟧)
                          + ϕ(⟦Mary⟧ ⊗ ⟦nibbles⟧ ⊗ ⟦choc⟧))

We can see that ⟦s_2⟧ − (1/4)⟦s_1⟧ is positive, since positivity is preserved under addition and the tensor product, and ϕ is completely positive. Therefore:

⟦s_1⟧ ⊑_{kl} ⟦s_2⟧

as required.
Mixing entailment and ambiguity
The CPM construction applies to any dagger compact closed category, even to CPM(FHilb) itself

In other words, we can have density operators of density operators,

...i.e. a probability distribution over a set of probability distributions

We can use this fact to encode two (or more) distinct kinds of information

Example: Balkır (2014) uses a form of density operators to encode textual entailment. We can encode both ambiguity and entailment information as follows:

ρ(bank) = 0.5|B_amb⟩⟨B_amb| + 0.5|B_ent⟩⟨B_ent|
From theory to practice
Creating density operators
Density operators can be created with standard word sense induction (WSI) methods. For example:

Schütze (1998):

1 Create vectors for all contexts of a target word, e.g. by averaging the vectors of other words in the same sentence

2 Cluster those context vectors

3 Use the centroids of the produced clusters as sense vectors.

This will produce a statistical ensemble {(p_i, |s_i⟩)}_i that can be used for creating density operators:

ρ(w) = Σ_i p_i |s_i⟩⟨s_i|
Measuring ambiguity
How does ambiguity evolve from homonymous words (e.g. ‘nail’) to unambiguous compounds (‘rusty nail’, ‘nail that grows’)?

We can measure it with Von Neumann entropy.

Von Neumann entropy is zero for a pure state (i.e. a completely unambiguous word), and ln dim(H) for a maximally mixed state.

For a density operator ρ with eigen-decomposition ρ = Σ_i e_i |e_i⟩⟨e_i|, the Von Neumann entropy is defined as:

S(ρ) = −Tr(ρ ln ρ) = −Σ_i e_i ln e_i
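The two extreme cases (pure vs. maximally mixed) can be checked with a few lines of NumPy; the dimension is an arbitrary illustrative choice:

```python
import numpy as np

def von_neumann_entropy(rho, eps=1e-12):
    """S(rho) = -sum_i e_i ln e_i over the eigenvalues of rho (0 ln 0 = 0)."""
    e = np.linalg.eigvalsh(rho)
    e = e[e > eps]                 # drop zero eigenvalues by convention
    return float(-np.sum(e * np.log(e)))

d = 4
pure = np.zeros((d, d)); pure[0, 0] = 1.0   # a pure (fully disambiguated) state
mixed = np.eye(d) / d                        # maximally mixed: maximal ambiguity

print(np.isclose(von_neumann_entropy(pure), 0.0))         # zero for a pure state
print(np.isclose(von_neumann_entropy(mixed), np.log(d)))  # ln dim(H) when maximally mixed
```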
Measuring entropy: A small-scale experiment
Relative Clauses

noun: verb1/verb2        noun   noun that verb1   noun that verb2
organ: enchant/ache      0.18   0.11              0.08
vessel: swell/sail       0.25   0.16              0.01
queen: fly/rule          0.28   0.14              0.16
nail: gleam/grow         0.19   0.06              0.14
bank: overflow/loan      0.21   0.19              0.18

Adjectives

noun: adj1/adj2          noun   adj1 noun   adj2 noun
organ: music/body        0.18   0.10        0.13
vessel: blood/naval      0.25   0.05        0.07
queen: fair/chess        0.28   0.05        0.16
nail: rusty/finger       0.19   0.04        0.11
bank: water/financial    0.21   0.20        0.16
An important aspect of the proposed model:
Disambiguation = Purification
[Diagram relating compact closure, Categorical Quantum Mechanics, the original DisCo model, its open system extension, and vector space semantics]
Summary
Density operators offer richer semantic representations for distributional models of meaning

From probability distributions over symbols we advance to probability distributions over vectors

The nested levels of the CPM construction are an intriguing feature that deserves separate treatment

Density operators support a form of logic whose distributional and compositional properties remain to be examined

Large-scale experimental evaluation is currently in progress
References I
Abramsky, S. and Coecke, B. (2004). A categorical semantics of quantum protocols. In 19th Annual IEEE Symposium on Logic in Computer Science, pages 415–425.

Balkır, E. (2014). Using density matrices in a compositional distributional model of meaning. Master’s thesis, University of Oxford.

Birkhoff, G. and Von Neumann, J. (1936). The logic of quantum mechanics. Annals of Mathematics, pages 823–843.

Coecke, B., Sadrzadeh, M., and Clark, S. (2010). Mathematical foundations for a compositional distributional model of meaning. Lambek Festschrift. Linguistic Analysis, 36:345–384.

Kartsaklis, D., Kalchbrenner, N., and Sadrzadeh, M. (2014). Resolving lexical ambiguity in tensor regression models of meaning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 212–217, Baltimore, Maryland. Association for Computational Linguistics.

Kartsaklis, D. and Sadrzadeh, M. (2013). Prior disambiguation of word tensors for constructing sentence vectors. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1590–1601, Seattle, Washington, USA. Association for Computational Linguistics.

Kartsaklis, D. and Sadrzadeh, M. (2015). A Frobenius model of information structure in categorical compositional distributional semantics. In Proceedings of the 14th Meeting on Mathematics of Language.
References II
Kartsaklis, D., Sadrzadeh, M., and Pulman, S. (2012). A unified sentence space for categorical distributional-compositional semantics: Theory and experiments. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pages 549–558, Mumbai, India. The COLING 2012 Organizing Committee.

Kartsaklis, D., Sadrzadeh, M., Pulman, S., and Coecke, B. (2015). Reasoning about meaning in natural language with compact closed categories and Frobenius algebras. In Chubb, J., Eskandarian, A., and Harizanov, V., editors, Logic and Algebraic Structures in Quantum Computing and Information, Association for Symbolic Logic Lecture Notes in Logic. Cambridge University Press.

Piedeleu, R., Kartsaklis, D., Coecke, B., and Sadrzadeh, M. (2015). Open system categorical quantum semantics in natural language processing. arXiv preprint arXiv:1502.00831.

Sadrzadeh, M., Clark, S., and Coecke, B. (2013). The Frobenius anatomy of word meanings I: subject and object relative pronouns. Journal of Logic and Computation, Advance Access.

Sadrzadeh, M., Clark, S., and Coecke, B. (2014). The Frobenius anatomy of word meanings II: Possessive relative pronouns. Journal of Logic and Computation.

Selinger, P. (2007). Dagger compact closed categories and completely positive maps. Electronic Notes in Theoretical Computer Science, 170:139–163.