Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max...

93
Representing and comparing probabilities: Part 2 Arthur Gretton Gatsby Computational Neuroscience Unit, University College London UAI, 2017 1/52

Transcript of Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max...

Page 1: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Representing and comparing probabilities: Part 2

Arthur Gretton

Gatsby Computational Neuroscience Unit,University College London

UAI, 2017

1/52

Page 2: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Testing against a probabilistic model

2/52

Page 3: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Statistical model criticism

MMD(P ;Q) = kf �k2 = supkf kF�1[EQ f � Epf ]

-4 -2 2 4

-0.3

-0.2

-0.1

0.1

0.2

0.3

0.4

p(x)

q(x)

f *(x)

f �(x ) is the witness functionCan we compute MMD with samples from Q and a model P?Problem: usualy can’t compute Epf in closed form.

3/52

Page 4: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Stein idea

To get rid of Epf insup

kf kF�1[Eq f � Epf ]

we define the Stein operator

[Tpf ] (x ) =1

p(x )ddx

(f (x )p(x ))

ThenEPTP f = 0

subject to appropriate boundary conditions. (Oates, Girolami, Chopin, 2016)

4/52

Page 5: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Stein idea: proof

Ep [Tpf ] =Z �

1p(x )

ddx

(f (x )p(x ))�p(x )dxZ �

ddx

(f (x )p(x ))�dx

= [f (x )p(x )]1�1= 0

5/52

Page 6: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Stein idea: proof

Ep [Tpf ] =Z �

1

���p(x )ddx

(f (x )p(x ))��

��p(x )dxZ �ddx

(f (x )p(x ))�dx

= [f (x )p(x )]1�1= 0

5/52

Page 7: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Stein idea: proof

Ep [Tpf ] =Z �

1

���p(x )ddx

(f (x )p(x ))��

��p(x )dxZ �ddx

(f (x )p(x ))�dx

= [f (x )p(x )]1�1= 0

5/52

Page 8: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Stein idea: proof

Ep [Tpf ] =Z �

1

���p(x )ddx

(f (x )p(x ))��

��p(x )dxZ �ddx

(f (x )p(x ))�dx

= [f (x )p(x )]1�1= 0

5/52

Page 9: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Stein idea: proof

Ep [Tpf ] =Z �

1

���p(x )ddx

(f (x )p(x ))��

��p(x )dxZ �ddx

(f (x )p(x ))�dx

= [f (x )p(x )]1�1= 0

5/52

Page 10: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Kernel Stein DiscrepancyStein operator

Tpf = @x f + f @x (log p)

Kernel Stein Discrepancy (KSD)

KSD(p; q ;F) = supkgkF�1

EqTpg � EpTpg

6/52

Page 11: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Kernel Stein DiscrepancyStein operator

Tpf = @x f + f @x (log p)

Kernel Stein Discrepancy (KSD)

KSD(p; q ;F) = supkgkF�1

EqTpg �����EpTpg = supkgkF�1

EqTpg

6/52

Page 12: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Kernel Stein DiscrepancyStein operator

Tpf = @x f + f @x (log p)

Kernel Stein Discrepancy (KSD)

KSD(p; q ;F) = supkgkF�1

EqTpg �����EpTpg = supkgkF�1

EqTpg

-4 -2 2 4

-0.6

-0.4

-0.2

0.2

0.4

p(x)

q(x)

g *(x)

6/52

Page 13: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Kernel Stein DiscrepancyStein operator

Tpf = @x f + f @x (log p)

Kernel Stein Discrepancy (KSD)

KSD(p; q ;F) = supkgkF�1

EqTpg �����EpTpg = supkgkF�1

EqTpg

-4 -2 2 4

0.1

0.2

0.3

0.4

p(x)

q(x)

g *(x)

6/52

Page 14: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Kernel stein discrepancyClosed-form expression for KSD: given Z ;Z 0 � q , then(Chwialkowski, Strathmann, G., ICML 2016) (Liu, Lee, Jordan ICML 2016)

KSD(p; q ;F) = Eqhp(Z ;Z 0)

where

hp(x ; y) := @x log p(x )@x log p(y)k(x ; y)

+ @y log p(y)@xk(x ; y)

+ @x log p(x )@yk(x ; y)

+ @x@yk(x ; y)

and k is RKHS kernel for F

Only depends on kernel and @x log p(x ). Do not need tonormalize p, or sample from it.

7/52

Page 15: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Statistical model criticism

Chicago crime data

8/52

Page 16: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Statistical model criticism

Chicago crime dataModel is Gaussian mixture with two components.

8/52

Page 17: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Statistical model criticism

Chicago crime dataModel is Gaussian mixture with two components

Stein witness function

8/52

Page 18: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Statistical model criticism

Chicago crime dataModel is Gaussian mixture with ten components.

8/52

Page 19: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Statistical model criticism

Chicago crime dataModel is Gaussian mixture with ten components

Stein witness functionCode: https://github.com/karlnapf/kernel_goodness_of_fit 8/52

Page 20: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Kernel stein discrepancyFurther applications:

Evaluation of approximate MCMC methods.(Chwialkowski, Strathmann, G., ICML 2016; Gorham, Mackey, ICML 2017)

What kernel to use?

The inverse multiquadric kernel,

k(x ; y) =�c + kx � yk2

2

��for � 2 (�1; 0).

9/52

Page 21: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Testing statistical dependence

10/52

Page 22: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Dependence testing

Given: Samples from a distribution PXY

Goal: Are X and Y independent?

Theirnosesguidethemthroughlife,andthey'reneverhappierthanwhenfollowinganinterestingscent.

Alargeanimalwhoslingsslobber,exudesadistinctivehoundy odor,andwantsnothingmorethantofollowhisnose.

Textfromdogtime.com andpetfinder.com

A responsive,interactivepet,onethatwillblowinyourearandfollowyoueverywhere.

YX

11/52

Page 23: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

MMD as a dependence measure?

Could we use MMD?

MMD(PXY| {z }P

;PXPY| {z }Q

;H�)

We don’t have samples from Q := PXPY , only pairsf(xi ; yig

ni=1

i:i:d:� PXY

� Solution: simulate Q with pairs (xi ; yj ) for j 6= i .

What kernel � to use for the RKHS H�?

12/52

Page 24: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

MMD as a dependence measure?

Could we use MMD?

MMD(PXY| {z }P

;PXPY| {z }Q

;H�)

We don’t have samples from Q := PXPY , only pairsf(xi ; yig

ni=1

i:i:d:� PXY

� Solution: simulate Q with pairs (xi ; yj ) for j 6= i .

What kernel � to use for the RKHS H�?

12/52

Page 25: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

MMD as a dependence measure?

Could we use MMD?

MMD(PXY| {z }P

;PXPY| {z }Q

;H�)

We don’t have samples from Q := PXPY , only pairsf(xi ; yig

ni=1

i:i:d:� PXY

� Solution: simulate Q with pairs (xi ; yj ) for j 6= i .

What kernel � to use for the RKHS H�?

12/52

Page 26: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

MMD as a dependence measureKernel k on images with feature space F ,

Kernel l on captions with feature space G,

13/52

Page 27: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

MMD as a dependence measureKernel k on images with feature space F ,

Kernel l on captions with feature space G,

Kernel � on image-text pairs: are images and captions similar?

13/52

Page 28: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

MMD as a dependence measure

Given: Samples from a distribution PXY

Goal: Are X and Y independent?

MMD2( bPXY ; bPX bPY ;H�) :=1n2 trace(KL)

( K, L column centered)

14/52

Page 29: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

MMD as a dependence measure

Given: Samples from a distribution PXY

Goal: Are X and Y independent?

MMD2( bPXY ; bPX bPY ;H�) :=1n2 trace(KL)

14/52

Page 30: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

MMD as a dependence measure

Two questions:

Why the product kernel? Many ways to combine kernels - why noteg a sum?

Is there a more interpretable way of defining this dependencemeasure?

15/52

Page 31: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Finding covariance with smooth transformations

Illustration: two variables with no correlation but strong dependence.

-2 -1 0 1 2

-1.5

-1

-0.5

0

0.5

1

1.5Correlation: 0.00

16/52

Page 32: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Finding covariance with smooth transformations

Illustration: two variables with no correlation but strong dependence.

-2 -1 0 1 2

-1.5

-1

-0.5

0

0.5

1

1.5Correlation: 0.00

-2 0 2

-1

-0.5

0

0.5

-2 0 2

-1

-0.5

0

0.5

16/52

Page 33: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Finding covariance with smooth transformations

Illustration: two variables with no correlation but strong dependence.

-2 -1 0 1 2

-1.5

-1

-0.5

0

0.5

1

1.5Correlation: 0.00

-2 0 2

-1

-0.5

0

0.5

-2 0 2

-1

-0.5

0

0.5

-1 -0.5 0 0.5

-1

-0.5

0

0.5Correlation: 0.90

16/52

Page 34: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Define two spaces, one for each witness

Function in F

f (x ) =1X

j=1

fj'j (x )

Feature map

Kernel for RKHS F on X :

k(x ; x 0) = h'(x ); '(x 0)iF

Function in G

g(y) =1X

j=1

gj�j (y)

Feature map

Kernel for RKHS G on Y:

l(x ; x 0) = h�(y); �(y 0)iG17/52

Page 35: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

The constrained covarianceThe constrained covariance is

COCO(PXY ) = supkf kF � 1kgkG � 1

cov[f (x )g(y)]

-2 -1 0 1 2

-1.5

-1

-0.5

0

0.5

1

1.5Correlation: 0.00

-2 0 2

-1

-0.5

0

0.5

-2 0 2

-1

-0.5

0

0.5

-1 -0.5 0 0.5

-1

-0.5

0

0.5Correlation: 0.90

18/52

Page 36: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

The constrained covarianceThe constrained covariance is

COCO(PXY ) = supkf kF � 1kgkG � 1

cov

240@ 1Xj=1

fj'j (x )

1A0@ 1Xj=1

gj�j (y)

1A35

18/52

Page 37: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

The constrained covarianceThe constrained covariance is

COCO(PXY ) = supkf kF � 1kgkG � 1

Exy

240@ 1Xj=1

fj'j (x )

1A0@ 1Xj=1

gj�j (y)

1A35

Fine print: feature mappings '(x ) and �(y) assumed to have zero mean.

18/52

Page 38: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

The constrained covarianceThe constrained covariance is

COCO(PXY ) = supkf kF � 1kgkG � 1

Exy

240@ 1Xj=1

fj'j (x )

1A0@ 1Xj=1

gj�j (y)

1A35

Fine print: feature mappings '(x ) and �(y) assumed to have zero mean.

Rewriting:

Exy [f (x )g(y)]

=

2664f1f2...

3775>

Exy

0BB@2664'1(x )'2(x )

...

3775 h �1(y) �2(y) : : :i1CCA

| {z }C'(x)�(y)

2664g1

g2...

3775

18/52

Page 39: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

The constrained covarianceThe constrained covariance is

COCO(PXY ) = supkf kF � 1kgkG � 1

Exy

240@ 1Xj=1

fj'j (x )

1A0@ 1Xj=1

gj�j (y)

1A35

Fine print: feature mappings '(x ) and �(y) assumed to have zero mean.

Rewriting:

Exy [f (x )g(y)]

=

2664f1f2...

3775>

Exy

0BB@2664'1(x )'2(x )

...

3775 h �1(y) �2(y) : : :i1CCA

| {z }C'(x)�(y)

2664g1

g2...

3775

COCO: max singular value of feature covariance C'(x )�(y)18/52

Page 40: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Computing COCO in practiceGiven sample f(xi ; yi )g

ni=1

i:i:d:� PXY , what is empirical \COCO ?

kernels are computed with empirically centered features '(x )� 1n

Pni=1 '(xi ) and

�(y)� 1n

Pni=1 �(yi ).

AG., A. Smola., O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B.

Schoelkopf, and N. Logothetis, AISTATS’05

19/52

Page 41: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Computing COCO in practiceGiven sample f(xi ; yi )g

ni=1

i:i:d:� PXY , what is empirical \COCO ?

\COCO is largest eigenvalue max of"0 1

n KL1n LK 0

# "�

#=

"K 00 L

# "�

#:

Kij = k(xi ; xj ) and Lij = l(yi ; yj ).

Fine print: kernels are computed with empirically centered features '(x )� 1n

Pni=1 '(xi )

and �(y)� 1n

Pni=1 �(yi ).

AG., A. Smola., O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B.

Schoelkopf, and N. Logothetis, AISTATS’05

19/52

Page 42: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Computing COCO in practiceGiven sample f(xi ; yi )g

ni=1

i:i:d:� PXY , what is empirical \COCO ?

\COCO is largest eigenvalue max of"0 1

n KL1n LK 0

# "�

#=

"K 00 L

# "�

#:

Kij = k(xi ; xj ) and Lij = l(yi ; yj ).

Witness functions (singular vectors):

f (x ) /mX

i=1

�ik(xi ; x ) g(y) /nX

i=1

�i l(yi ; y)

Fine print: kernels are computed with empirically centered features '(x )� 1n

Pni=1 '(xi )

and �(y)� 1n

Pni=1 �(yi ).

AG., A. Smola., O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B.

Schoelkopf, and N. Logothetis, AISTATS’05

19/52

Page 43: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

What is a large dependence with COCO?

−2 0 2−3

−2

−1

0

1

2

3

X

Y

Smooth density

−4 −2 0 2 4−4

−2

0

2

4

X

Y

500 Samples, smooth density

−2 0 2−3

−2

−1

0

1

2

3

XY

Rough density

−4 −2 0 2 4−4

−2

0

2

4

X

Y

500 samples, rough density

Density takes the form:

PXY / 1+sin(!x ) sin(!y)

Which of these is the more “dependent”?

20/52

Page 44: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Finding covariance with smooth transformations

Case of ! = 1:

-4 -2 0 2 4

-4

-2

0

2

4Correlation: 0.31

-2 0 2

-1

-0.5

0

0.5

1

-2 0 2

-1

-0.5

0

0.5

1

-0.5 0 0.5

-0.5

0

0.5

Correlation: 0.50 COCO: 0.09

21/52

Page 45: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Finding covariance with smooth transformations

Case of ! = 2:

-4 -2 0 2 4

-4

-2

0

2

4Correlation: 0.02

-2 0 2

-1

-0.5

0

0.5

1

-2 0 2

-1

-0.5

0

0.5

1

-0.5 0 0.5

-0.5

0

0.5

Correlation: 0.54 COCO: 0.07

22/52

Page 46: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Finding covariance with smooth transformations

Case of ! = 3:

-4 -2 0 2 4

-4

-2

0

2

4Correlation: 0.03

-2 0 2

-1

-0.5

0

0.5

1

-2 0 2

-1

-0.5

0

0.5

1

-0.5 0 0.5

-0.5

0

0.5

Correlation: 0.44 COCO: 0.04

23/52

Page 47: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Finding covariance with smooth transformations

Case of ! = 4:

-4 -2 0 2 4

-4

-2

0

2

4Correlation: 0.05

-2 0 2

-1

-0.5

0

0.5

1

-2 0 2

-1

-0.5

0

0.5

1

-0.5 0 0.5

-0.5

0

0.5

Correlation: 0.25 COCO: 0.02

24/52

Page 48: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Finding covariance with smooth transformations

Case of ! =??:

-4 -2 0 2 4

-4

-2

0

2

4Correlation: 0.01

-2 0 2

-1

-0.5

0

0.5

1

-2 0 2

-1

-0.5

0

0.5

1

-0.5 0 0.5

-0.5

0

0.5

Correlation: 0.14 COCO: 0.02

25/52

Page 49: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Finding covariance with smooth transformations

Case of ! = 0: uniform noise! (shows bias)

-4 -2 0 2 4

-4

-2

0

2

4Correlation: 0.01

-2 0 2

-1

-0.5

0

0.5

1

-2 0 2

-1

-0.5

0

0.5

1

-0.5 0 0.5

-0.5

0

0.5

Correlation: 0.14 COCO: 0.02

26/52

Page 50: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Dependence largest when at “low” frequencies

As dependence is encoded at higher frequencies, the smoothmappings f ; g achieve lower linear dependence.

Even for independent variables, COCO will not be zero at finitesample sizes, since some mild linear dependence will be found by f,g(bias)

This bias will decrease with increasing sample size.

27/52

Page 51: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Can we do better than COCO?A second example with zero correlation.First singular value of feature covariance C'(x )�(y):

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1

Correlation: 0.00

-2 0 2

-1

-0.5

0

0.5

-2 0 2

-1

-0.5

0

0.5

-1 -0.5 0 0.5

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

Correlation: 0.80 COCO1: 0.11

28/52

Page 52: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Can we do better than COCO?A second example with zero correlation.Second singular value of feature covariance C'(x )�(y):

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1

Correlation: 0.00

-2 0 2

-1

-0.5

0

0.5

1

-2 0 2

-1

-0.5

0

0.5

1

28/52

Page 53: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Can we do better than COCO?A second example with zero correlation.Second singular value of feature covariance C'(x )�(y):

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1

Correlation: 0.00

-2 0 2

-1

-0.5

0

0.5

1

-2 0 2

-1

-0.5

0

0.5

1

-1 -0.5 0 0.5 1

-0.5

0

0.5

Correlation: 0.37 COCO2: 0.06

28/52

Page 54: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

The Hilbert-Schmidt Independence Criterion

Writing the ith singular value of the feature covariance C'(x )�(y) as

i := COCOi (PXY ;F ;G);

define Hilbert-Schmidt Independence Criterion (HSIC)

HSIC 2(PXY ;F ;G) =1Xi=1

2i :

AG, O. Bousquet , A. Smola., and B. Schoelkopf, ALT2005; AG,.,K. Fukumizu„C.H. Teo., L. Song., B.Schoelkopf., and A. Smola, NIPS 2007,.

29/52

Page 55: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

The Hilbert-Schmidt Independence Criterion

Writing the ith singular value of the feature covariance C'(x )�(y) as

i := COCOi (PXY ;F ;G);

define Hilbert-Schmidt Independence Criterion (HSIC)

HSIC 2(PXY ;F ;G) =1Xi=1

2i :

AG, O. Bousquet , A. Smola., and B. Schoelkopf, ALT2005; AG,.,K. Fukumizu„C.H. Teo., L. Song., B.Schoelkopf., and A. Smola, NIPS 2007,.

HSIC is MMD with product kernel!

HSIC 2(PXY ;F ;G) = MMD2(PXY ;PXPY ;H�)

where �((x ; y); (x 0; y 0)) = k(x ; x 0)l(y ; y 0).

29/52

Page 56: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Asymptotics of HSIC under independence

Given sample f(xi ; yigni=1

i:i:d:� PXY , what is empirical\HSIC ?

Empirical HSIC (biased)

\HSIC =1n2 trace(KL)

Kij = k(xi ; xj ) and Lij = l(yiyj ) (K and L computed withempirically centered features)

Statistical testing: given PXY = PXPY , what is the threshold c�such that P(\HSIC > c�) < � for small �?Asymptotics of\HSIC when PXY = PXPY :

n\HSIC D!

1Xl=1

�lz 2l ; zl � N (0; 1)i:i:d:

where �l l (zj ) =R

hijqr l (zi )dFi;q;r ; hijqr = 14!

P(i;j ;q;r)(t;u;v ;w)

ktu ltu + ktu lvw � 2ktu ltv

30/52

Page 57: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Asymptotics of HSIC under independence

Given sample f(xi ; yigni=1

i:i:d:� PXY , what is empirical\HSIC ?

Empirical HSIC (biased)

\HSIC =1n2 trace(KL)

Kij = k(xi ; xj ) and Lij = l(yiyj ) (K and L computed withempirically centered features)

Statistical testing: given PXY = PXPY , what is the threshold c�such that P(\HSIC > c�) < � for small �?Asymptotics of\HSIC when PXY = PXPY :

n\HSIC D!

1Xl=1

�lz 2l ; zl � N (0; 1)i:i:d:

where �l l (zj ) =R

hijqr l (zi )dFi;q;r ; hijqr = 14!

P(i;j ;q;r)(t;u;v ;w)

ktu ltu + ktu lvw � 2ktu ltv

30/52

Page 58: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Asymptotics of HSIC under independence

Given sample f(xi ; yigni=1

i:i:d:� PXY , what is empirical\HSIC ?

Empirical HSIC (biased)

\HSIC =1n2 trace(KL)

Kij = k(xi ; xj ) and Lij = l(yiyj ) (K and L computed withempirically centered features)

Statistical testing: given PXY = PXPY , what is the threshold c�such that P(\HSIC > c�) < � for small �?Asymptotics of\HSIC when PXY = PXPY :

n\HSIC D!

1Xl=1

�lz 2l ; zl � N (0; 1)i:i:d:

where �l l (zj ) =R

hijqr l (zi )dFi;q;r ; hijqr = 14!

P(i;j ;q;r)(t;u;v ;w)

ktu ltu + ktu lvw � 2ktu ltv

30/52

Page 59: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Asymptotics of HSIC under independence

Given sample f(xi ; yigni=1

i:i:d:� PXY , what is empirical\HSIC ?

Empirical HSIC (biased)

\HSIC =1n2 trace(KL)

Kij = k(xi ; xj ) and Lij = l(yiyj ) (K and L computed withempirically centered features)

Statistical testing: given PXY = PXPY , what is the threshold c�such that P(\HSIC > c�) < � for small �?Asymptotics of\HSIC when PXY = PXPY :

n\HSIC D!

1Xl=1

�lz 2l ; zl � N (0; 1)i:i:d:

where �l l (zj ) =R

hijqr l (zi )dFi;q;r ; hijqr = 14!

P(i;j ;q;r)(t;u;v ;w)

ktu ltu + ktu lvw � 2ktu ltv

30/52

Page 60: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

A statistical test

Given PXY = PXPY , what is the threshold c� such thatP(\HSIC > c�) < � for small � (prob. of false positive)?

Original time series:

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10

Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10

Permutation:

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10

Y7 Y3 Y9 Y2 Y4 Y8 Y5 Y1 Y6 Y10

Null distribution via permutation� Compute HSIC for fxi ; y�(i)gn

i=1 for random permutation � of indicesf1; : : : ;ng. This gives HSIC for independent variables.

� Repeat for many different permutations, get empirical CDF� Threshold c� is 1� � quantile of empirical CDF 31/52

Page 61: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

A statistical test

Given PXY = PXPY , what is the threshold c� such thatP(\HSIC > c�) < � for small � (prob. of false positive)?

Original time series:

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10

Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10

Permutation:

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10

Y7 Y3 Y9 Y2 Y4 Y8 Y5 Y1 Y6 Y10

Null distribution via permutation� Compute HSIC for fxi ; y�(i)gn

i=1 for random permutation � of indicesf1; : : : ;ng. This gives HSIC for independent variables.

� Repeat for many different permutations, get empirical CDF� Threshold c� is 1� � quantile of empirical CDF 31/52

Page 62: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

A statistical test

Given PXY = PXPY , what is the threshold c� such thatP(\HSIC > c�) < � for small � (prob. of false positive)?

Original time series:

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10

Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10

Permutation:

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10

Y7 Y3 Y9 Y2 Y4 Y8 Y5 Y1 Y6 Y10

Null distribution via permutation� Compute HSIC for fxi ; y�(i)gn

i=1 for random permutation � of indicesf1; : : : ;ng. This gives HSIC for independent variables.

� Repeat for many different permutations, get empirical CDF� Threshold c� is 1� � quantile of empirical CDF 31/52

Page 63: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Application: dependence detection across languagesTesting task: detect dependence between English and French text

Lesordresdegouvernementsprovinciauxetmunicipauxsubissentdefortespressions

Honourablesenators,IhaveaquestionfortheLeaderoftheGovernmentintheSenate

Textfromthealignedhansards ofthe36th parliamentofcanada,https://www.isi.edu/natural-language/download/hansard/

YXHonorablessénateurs,maquestions’adresseauleaderdugouvernementauSénat

Aucontraire,nousavonsaugmentelefinancementfédéralpourledéveloppementdesjeunes

Nodoubtthereisgreatpressureonprovincialandmunicipalgovernments

Infact,wehaveincreasedfederalinvestmentsforearlychildhooddevelopment.

......

32/52

Page 64: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Application: dependence detection across languagesTesting task: detect dependence between English and French textk -spectrum kernel, k = 10, sample size n = 10

\HSIC =1n2 trace(KL)

(K and L column centered) 33/52

Page 65: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Application:Dependence detection across languages

Results (for � = 0:05)

k-spectrum kernel: average Type II error 0

Bag of words kernel: average Type II error 0.18

Settings: Five line extracts, averaged over 300 repetitions, for“Agriculture” transcripts. Similar results for Fisheries andImmigration transcripts.

34/52

Page 66: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Testing higher order interactions

35/52

Page 67: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Detecting higher order interaction

How to detect V-structures with pairwise weak individualdependence?

X Y

Z

36/52

Page 68: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Detecting higher order interaction

How to detect V-structures with pairwise weak individualdependence?

36/52

Page 69: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Detecting higher order interactionHow to detect V-structures with pairwise weak individualdependence?

X ?? Y ;Y ?? Z ;X ?? ZX1 vs Y1 Y1 vs Z1

X1 vs Z1 X1*Y1 vs Z1

X Y

Z

X ;Y i:i:d:� N (0; 1)

Z j X ;Y � sign(XY )Exp( 1p2)

Fine print: Faithfulness violated here!

36/52

Page 70: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

V-structure discovery

X Y

Z

Assume X ?? Y has been established.V-structure can then be detected by:

Consistent CI test: H0 : X ?? Y jZ [Fukumizu et al. 2008, Zhang et al. 2011]

Factorisation test: H0 : (X ;Y ) ?? Z _ (X ;Z ) ?? Y _ (Y ;Z ) ?? X(multiple standard two-variable tests)

How well do these work?37/52

Page 71: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Detecting higher order interaction

Generalise earlier example to p dimensions

X ?? Y ;Y ?? Z ;X ?? ZX1 vs Y1 Y1 vs Z1

X1 vs Z1 X1*Y1 vs Z1

X Y

Z

X ;Y i:i:d:� N (0; 1)

Z j X ;Y � sign(XY )Exp( 1p2)

X2:p ;Y2:p ;Z2:pi :i :d :� N (0; Ip�1)

Fine print: Faithfulness violated here!

38/52

Page 72: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

V-structure discovery

CI test for X ?? Y jZ from Zhang et al. (2011), and a factorisationtest, n = 500

39/52

Page 73: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Lancaster interaction measureLancaster interaction measure of (X1; : : : ;XD) � P is a signedmeasure �P that vanishes whenever P can be factorised non-trivially.

D = 2 : �LP = PXY � PXPY

40/52

Page 74: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Lancaster interaction measureLancaster interaction measure of (X1; : : : ;XD) � P is a signedmeasure �P that vanishes whenever P can be factorised non-trivially.

D = 2 : �LP = PXY � PXPY

D = 3 : �LP = PXYZ �PXPYZ �PY PXZ �PZPXY +2PXPY PZ

40/52

Page 75: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Lancaster interaction measure

Lancaster interaction measure of (X1; : : : ;XD) � P is a signedmeasure �P that vanishes whenever P can be factorised non-trivially.

D = 2 : �LP = PXY � PXPY

D = 3 : �LP = PXYZ �PXPYZ �PY PXZ �PZPXY +2PXPY PZ

X Y

Z

X Y

Z

X Y

Z

X Y

Z

PXY Z −PXPY Z −PY PXZ −PZPXY +2PXPY PZ

∆LP =

40/52

Page 76: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Lancaster interaction measureLancaster interaction measure of (X1; : : : ;XD) � P is a signedmeasure �P that vanishes whenever P can be factorised non-trivially.

D = 2 : �LP = PXY � PXPY

D = 3 : �LP = PXYZ �PXPYZ �PY PXZ �PZPXY +2PXPY PZ

X Y

Z

X Y

Z

X Y

Z

X Y

Z

PXY Z −PXPY Z −PXZPY −PXYPZ +2PXPY PZ

∆LP = 0

Case of PX ?? PYZ

40/52

Page 77: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Lancaster interaction measure

Lancaster interaction measure of (X1; : : : ;XD) � P is a signedmeasure �P that vanishes whenever P can be factorised non-trivially.

D = 2 : �LP = PXY � PXPY

D = 3 : �LP = PXYZ �PXPYZ �PY PXZ �PZPXY +2PXPY PZ

(X ;Y ) ?? Z _ (X ;Z ) ?? Y _ (Y ;Z ) ?? X ) �LP = 0:

...so what might be missed?

40/52

Page 78: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Lancaster interaction measureLancaster interaction measure of (X1; : : : ;XD) � P is a signedmeasure �P that vanishes whenever P can be factorised non-trivially.

D = 2 : �LP = PXY � PXPY

D = 3 : �LP = PXYZ �PXPYZ �PY PXZ �PZPXY +2PXPY PZ

�LP = 0; (X ;Y ) ?? Z _ (X ;Z ) ?? Y _ (Y ;Z ) ?? X

Example:

P(0; 0; 0) = 0:2 P(0; 0; 1) = 0:1 P(1; 0; 0) = 0:1 P(1; 0; 1) = 0:1P(0; 1; 0) = 0:1 P(0; 1; 1) = 0:1 P(1; 1; 0) = 0:1 P(1; 1; 1) = 0:2

40/52

Page 79: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

A kernel test statistic using Lancaster Measure

Construct a test by estimating k�� (�LP)k2H�

; where � = k l m :

k��(PXYZ � PXY PZ � � � � )k2H�

=

h��PXYZ ; ��PXYZ iH�� 2 h��PXYZ ; ��PXY PZ iH�

� � �

41/52

Page 80: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

A kernel test statistic using Lancaster Measure

Table: V -statistic estimators of h���; ��� 0iH�

(without terms PXPY PZ ). His centering matrix I � n�1

Lancaster interaction statistic: D. Sejdinovic, AG, W. Bergsma, NIPS13

k�� (�LP)k2H�

=1n2 (HKH �HLH �HMH )++ :

Empirical joint central moment in the feature space 42/52

Page 81: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

A kernel test statistic using Lancaster Measure

Table: V -statistic estimators of h���; ��� 0iH�

(without terms PXPY PZ ). His centering matrix I � n�1

Lancaster interaction statistic: D. Sejdinovic, AG, W. Bergsma, NIPS13

k�� (�LP)k2H�

=1n2 (HKH �HLH �HMH )++ :

Empirical joint central moment in the feature space 42/52

Page 82: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

V-structure discovery

Lancaster test, CI test for X ?? Y jZ from Zhang et al. (2011), and afactorisation test, n = 500 43/52

Page 83: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Interaction for D � 4

Interaction measure valid for all D :(Streitberg, 1990)

�SP =X�

(�1)j�j�1 (j�j � 1)!J�P

� For a partition �, J� associates to thejoint the corresponding factorisation,e.g., J13j2j4P = PX1X3PX2PX4 :

44/52

Page 84: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Interaction for D � 4

Interaction measure valid for all D :(Streitberg, 1990)

�SP =X�

(�1)j�j�1 (j�j � 1)!J�P

� For a partition �, J� associates to thejoint the corresponding factorisation,e.g., J13j2j4P = PX1X3PX2PX4 :

44/52

Page 85: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Interaction for D � 4

Interaction measure valid for all D :(Streitberg, 1990)

�SP =X�

(�1)j�j�1 (j�j � 1)!J�P

� For a partition �, J� associates to thejoint the corresponding factorisation,e.g., J13j2j4P = PX1X3PX2PX4 :

1e+04

1e+09

1e+14

1e+19

1 3 5 7 9 11 13 15 17 19 21 23 25D

Nu

mb

er

of

pa

rtitio

ns o

f {1

,...

,D} Bell numbers growth

44/52

Page 86: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Part 4: Advanced topics

45/52

Page 87: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Advanced topics

testing on time series

testing for conditional dependence

regression and conditional mean embedding

46/52

Page 88: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Measures of divergence

47/52

Page 89: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Measures of divergence

48/52

Page 90: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Measures of divergence

49/52

Page 91: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Measures of divergence

50/52

Page 92: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Measures of divergence

Sriperumbudur, Fukumizu, G, Schoelkopf, Lanckriet (2012)

51/52

Page 93: Arthur Gretton - UAIauai.org/uai2017/media/tutorials/gretton_2.pdf · COCO\ islargesteigenvalue max of " 0 1 n KL 1 n LK 0 #" # = " K 0 0 L #" #: K ij = k( x i; x j) andL ij = l(

Co-authorsFrom Gatsby:Kacper Chwialkowski

Wittawat Jitkrittum

Bharath Sriperumbudur

Heiko Strathmann

Dougal Sutherland

Zoltan Szabo

Wenkai Xu

External collaborators:Kenji Fukumizu

Bernhard Schoelkopf

Alex Smola

Questions?

52/52