Transcript of "Kernel Mean Embeddings" - University of Cambridge
Source slides: cbl.eng.cam.ac.uk/pub/Intranet/MLG/ReadingGroup/...
Kernel Mean Embeddings
George Hron, Adam Scibior
Department of Engineering, University of Cambridge
Reading Group, March 2017
Outline
RKHS
Kernel Mean Embeddings
Characteristic kernels
Two Sample Testing
MMD
Kernelised Stein Discrepancy
Kernel Bayes’ Rule
Kernels
A kernel k is a positive definite function k : X × X → R; that is, for all a_1, ..., a_n ∈ R and x_1, ..., x_n ∈ X,

∑_{i,j=1}^n a_i a_j k(x_i, x_j) ≥ 0
Intuitively, a kernel is a measure of similarity between two elements of X.
Kernels
Commonly used kernels:

- polynomial: (〈x, y〉 + 1)^d
- Gaussian RBF: exp(−‖x − y‖² / (2σ²))
- Laplace: exp(−‖x − y‖ / σ)
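The positive-definiteness condition above can be checked numerically: the Gram matrix built from any set of points must have no negative eigenvalues. A minimal sketch for the Gaussian RBF kernel (the function names and test points below are illustrative, not from the slides):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_kernel(X, X)

# Positive definiteness: a^T K a >= 0 for all a, equivalently all
# eigenvalues of the symmetric Gram matrix are non-negative.
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)  # True
```

The same check fails for a function that is not a valid kernel (e.g. k(x, y) = −‖x − y‖²), which is one quick way to catch an invalid similarity measure.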
RKHS
A reproducing kernel Hilbert space (RKHS) H for a kernel k is the space spanned by the functions k(x, ·) for all x ∈ X, with the inner product defined by

〈k(x_1, ·), k(x_2, ·)〉_H = k(x_1, x_2)

which extends to all of H by linearity of the inner product.

The equation above is known as the kernel trick: it lets us treat H as an implicit feature space in which we never have to evaluate the feature map x ↦ k(x, ·) explicitly.
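To make the "implicit feature space" concrete: the degree-2 polynomial kernel on R² equals an ordinary inner product of explicit 6-dimensional feature vectors. A small numerical sketch (the map φ below is one standard choice, introduced here for illustration):

```python
import numpy as np

def poly_kernel(x, y):
    """Polynomial kernel (<x, y> + 1)^2 for x, y in R^2."""
    return (x @ y + 1.0) ** 2

def phi(x):
    """Explicit feature map with <phi(x), phi(y)> = (<x, y> + 1)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])
print(np.isclose(poly_kernel(x, y), phi(x) @ phi(y)))  # True
```

For the Gaussian RBF kernel the corresponding feature space is infinite-dimensional, which is exactly why evaluating k directly instead of φ matters.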
RKHS
H has the reproducing property: for all f ∈ H and all x ∈ X,

〈f, k(x, ·)〉_H = f(x)
Expectations in RKHS

Goal: evaluate expected values of functions from an RKHS.

As with the kernel trick, it turns out that this is possible in terms of inner products, making the computation analytically tractable:

E_p[f(x)] = ∫_X f(x) p(dx) = ∫_X 〈k(x, ·), f〉_{H_k} p(dx)
         ?= 〈∫_X k(x, ·) p(dx), f〉_{H_k} =: 〈µ_p, f〉_{H_k}

(the "?=" step, swapping the integral and the inner product, is what needs justification). µ_p is the so-called kernel mean embedding (KME) of the distribution p.

Note: X is assumed measurable throughout the whole presentation.
Existence of KME

Riesz representation theorem: for every bounded linear functional T : H → R (resp. T : H → C), there exists a unique g ∈ H such that T(f) = 〈g, f〉_H for all f ∈ H.

Expectation is a linear functional, and for all f ∈ H we have

E_p f(x) ≤ |E_p f(x)| ≤ E_p |f(x)| = E_p |〈f, k(x, ·)〉_H| ≤ ‖f‖_H E_p ‖k(x, ·)‖_H

Since ‖k(x, ·)‖_H = √k(x, x), if E_p √k(x, x) < ∞ the expectation is a bounded functional; thus µ_p ∈ H exists, and

E_p f(x) = 〈µ_p, f〉_H for all f ∈ H.
Estimating KME

Because µ_p is generally unknown, it has to be estimated. A natural (and minimax optimal) estimator is the sample mean,

µ̂_p := (1/N) ∑_{n=1}^N k(x_n, ·)

and in particular,

µ̂_p(x) = 〈µ̂_p, k(x, ·)〉_H = (1/N) ∑_{n=1}^N k(x_n, x) → E_{p(x′)} k(x′, x) as N → ∞.
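The empirical estimator above can be checked in a case where the limit E_{p(x′)} k(x′, x) has a closed form: for p = N(0, 1) and a Gaussian RBF kernel, the expectation is a Gaussian convolution. A sketch (the closed form is derived here for illustration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
xs = rng.normal(size=200_000)          # samples from p = N(0, 1)

def mu_hat(x):
    """Empirical KME at x: (1/N) sum_n k(x_n, x) with the RBF kernel."""
    return np.mean(np.exp(-(x - xs)**2 / (2 * sigma**2)))

def mu_exact(x):
    """E_{x' ~ N(0,1)} k(x', x), via the Gaussian convolution formula."""
    return sigma / np.sqrt(sigma**2 + 1) * np.exp(-x**2 / (2 * (sigma**2 + 1)))

grid = np.array([-2.0, 0.0, 1.5])
print(np.allclose([mu_hat(v) for v in grid], mu_exact(grid), atol=0.01))  # True
```

With 2 × 10⁵ samples the Monte Carlo error is roughly 10⁻³, well within the tolerance, illustrating the N → ∞ convergence stated above.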
Representing Distributions via KME
A positive definite kernel is called characteristic if the map

T : M_1^+(X) → H, p ↦ µ_p

is injective; M_1^+(X) is the set of probability measures on X. [5, 6]
Characteristic kernels
Proving that a kernel is characteristic is non-trivial in general, but sufficient conditions exist. Three well-known examples:

- Universality: if k is continuous, X compact, and H_k dense in C(X) w.r.t. L∞, then k is characteristic. [6, 8]
- Integral strict positive definiteness: a bounded measurable kernel k is called integrally strictly positive definite if ∫_X ∫_X k(x, y) µ(dx) µ(dy) > 0 for all non-zero finite signed Borel measures µ on X; such kernels are characteristic. [22]
- Some stationary kernels: for X = R^d, a stationary kernel k is characteristic iff supp Λ(ω) = R^d, where Λ(ω) is the spectral density of k (cf. Bochner's theorem). [22]
t-test

Have {x_1, ..., x_N} iid∼ N(µ_1, σ_1²) and {y_1, ..., y_M} iid∼ N(µ_2, σ_2²), where all parameters are unknown.

Declare H_0 : µ_1 = µ_2; H_1 : µ_1 ≠ µ_2. Then for the sample means

µ̂_1 := (1/N) ∑_{n=1}^N x_n        µ̂_2 := (1/M) ∑_{m=1}^M y_m

we have µ̂_1 ∼ N(µ_1, σ_1²/N), µ̂_2 ∼ N(µ_2, σ_2²/M), and µ̂_1 − µ̂_2 ∼ N(µ_1 − µ_2, σ_1²/N + σ_2²/M).

Hence, under H_0, t ∼ t_ν approximately (Welch's test), where

t := ((µ̂_1 − µ̂_2) − 0) / √(s_1²/N + s_2²/M)

s_1² := ∑_{i=1}^N (x_i − µ̂_1)² / (N − 1)        s_2² := ∑_{i=1}^M (y_i − µ̂_2)² / (M − 1)

and the degrees of freedom are given by the Welch–Satterthwaite formula,

ν := (s_1²/N + s_2²/M)² / [ (s_1²/N)²/(N − 1) + (s_2²/M)²/(M − 1) ]
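The statistic and the Welch–Satterthwaite degrees of freedom translate directly from the formulas above; a sketch using NumPy only (sample sizes, variances, and the seed are arbitrary):

```python
import numpy as np

def welch_t(x, y):
    """Welch's two-sample t statistic and Welch-Satterthwaite dof."""
    n, m = len(x), len(y)
    s1 = np.var(x, ddof=1)            # s_1^2, unbiased sample variance
    s2 = np.var(y, ddof=1)            # s_2^2
    se2 = s1 / n + s2 / m             # variance of the mean difference
    t = (np.mean(x) - np.mean(y)) / np.sqrt(se2)
    nu = se2**2 / ((s1 / n)**2 / (n - 1) + (s2 / m)**2 / (m - 1))
    return t, nu

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100)
y = rng.normal(0.0, 2.0, size=80)     # H0 true: equal means, unequal variances
t, nu = welch_t(x, y)
# nu always lies between min(N, M) - 1 and N + M - 2
print(79 <= nu <= 178)  # True
```

Note the degrees of freedom are data-dependent, unlike the equal-variance t-test where ν = N + M − 2.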
t-test
http://grasshopper.com/blog/the-errors-of-ab-testing-your-conclusions-can-make-things-worse/
More examples (figures from Arthur Gretton, Gatsby Neuroscience Unit)
Recap

Kernel mean embedding:

E_{p(x)}[f(x)] = 〈f, µ_p〉_H, ∀f ∈ H

For a characteristic kernel k and the corresponding RKHS H,

µ_p = µ_q iff p = q.

This means we might be able to distinguish distributions by comparing the corresponding kernel mean embeddings.
Maximum Mean Discrepancy (MMD)

Measure the distance between mean embeddings by the worst-case difference of expected values [9],

MMD(p, q, H) := sup_{f ∈ H, ‖f‖_H ≤ 1} ( E_{p(x)}[f(x)] − E_{q(y)}[f(y)] )
             = sup_{f ∈ H, ‖f‖_H ≤ 1} 〈µ_p − µ_q, f〉_H

Notice that 〈µ_p − µ_q, f〉_H ≤ ‖µ_p − µ_q‖_H ‖f‖_H, with equality iff f ∝ µ_p − µ_q. Hence,

MMD(p, q, H) = ‖µ_p − µ_q‖_H
Witness function (figures from Arthur Gretton, Gatsby Neuroscience Unit)
Estimation

Recall the empirical kernel mean embedding estimator,

µ̂_p = (1/N) ∑_{n=1}^N k(x_n, ·)

We can estimate the square of the MMD by substituting the empirical estimator. For {x_i}_{i=1}^N iid∼ p and {y_i}_{i=1}^N iid∼ q,

MMD̂² = 1/(N(N − 1)) ∑_{i=1}^N ∑_{j≠i} ( k(x_i, x_j) + k(y_i, y_j) )
      − (1/N²) ∑_{i=1}^N ∑_{j=1}^N ( k(x_i, y_j) + k(x_j, y_i) )
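The estimator above is a few lines of code; a sketch for equal sample sizes with a Gaussian RBF kernel (the bandwidth and the test distributions are illustrative):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between row-sample matrices a and b."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased estimator of MMD^2 for equal sample sizes N."""
    n = len(x)
    kxx, kyy, kxy = rbf(x, x, sigma), rbf(y, y, sigma), rbf(x, y, sigma)
    # within-sample terms: average over off-diagonal pairs only
    term_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2 * kxy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=(500, 1))
y_same = rng.normal(0, 1, size=(500, 1))   # same distribution as x
y_diff = rng.normal(1, 1, size=(500, 1))   # mean-shifted distribution
print(mmd2_unbiased(x, y_same) < mmd2_unbiased(x, y_diff))  # True
```

Excluding the diagonal in the within-sample terms is what makes the estimator unbiased; including it gives the (biased) V-statistic version.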
Estimation

Proof sketch:

MMD²(p, q, H) = ‖µ_p − µ_q‖²_H = ‖µ_p‖²_H + ‖µ_q‖²_H − 2〈µ_p, µ_q〉_H

where,

‖µ_p‖²_H = 〈µ_p, µ_p〉_H = E_{x∼p} µ_p(x) = E_{x∼p} 〈µ_p, k(x, ·)〉_H
        = E_{x∼p} E_{x′∼p} k(x, x′) ≈ 1/(N(N − 1)) ∑_{i=1}^N ∑_{j≠i} k(x_i, x_j)

Similarly for the other terms.
Considerations

- The distribution of MMD̂² under H_0 asymptotically approaches an infinite weighted sum of (shifted) chi-squared random variables, with weights given by the eigenvalues of the kernel operator.
  - Approximations to the sampling distribution exist [1, 10, 11, 12].
- Computing the naive MMD̂² estimator is O(N²).
  - Linear-time approximations exist [2, 26].
- Performance depends on the choice of kernel, and there is no universally best-performing kernel.
  - Kernel choice was previously heuristic; recent work instead optimises hyperparameters for test power or for Bayesian evidence [4, 24].
- The empirical kernel mean estimator might be suboptimal.
  - Better estimators exist [4, 16, 17].
Optimised MMD and model criticism (figure from Arthur Gretton's Twitter)
Optimised MMD and model criticism

MMD tends to lose test power with increasing dimensionality [18].

Pick the kernel such that the test power is maximised [24]:

P_{H_1}( (MMD̂² − MMD²)/√V_m > (ĉ_α/m − MMD²)/√V_m )
  → 1 − Φ( ĉ_α/(m √V_m) − MMD²/√V_m ) as m → ∞

where ĉ_α is an estimator of the theoretical rejection threshold c_α.
Optimised MMD and model criticism
Witness function evaluated on a test set, ARD weights highlighted [24].
Integral Probability Metrics

Have two probability measures p and q. An Integral Probability Metric (IPM) defines a discrepancy measure,

d_H(p, q) := sup_{f ∈ H} | E_{p(x)}[f(x)] − E_{q(y)}[f(y)] |

where the space of functions H must be rich enough that d_H(p, q) = 0 iff p = q.

f* := argmax_{f ∈ H} | E_p[f(x)] − E_q[f(y)] | is the witness function.

MMD is an IPM where H is the unit ball in a characteristic RKHS.
Stein Discrepancy

Construct H such that E_q[f(y)] = 0 for all f ∈ H:

d_H(p, q) = S_q(p, T, H) := sup_{f ∈ H} | E_{p(x)}[(T f)(x)] |

where T is a real-valued operator with E_q[(T f)(y)] = 0 for all f ∈ H.

A standard choice of T when x = x ∈ R is Stein's operator,

(T f)(x) = A_q f(x) := s_q(x) f(x) + ∇_x f(x)

where s_q(x) := ∇_x log q(x) is the score function of q.
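Stein's identity E_q[s_q(x) f(x) + ∇_x f(x)] = 0 can be sanity-checked by Monte Carlo: for q = N(0, 1) the score is s_q(x) = −x. A sketch (the test function f = sin is an arbitrary smooth bounded choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)        # samples from q = N(0, 1)

# Score of the standard normal: s_q(x) = d/dx log q(x) = -x.
# Stein's identity says E_q[s_q(x) f(x) + f'(x)] = 0 for smooth bounded f.
f, fprime = np.sin, np.cos
stein_term = -x * f(x) + fprime(x)
print(abs(stein_term.mean()) < 0.01)  # True
```

The key practical point is that s_q, and hence the Stein operator, only requires q up to a normalising constant, since the constant vanishes under ∇_x log.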
Kernelised Stein Discrepancy (KSD)

Assume that $p(x)$ and $q(y)$ are continuously differentiable densities; $q$ may be unnormalised [3, 15].

Use Stein's operator in the unit ball of the product RKHS $\mathcal{F}^D$:

$$\begin{aligned}
S_q(p, \mathcal{A}_q, \mathcal{F}^D) &:= \sup_{f \in \mathcal{F}^D,\, \|f\|_{\mathcal{F}^D} \le 1} \left| \mathbb{E}_{p(x)}\left[\mathrm{Tr}\left(\mathcal{A}_q f(x)\right)\right] \right| \\
&= \sup_{f \in \mathcal{F}^D,\, \|f\|_{\mathcal{F}^D} \le 1} \left| \mathbb{E}_{p(x)}\left[\mathrm{Tr}\left(f(x)\, s_q(x)^\top + \nabla_x f(x)\right)\right] \right| \\
&= \sup_{f \in \mathcal{F}^D,\, \|f\|_{\mathcal{F}^D} \le 1} \left| \sum_{d=1}^D \mathbb{E}_{p(x)}\left[f_d(x)\, s_{q,d}(x) + \frac{\partial f_d(x)}{\partial x_d}\right] \right| \\
&= \sup_{f \in \mathcal{F}^D,\, \|f\|_{\mathcal{F}^D} \le 1} \left| \langle f, \beta_q \rangle_{\mathcal{F}^D} \right| = \left\| \beta_q \right\|_{\mathcal{F}^D} \quad (\text{provided } \beta_q \in \mathcal{F}^D)
\end{aligned}$$

where $\langle f, g \rangle_{\mathcal{F}^D} := \sum_{d=1}^D \langle f_d, g_d \rangle_{\mathcal{F}}$ and $\beta_q := \mathbb{E}_p\left[k(x, \cdot)\, s_q(x) + \nabla_x k(x, \cdot)\right]$.
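A rough one-dimensional sketch (ours, not from the slides) of the resulting estimator: $\|\beta_q\|^2$ expands into pairwise kernel terms that need only the score $s_q$, here estimated with the simple biased V-statistic and an RBF kernel of arbitrary bandwidth $h$:

```python
import numpy as np

def ksd2_vstat(X, score_q, h=1.0):
    """Biased (V-statistic) estimate of KSD^2 for samples X ~ p against a
    density q known only through its score s_q(x) = d/dx log q(x)."""
    X = np.asarray(X, dtype=float)
    d = X[:, None] - X[None, :]            # pairwise differences x - x'
    K = np.exp(-d**2 / (2 * h**2))         # RBF kernel matrix
    dKx = -d / h**2 * K                    # d/dx  k(x, x')
    dKy = d / h**2 * K                     # d/dx' k(x, x')
    dKxy = (1 / h**2 - d**2 / h**4) * K    # d^2/(dx dx') k(x, x')
    s = score_q(X)
    # Pairwise Stein kernel: s(x)s(x')k + s(x) d_x'k + s(x') d_x k + d_x d_x' k
    U = (s[:, None] * s[None, :] * K + s[:, None] * dKy
         + s[None, :] * dKx + dKxy)
    return U.mean()

rng = np.random.default_rng(0)
score = lambda x: -x                       # score of q = N(0, 1)
X_match = rng.normal(0.0, 1.0, 1000)       # p = q
X_off = rng.normal(1.0, 1.0, 1000)         # p shifted away from q
print(ksd2_vstat(X_match, score))          # small (bias of order 1/n)
print(ksd2_vstat(X_off, score))            # clearly larger
```

The unbiased U-statistic variant simply drops the diagonal pairs.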
KSD vs. alternative GoF tests
Variational Inference using KSD

Natural idea: minimise KSD between the true and an approximate posterior distribution → a particular case of Operator Variational Inference [19].

Quick detour: for the true posterior $p(x) = \tilde{p}(x)/Z_p$ and an approximation $q(x) = \tilde{q}(x)/Z_q$,

$$\begin{aligned}
S(p, q) &= \sup_{f \in \mathcal{F}^D,\, \|f\|_{\mathcal{F}^D} \le 1} \left| \mathbb{E}_{q(x)}\left[\mathrm{Tr}\left(\mathcal{A}_p f(x)\right)\right] \right| \\
&= \sup_{f \in \mathcal{F}^D,\, \|f\|_{\mathcal{F}^D} \le 1} \left| \mathbb{E}_{q(x)}\left[\mathrm{Tr}\left(\mathcal{A}_p f(x) - \mathcal{A}_q f(x)\right)\right] \right| \\
&= \sup_{f \in \mathcal{F}^D,\, \|f\|_{\mathcal{F}^D} \le 1} \left| \mathbb{E}_{q(x)}\left[f(x)^\top \left(s_p(x) - s_q(x)\right)\right] \right|
\end{aligned}$$

and hence

$$S(p, q)^2 = \mathbb{E}_{q(x)}\, \mathbb{E}_{q(\bar{x})} \left[ k(x, \bar{x}) \left(s_p(x) - s_q(x)\right)^\top \left(s_p(\bar{x}) - s_q(\bar{x})\right) \right].$$

Note that the scores $s_p = \nabla_x \log \tilde{p}$ and $s_q = \nabla_x \log \tilde{q}$ do not depend on the normalisers $Z_p$, $Z_q$.

→ take $q(x) = \int q(x, \epsilon)\, \mathrm{d}\epsilon$, e.g. $q(x, \epsilon) = \mathcal{N}(x \mid T_\theta(\epsilon), \sigma^2 I)\, \mathcal{N}(\epsilon \mid 0, I)$
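The score-difference form above can be estimated by plain Monte Carlo using samples from $q$ and the two scores only, so an unnormalised $p$ is fine. A toy check (ours, with a hypothetical Gaussian family) showing the discrepancy is zero exactly when the scores match:

```python
import numpy as np

def ksd2_score_form(Xq, score_p, score_q, h=1.0):
    """Monte Carlo estimate of S(p, q)^2 via the score-difference form
    E_{x, x' ~ q}[k(x, x') (s_p(x) - s_q(x)) (s_p(x') - s_q(x'))]."""
    d = score_p(Xq) - score_q(Xq)          # score differences at the samples
    K = np.exp(-(Xq[:, None] - Xq[None, :])**2 / (2 * h**2))
    return (d[:, None] * d[None, :] * K).mean()

rng = np.random.default_rng(0)
Xq = rng.normal(0.0, 1.0, 2000)            # samples from q = N(0, 1)
s_q = lambda x: -x
# Unnormalised target p ∝ exp(-(x - mu)^2 / 2): its score -(x - mu) is free
# of the normaliser, which is exactly why the score form is convenient.
for mu in [0.0, 0.5, 1.0]:
    s_p = lambda x, m=mu: -(x - m)
    print(mu, ksd2_score_form(Xq, s_p, s_q))  # 0 at mu = 0, growing with mu
```

In a variational scheme one would differentiate such an estimate with respect to the parameters of $q$ (e.g. via the reparameterisation $x = T_\theta(\epsilon)$ above).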
Learning to Sample using KSD
Two papers [15, 25] on optimising a set of particles, resp. a set of samples from a model (amortisation), by repeatedly applying KL-optimal perturbations $x_t = x_{t-1} + \epsilon_t f^*(x_{t-1})$.
Read Yingzhen’s blog post [13] and a recent paper [14]!
Random walk through the latent space of a GAN trained with KSD adversary. [25]
Outline
RKHS
Kernel Mean Embeddings
Characteristic kernels
Two Sample Testing
MMD
Kernelised Stein Discrepancy
Kernel Bayes’ Rule
Tensor product

If $\mathcal{H}$ and $\mathcal{K}$ are Hilbert spaces then so is $\mathcal{H} \otimes \mathcal{K}$.

If $\{a_i\}_{i=1}^\infty$ spans $\mathcal{H}$ and $\{b_j\}_{j=1}^\infty$ spans $\mathcal{K}$, then $\{a_i \otimes b_j\}_{i,j=1}^\infty$ spans $\mathcal{H} \otimes \mathcal{K}$.

For any $f, f' \in \mathcal{H}$ and $g, g' \in \mathcal{K}$ we have

$$\langle f \otimes g, f' \otimes g' \rangle = \langle f, f' \rangle \langle g, g' \rangle.$$

$\mathcal{H} \otimes \mathcal{K}$ is isomorphic to a space of bounded linear operators $\mathcal{K} \to \mathcal{H}$, acting as

$$(f \otimes g)\, g' = f\, \langle g, g' \rangle.$$
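In finite dimensions the elementary tensor $f \otimes g$ is just the outer product $f g^\top$, which makes the operator identity easy to sanity-check numerically (our illustration, not from the slides):

```python
import numpy as np

# In R^m ⊗ R^n the elementary tensor f ⊗ g is the outer product f g^T,
# which acts on g' ∈ R^n by matrix-vector multiplication; the result
# should equal f scaled by the inner product <g, g'>.
rng = np.random.default_rng(0)
f = rng.normal(size=3)
g, gp = rng.normal(size=4), rng.normal(size=4)
lhs = np.outer(f, g) @ gp     # (f ⊗ g) applied to g'
rhs = f * (g @ gp)            # f <g, g'>
print(np.allclose(lhs, rhs))  # True
```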
Covariance operators

Covariance operators $C_{XY} : \mathcal{K} \to \mathcal{H}$ are defined as

$$C_{XY} = \mathbb{E}[k_X(X, \cdot) \otimes k_Y(Y, \cdot)], \qquad C_{XX} = \mathbb{E}[k_X(X, \cdot) \otimes k_X(X, \cdot)].$$

For any $f \in \mathcal{H}$ and $g \in \mathcal{K}$,

$$\langle f, C_{XY}\, g \rangle = \langle C_{YX}\, f, g \rangle = \mathbb{E}[f(X)\, g(Y)].$$

Empirical estimators: given $(X_i, Y_i) \sim_{\text{i.i.d.}} P(X, Y)$,

$$\widehat{C}_{XY} = \frac{1}{N} \sum_{i=1}^N k_X(X_i, \cdot) \otimes k_Y(Y_i, \cdot), \qquad \widehat{C}_{XX} = \frac{1}{N} \sum_{i=1}^N k_X(X_i, \cdot) \otimes k_X(X_i, \cdot).$$
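With an explicit finite-dimensional feature map the empirical covariance operator becomes an ordinary matrix, and the identity $\langle f, \widehat{C}_{XY} g \rangle = \widehat{\mathbb{E}}[f(X) g(Y)]$ can be checked directly. A small sketch (ours), using the feature map of a polynomial kernel as a hypothetical choice:

```python
import numpy as np

def phi(x):
    """Feature map with phi(x)^T phi(y) = (1 + x y)^2 (polynomial kernel)."""
    return np.stack([np.ones_like(x), np.sqrt(2) * x, x**2], axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=200)
Y = 0.5 * X + 0.1 * rng.normal(size=200)

# Empirical covariance operator as a matrix: (1/N) sum_i phi(X_i) phi(Y_i)^T.
C_XY = phi(X).T @ phi(Y) / len(X)

# f and g given by coefficient vectors in feature space.
a, b = rng.normal(size=3), rng.normal(size=3)
lhs = a @ C_XY @ b                           # <f, C_XY g>
rhs = np.mean((phi(X) @ a) * (phi(Y) @ b))   # empirical E[f(X) g(Y)]
print(np.isclose(lhs, rhs))                  # True
```

For a genuinely infinite-dimensional RKHS one works instead with Gram matrices, but the algebra is the same.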
Conditional embeddings

Let

$$\mu_{X|Y=y} = \mathbb{E}[k_X(X, \cdot) \mid Y = y].$$

We seek an operator $\mathcal{U}_{X|Y}$ such that for all $y$,

$$\mu_{X|Y=y} = \mathcal{U}_{X|Y}\, k_Y(y, \cdot),$$

which we call the conditional mean embedding.
Conditional embeddings

Lemma ([21]). For any $f \in \mathcal{H}$ let $h(y) = \mathbb{E}[f(X) \mid Y = y]$ and assume that $h \in \mathcal{K}$. Then

$$C_{YY}\, h = C_{YX}\, f.$$

Proof. Take any $g \in \mathcal{K}$:

$$\begin{aligned}
\langle g, C_{YY}\, h \rangle &= \mathbb{E}_{Y \sim P(Y)}[g(Y)\, h(Y)] = \mathbb{E}_{Y \sim P(Y)}\left[g(Y)\, \mathbb{E}_{X \sim P(X|Y)}[f(X) \mid Y]\right] \\
&= \mathbb{E}_{Y \sim P(Y)}\left[\mathbb{E}_{X \sim P(X|Y)}[g(Y)\, f(X) \mid Y]\right] = \mathbb{E}_{X,Y \sim P(X,Y)}[g(Y)\, f(X)] = \langle g, C_{YX}\, f \rangle.
\end{aligned}$$
Conditional embeddings

Whenever $C_{YY}$ is invertible we have

$$\mathcal{U}_{X|Y} = C_{XY}\, C_{YY}^{-1}.$$

In practice we use a regularised version

$$\mathcal{U}^\epsilon_{X|Y} = C_{XY}\, (C_{YY} + \epsilon I)^{-1},$$

which can be estimated by

$$\widehat{\mathcal{U}}^\epsilon_{X|Y} = \widehat{C}_{XY}\, (\widehat{C}_{YY} + \epsilon I)^{-1}.$$
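In Gram-matrix form, applying the empirical estimator to $k_Y(y, \cdot)$ gives $\widehat{\mu}_{X|Y=y} = \sum_i w_i\, k_X(X_i, \cdot)$ with weights $w = (K_Y + N\epsilon I)^{-1} k_Y(y)$, so $\widehat{\mathbb{E}}[f(X) \mid Y = y] = \sum_i w_i f(X_i)$ for any $f$ in the RKHS on $X$. A small sketch (ours; kernel, bandwidth, and $\epsilon$ are arbitrary choices):

```python
import numpy as np

def cme_weights(Y_train, y, gamma=0.5, eps=1e-2):
    """Weights w such that mu_hat_{X|Y=y} = sum_i w_i k_X(X_i, .), from the
    regularised estimator C_XY (C_YY + eps I)^{-1} applied to k_Y(y, .)."""
    K = np.exp(-gamma * (Y_train[:, None] - Y_train[None, :])**2)
    k_y = np.exp(-gamma * (Y_train - y)**2)
    N = len(Y_train)
    return np.linalg.solve(K + N * eps * np.eye(N), k_y)

rng = np.random.default_rng(0)
Y = rng.normal(size=1000)
X = 2.0 * Y + 0.1 * rng.normal(size=1000)  # so E[X | Y = y] = 2y

w = cme_weights(Y, y=1.0)
# With f(x) = x (the linear-kernel RKHS on X) the weighted sum recovers
# the conditional mean E[X | Y = 1].
print(w @ X)  # ≈ 2
```

The same weights serve for every $f$ at once, which is the point of embedding the whole conditional distribution.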
Kernel Bayes' rule

We could compute the posterior embedding by

$$\mu_{X|Y=y} = \mathcal{U}_{X|Y}\, k_Y(y, \cdot),$$

but we may want a different prior.

Setting:

- let $P(X, Y)$ be the joint distribution of the model
- let $Q(X, Y)$ be a different distribution such that $P(Y \mid X = x) = Q(Y \mid X = x)$ for all $x$
- we have samples $(X_i, Y_i) \sim Q$ and $X_j \sim P(X)$
- we want to estimate the posterior embedding $\mathbb{E}_{X \sim P(X|Y=y)}[k_X(X, \cdot)]$ for some particular $y$
Kernel sum rule

Sum rule:
$$P(X) = \sum_Y P(X, Y)$$

Kernel sum rule:
$$\mu_X = \mathcal{U}_{X|Y}\, \mu_Y$$

Corollary:
$$C_{XX} = \mathcal{U}_{XX|Y}\, \mu_Y$$
Kernel product rule

Product rule:
$$P(X, Y) = P(X \mid Y)\, P(Y)$$

Kernel product rule:
$$C_{XY} = \mathcal{U}_{X|Y}\, C_{YY}$$
Kernel Bayes' rule

Observe that

$$C_{XY} = \mathcal{U}_{Y|X}\, C_{XX}, \qquad C_{YY} = \mathcal{U}_{YY|X}\, \mu_X,$$

and so

$$\mathcal{U}_{X|Y} = C_{XY}\, C_{YY}^{-1} = (\mathcal{U}_{Y|X}\, C_{XX})(\mathcal{U}_{YY|X}\, \mu_X)^{-1}$$

lets us express $\mathcal{U}_{X|Y}$ in terms of $\mathcal{U}_{Y|X}$.
Reminder

Setting:

- let $P(X, Y)$ be the joint distribution of the model
- let $Q(X, Y)$ be a different distribution such that $P(Y \mid X = x) = Q(Y \mid X = x)$ for all $x$
- we have samples $(X_i, Y_i) \sim Q$ and $X_j \sim P(X)$
- we want to estimate the posterior embedding $\mathbb{E}_{X \sim P(X|Y=y)}[k_X(X, \cdot)]$ for some particular $y$
Kernel Bayes' rule

We have

$$\mathcal{U}^P_{Y|X} = \mathcal{U}^Q_{Y|X}, \qquad \mathcal{U}^P_{X|Y} \neq \mathcal{U}^Q_{X|Y},$$

but

$$\mathcal{U}^P_{X|Y} = (\mathcal{U}^P_{Y|X}\, C^P_{XX})(\mathcal{U}^P_{YY|X}\, \mu^P_X)^{-1} = (\mathcal{U}^Q_{Y|X}\, C^P_{XX})(\mathcal{U}^Q_{YY|X}\, \mu^P_X)^{-1},$$

so

$$\mu^P_{X|Y=y} = (\mathcal{U}^Q_{Y|X}\, C^P_{XX})(\mathcal{U}^Q_{YY|X}\, \mu^P_X)^{-1}\, k_Y(y, \cdot).$$
Kernel Bayes' rule

In practice we use the following estimator,

$$\widehat{\mu}_{X|Y=y} = (\widehat{\mathcal{U}}^Q_{Y|X}\, \widehat{C}^P_{XX}) \left((\widehat{\mathcal{U}}^Q_{YY|X}\, \widehat{\mu}^P_X)^2 + \delta I\right)^{-1} (\widehat{\mathcal{U}}^Q_{YY|X}\, \widehat{\mu}^P_X)\, k_Y(y, \cdot),$$

which can be written as

$$\widehat{\mu}_{X|Y=y} = \widehat{A}\, (\widehat{B}^2 + \delta I)^{-1}\, \widehat{B}\, k_Y(y, \cdot),$$
$$\widehat{A} = \widehat{\mathcal{U}}^Q_{Y|X}\, \widehat{C}^P_{XX} = \widehat{C}^Q_{YX}\, (\widehat{C}^Q_{XX} + \epsilon I)^{-1}\, \widehat{C}^P_{XX},$$
$$\widehat{B} = \widehat{\mathcal{U}}^Q_{YY|X}\, \widehat{\mu}^P_X = \widehat{C}^Q_{YYX}\, (\widehat{C}^Q_{XX} + \epsilon I)^{-1}\, \widehat{\mu}^P_X.$$

For $\epsilon = N^{-1/3}$ and $\delta = N^{-4/27}$ it can be shown [7] that

$$\left\| \widehat{\mu}_{X|Y=y} - \mu_{X|Y=y} \right\| = O_p(N^{-4/27}).$$
Kernel Bayes' rule

When is Kernel Bayes' Rule useful?

- when densities aren't tractable (ABC)
- when you don't know how to write a model but you know how to pick a kernel
- perhaps it can perform better than alternatives even if the above aren't satisfied
Other topics

Other topics combining KMEs with Bayesian inference:

- adaptive Metropolis-Hastings using KME [20]
- Hamiltonian Monte Carlo without gradients [23]
- Bayesian estimation of KMEs [4]
References I
M. A. Arcones and E. Giné.
On the Bootstrap of U and V Statistics.
Ann. Statist., 20(2):655–674, 1992.
K. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton.
Fast Two-Sample Testing with Analytic Representations of Probability Measures.
ArXiv e-prints, 2015.
K. Chwialkowski, H. Strathmann, and A. Gretton.
A Kernel Test of Goodness of Fit.
In Proceedings of the International Conference on Machine Learning (ICML), 2016.
S. Flaxman, D. Sejdinovic, J. P. Cunningham, and S. Filippi.
Bayesian learning of kernel embeddings.
In UAI, 2016.
References II

K. Fukumizu, F. R. Bach, and M. I. Jordan.
Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces.
JMLR, 5:73–99, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf.
Kernel Measures of Conditional Dependence.
In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07, pages 489–496, 2007.

K. Fukumizu, L. Song, and A. Gretton.
Kernel Bayes' rule: Bayesian inference with positive definite kernels.
Journal of Machine Learning Research, 14:3753–3783, 2013.
References III

A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola.
A Kernel Method for the Two-sample-problem.
In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS'06, pages 513–520, 2006.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola.
A Kernel Two-sample Test.
JMLR, 13:723–773, Mar. 2012.

A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur.
A Fast, Consistent Kernel Two-sample Test.
In Proceedings of the 22nd International Conference on Neural Information Processing Systems, NIPS'09, pages 673–681, 2009.
References IV

W. Hoeffding.
Probability Inequalities for Sums of Bounded Random Variables.
Journal of the American Statistical Association, 58(301):13–30, 1963.
N. L. Johnson, S. Kotz, and N. Balakrishnan.
Continuous Univariate Distributions.
Wiley, 1994.
Y. Li.
Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm, August 2016.
Y. Li, R. E. Turner, and Q. Liu.
Approximate Inference with Amortised MCMC.
arXiv preprint arXiv:1702.08343, 2017.
References V

Q. Liu, J. D. Lee, and M. I. Jordan.
A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation.
arXiv preprint arXiv:1602.03253, 2016.

K. Muandet, B. Sriperumbudur, K. Fukumizu, A. Gretton, and B. Schölkopf.
Kernel mean shrinkage estimators.
Journal of Machine Learning Research, 17(48):1–41, 2016.

K. Muandet, B. Sriperumbudur, and B. Schölkopf.
Kernel mean estimation via spectral filtering.
In Advances in Neural Information Processing Systems, pages 1–9, 2014.
References VI

A. Ramdas, S. J. Reddi, B. Póczos, A. Singh, and L. Wasserman.
On the high-dimensional power of linear-time kernel two-sample testing under mean-difference alternatives.
arXiv preprint arXiv:1411.6314, 2014.

R. Ranganath, D. Tran, J. Altosaar, and D. Blei.
Operator Variational Inference.
In Advances in Neural Information Processing Systems, pages 496–504, 2016.

D. Sejdinovic, H. Strathmann, M. L. Garcia, C. Andrieu, and A. Gretton.
Kernel adaptive Metropolis-Hastings.
In ICML, 2014.
References VII

L. Song, J. Huang, A. Smola, and K. Fukumizu.
Hilbert space embeddings of conditional distributions with applications to dynamical systems.
In ICML, 2009.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet.
Hilbert Space Embeddings and Metrics on Probability Measures.
JMLR, 11:1517–1561, 2010.

H. Strathmann, D. Sejdinovic, S. Livingstone, Z. Szabó, and A. Gretton.
Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families.
In NIPS, 2015.
References VIII

D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton.
Generative models and model criticism via optimized maximum mean discrepancy.
arXiv preprint arXiv:1611.04488, 2016.

D. Wang and Q. Liu.
Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning.
arXiv preprint arXiv:1611.01722, 2016.

Q. Zhang, S. Filippi, A. Gretton, and D. Sejdinovic.
Large-Scale Kernel Methods for Independence Testing.
ArXiv e-prints, 2016.