On the Relation Between Universality, Characteristic ...cc-universality [Micchelli et al., 2006] X:...

Post on 09-Jun-2020

10 views 0 download

Transcript of On the Relation Between Universality, Characteristic ...cc-universality [Micchelli et al., 2006] X:...

On the Relation Between Universality,Characteristic Kernels and RKHS Embedding of

Measures

Bharath K. Sriperumbudur⋆, Kenji Fukumizu† andGert R. G. Lanckriet⋆

⋆UC San Diego †The Institute of Statistical Mathematics

AISTATS 2010

Outline

RKHS embedding of probability measures

Characteristic kernels

Universal kernels

Various notions of universality

Novel characterization of universality

Relation to RKHS embedding of signed measures

RKHS Embedding of Probability Measures

Input space : X

Feature space : H (with reproducing kernel, k)

Feature map : Φ

Φ : X → H x 7→ Φ(x) := k(·, x)

RKHS Embedding of Probability Measures

Input space : X

Feature space : H (with reproducing kernel, k)

Feature map : Φ

Φ : X → H x 7→ Φ(x) := k(·, x)

Extension to probability measures:

P 7→ Φ(P) :=

X

k(·, x) dP(x)

RKHS Embeddings of Probability Measures

Input space : X

Feature space : H (with reproducing kernel, k)

Feature map : Φ

Φ : X → H x 7→ Φ(x) := k(·, x)

Extension to probability measures:

P 7→ Φ(P) :=

X

k(·, x) dP(x)

︸ ︷︷ ︸EY∼P[Φ(Y )]=EY∼P[k(·,Y )]

RKHS Embeddings of Probability Measures

Input space : X

Feature space : H (with reproducing kernel, k)

Feature map : Φ

Φ : X → H x 7→ Φ(x) := k(·, x)

Extension to probability measures:

P 7→ Φ(P) :=

X

k(·, x) dP(x)

Advantage: Φ(P) can distinguish P by high-order moments.

k(y , x) = c0 + c1(xy) + c2(xy)2 + · · · (ci 6= 0) e.g. k(y , x) = exy

Φ(P)(y) = c0 + c1

(∫

X

x dP(x)

)y + c2

(∫

X

x2 dP(x)

)y2 + · · ·

Applications

Two-sample problem:

Given random samples X1, . . . ,Xm and Y1, . . . ,Yn drawn i.i.d.from P and Q, respectively.

Determine: are P and Q different?

Applications

Two-sample problem:

Given random samples X1, . . . ,Xm and Y1, . . . ,Yn drawn i.i.d.from P and Q, respectively.

Determine: are P and Q different?

γ(P,Q) = ‖Φ(P)− Φ(Q)‖H : distance metric between P and Q.

H0 : P = Q H0 : γ(P,Q) = 0≡

H1 : P 6= Q H1 : γ(P,Q) > 0

Test: Say H0 if γ(P,Q) < ε. Otherwise say H1.

Applications

Two-sample problem:

Given random samples X1, . . . ,Xm and Y1, . . . ,Yn drawn i.i.d.from P and Q, respectively.

Determine: are P and Q different?

γ(P,Q) = ‖Φ(P)− Φ(Q)‖H : distance metric between P and Q.

H0 : P = Q H0 : γ(P,Q) = 0≡

H1 : P 6= Q H1 : γ(P,Q) > 0

Test: Say H0 if γ(P,Q) < ε. Otherwise say H1.

Other applications:

Hypothesis testing : Independence test, Goodness of fit test, etc.

Feature selection, message passing, density estimation, etc.

Characteristic Kernels

Define: k is characteristic if

P 7→

X

k(·, x) dP(x) is injective.

In other words,∫

X

k(·, x) dP(x) =

X

k(·, x) dQ(x) ⇔ P = Q.

When k(·, x) = e√−1〈·,x〉, Φ(P) is the characteristic function of P.

Characteristic Kernels

Define: k is characteristic if

P 7→

X

k(·, x) dP(x) is injective.

In other words,∫

X

k(·, x) dP(x) =

X

k(·, x) dQ(x) ⇔ P = Q.

When k(·, x) = e√−1〈·,x〉, Φ(P) is the characteristic function of P.

Not all kernels are characteristic, e.g., k(x , y) = xT y .

µP = µQ ; P = Q

When is k characteristic? [Gretton et al., 2007,Sriperumbudur et al., 2008, Fukumizu et al., 2008,Fukumizu et al., 2009, Sriperumbudur et al., 2009].

Universal Kernels Regularization approach to supervised learning

minf∈H

1

n

n∑

i=1

ℓ(f (xi ), yi ) + λΩ[f ], (1)

where λ > 0 and (xi , yi )ni=1 is the training data.

Universal Kernels Regularization approach to supervised learning

minf∈H

1

n

n∑

i=1

ℓ(f (xi ), yi ) + λΩ[f ], (1)

where λ > 0 and (xi , yi )ni=1 is the training data.

Representer theorem : The solution to (1) is of the form

f =

n∑

i=1

cik(·, xi ),

where cini=1 ⊂ R are the parameters typically obtained from the

training data.

Universal Kernels Regularization approach to supervised learning

minf∈H

1

n

n∑

i=1

ℓ(f (xi ), yi ) + λΩ[f ], (1)

where λ > 0 and (xi , yi )ni=1 is the training data.

Representer theorem : The solution to (1) is of the form

f =

n∑

i=1

cik(·, xi ),

where cini=1 ⊂ R are the parameters typically obtained from the

training data.

Question: Can f approximate any target function arbitrarily “well”as n → ∞?

Universal Kernels Regularization approach to supervised learning

minf∈H

1

n

n∑

i=1

ℓ(f (xi ), yi ) + λΩ[f ], (1)

where λ > 0 and (xi , yi )ni=1 is the training data.

Representer theorem : The solution to (1) is of the form

f =

n∑

i=1

cik(·, xi ),

where cini=1 ⊂ R are the parameters typically obtained from the

training data.

Question: Can f approximate any target function arbitrarily “well”as n → ∞?

We need H to be “dense” in the space of target functions — k isuniversal.

Various Notions of Universality

Prior work

c-universality [Steinwart, 2001]

cc-universality [Micchelli et al., 2006]

Proposed notion: c0-universality

Characterization of c-, cc- and c0-universality : Relation to RKHSembedding of measures

Translation invariant kernels on Rd

Radial kernels on Rd

c-universality [Steinwart, 2001]

X : compact metric space

k : continuous on X × X

Target function space : C (X ), continuous functions on X

Define k to be c-universal if H is dense in C (X ) w.r.t. the uniform norm(‖f ‖u := supx∈X |f (x)|).

c-universality [Steinwart, 2001]

X : compact metric space

k : continuous on X × X

Target function space : C (X ), continuous functions on X

Define k to be c-universal if H is dense in C (X ) w.r.t. the uniform norm(‖f ‖u := supx∈X |f (x)|).

Sufficient conditions are obtained based on the Stone-Weierstraßtheorem. Not easy to check!

Examples: Gaussian and Laplacian kernels on any compact subset ofRd .

c-universality [Steinwart, 2001]

X : compact metric space

k : continuous on X × X

Target function space : C (X ), continuous functions on X

Define k to be c-universal if H is dense in C (X ) w.r.t. the uniform norm(‖f ‖u := supx∈X |f (x)|).

Sufficient conditions are obtained based on the Stone-Weierstraßtheorem. Not easy to check!

Examples: Gaussian and Laplacian kernels on any compact subset ofRd .

Issue: X is compact which excludes many interesting spaces, such as Rd .

cc-universality [Micchelli et al., 2006]

X : Hausdorff space

k : continuous on X × X

Target function space : C (X )

Define k to be cc-universal if H is dense in C (X ) endowed with thetopology of compact convergence.

cc-universality [Micchelli et al., 2006]

X : Hausdorff space

k : continuous on X × X

Target function space : C (X )

Define k to be cc-universal if H is dense in C (X ) endowed with thetopology of compact convergence.

In other words, for any compact set Z ⊂ X , H|Z := f|Z : f ∈ H isdense in C (Z ) w.r.t. ‖ · ‖u.

cc-universality [Micchelli et al., 2006]

X : Hausdorff space

k : continuous on X × X

Target function space : C (X )

Define k to be cc-universal if H is dense in C (X ) endowed with thetopology of compact convergence.

Necessary and sufficient conditions are obtained, which are relatedto the injectivity of RKHS embedding of measures.

Examples: Gaussian, Laplacian and Sinc kernels on Rd .

cc-universality [Micchelli et al., 2006]

X : Hausdorff space

k : continuous on X × X

Target function space : C (X )

Define k to be cc-universal if H is dense in C (X ) endowed with thetopology of compact convergence.

Necessary and sufficient conditions are obtained, which are relatedto the injectivity of RKHS embedding of measures.

Examples: Gaussian, Laplacian and Sinc kernels on Rd .

Issue: Topology of compact convergence is weaker than the topology ofuniform convergence.

Proposed Notion: c0-universality

X : locally compact Hausdorff (LCH) space

Target function space : C0(X ), the space of bounded continuousfunctions that “vanish at infinity” (for every ǫ > 0,x ∈ X : |f (x)| ≥ ǫ is compact).

k is bounded and k(·, x) ∈ C0(X ) for all x ∈ X .

Proposed Notion: c0-universality

X : locally compact Hausdorff (LCH) space

Target function space : C0(X ), the space of bounded continuousfunctions that “vanish at infinity” (for every ǫ > 0,x ∈ X : |f (x)| ≥ ǫ is compact).

k is bounded and k(·, x) ∈ C0(X ) for all x ∈ X .

Define k to be c0-universal if H is dense in C0(X ) w.r.t. ‖ · ‖u.

Handles non-compact X and ensures uniform convergence overentire X .

Embedding Characterization of UniversalityTheorem

k is c0-universal if and only if

µ 7→

X

k(·, x) dµ(x), µ ∈ Mb(X ),

is injective. Mb(X ) is the space of finite signed Radon measures onX .

Embedding Characterization of UniversalityTheorem

k is c0-universal if and only if

µ 7→

X

k(·, x) dµ(x), µ ∈ Mb(X ),

is injective. Mb(X ) is the space of finite signed Radon measures onX .

k is cc-universal if and only if

µ 7→

X

k(·, x) dµ(x), µ ∈ Mbc(X ),

is injective. Mbc(X ) = µ ∈ Mb(X ) | supp(µ) is compact.

Embedding Characterization of UniversalityTheorem

k is c0-universal if and only if

µ 7→

X

k(·, x) dµ(x), µ ∈ Mb(X ),

is injective. Mb(X ) is the space of finite signed Radon measures onX .

k is cc-universal if and only if

µ 7→

X

k(·, x) dµ(x), µ ∈ Mbc(X ),

is injective. Mbc(X ) = µ ∈ Mb(X ) | supp(µ) is compact.

k is c-universal if and only if

µ 7→

X

k(·, x) dµ(x), µ ∈ Mb(X ),

is injective.

Postive Definite Characterization of Universality

Theorem

k is c0-universal (resp. c-universal) if and only if

X

X

k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mb(X )\0.

Postive Definite Characterization of Universality

Theorem

k is c0-universal (resp. c-universal) if and only if

X

X

k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mb(X )\0.

k is cc-universal if and only if

X

X

k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mbc(X )\0.

Postive Definite Characterization of Universality

Theorem

k is c0-universal (resp. c-universal) if and only if

X

X

k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mb(X )\0.

k is cc-universal if and only if

X

X

k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mbc(X )\0.

If k is c-, cc- or c0-universal, then it is strictly positive definite.

X is an LCH space: Summary

Translation Invariant Kernels on Rd

X = Rd and k(x , y) = ψ(x − y), where

ψ(x) =

Rd

e√−1xTω dΛ(ω), x ∈ Rd ,

and Λ is a non-negative finite Borel measure.

Translation Invariant Kernels on Rd

X = Rd and k(x , y) = ψ(x − y), where

ψ(x) =

Rd

e√−1xTω dΛ(ω), x ∈ Rd ,

and Λ is a non-negative finite Borel measure.

Theorem

k is c0-universal if and only if supp(Λ) = Rd .

k is c0-universal if and only if it is characteristic.

If supp(Λ) has a non-empty interior, then k is cc-universal.[Micchelli et al., 2006]

Examples

Gaussian kernel: ψ(x) = e−x2/2σ2

; Ψ(ω) = σe−σ2ω2/2; dΛ(ω) = Ψ(ω) dω.

−4 −3 −2 −1 0 1 2 3 40

1

x

ψ(x

)

−4 −3 −2 −1 0 1 2 3 40

σ

Ψ(ω

)

Laplacian kernel: ψ(x) = e−σ|x|; Ψ(ω) =√

σσ2+ω2 .

−4 −3 −2 −1 0 1 2 3 40

1

x

ψ(x

)

−4 −3 −2 −1 0 1 2 3 40

(2/πσ2)1/2

Ψ(ω

)

Examples

B1-spline kernel: ψ(x) = (1− |x |)1[−1,1](x); Ψ(ω) = 2√

2√π

sin2(ω2)

ω2 .

−3 −2 −1 0 1 2 30

1

x

ψ(x

)

0

0.2

0.4

0.6

0.8

1

−8π −6π −4π −2π 0 2π 4π 6π 8π

(2π)

1/2 Ψ

(ω)

Sinc kernel: ψ(x) = sin(σx)x

; Ψ(ω) =√

π21[−σ,σ](ω).

−0.2

0

0.2

0.4

0.6

0.8

1

−6π −5π −4π −3π −2π −π 0 π 2π 3π 4π 5π 6π

ψ(x

)

Ψ(ω

)

−3 −2 −1 0 1 2 30

(π/2)1/2

Translation Invariant Kernels on Rd : Summary

Radial Kernels on Rd

Let

k(x , y) =

[0,∞)

e−t‖x−y‖22 dν(t),

where ν is a finite non-negative Borel measure on [0,∞).

Examples: Gaussian kernel, Inverse multi-quadratic kernel,k(x , y) = (c2 + ‖x − y‖22)

−β , β > d2 , c > 0, etc.

Radial Kernels on Rd

Let

k(x , y) =

[0,∞)

e−t‖x−y‖22 dν(t),

where ν is a finite non-negative Borel measure on [0,∞).

Examples: Gaussian kernel, Inverse multi-quadratic kernel,k(x , y) = (c2 + ‖x − y‖22)

−β , β > d2 , c > 0, etc.

TheoremThe following conditions are equivalent.

supp(ν) 6= 0.

k is c0-universal.

k is cc-universal.

k is characteristic.

k is strictly pd.

Radial Kernels on Rd : Summary

Summary

Characteristic kernel

Injective RKHS embedding of probability measures.

Applications: Hypothesis testing, feature selection, etc.

Universal kernel

Consistency of learning algorithms.

Injective RKHS embedding of finite signed Radon measures.

Clarified the relation between various notions of universality andcharacteristic kernels.

References

Fukumizu, K., Gretton, A., Sun, X., and Scholkopf, B. (2008).Kernel measures of conditional dependence.In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 489–496,Cambridge, MA. MIT Press.

Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Scholkopf, B. (2009).Characteristic kernels on groups and semigroups.In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages473–480.

Gretton, A., Borgwardt, K. M., Rasch, M., Scholkopf, B., and Smola, A. (2007).A kernel method for the two sample problem.In Scholkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 513–520. MITPress.

Micchelli, C. A., Xu, Y., and Zhang, H. (2006).Universal kernels.Journal of Machine Learning Research, 7:2651–2667.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Lanckriet, G. R. G., and Scholkopf, B. (2009).Kernel choice and classifiability for RKHS embeddings of probability distributions.In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A., editors, Advances in Neural Information Processing

Systems 22, pages 1750–1758. MIT Press.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Scholkopf, B. (2008).Injective Hilbert space embeddings of probability measures.In Servedio, R. and Zhang, T., editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122.

Steinwart, I. (2001).On the influence of the kernel on the consistency of support vector machines.Journal of Machine Learning Research, 2:67–93.