On the Relation Between Universality, Characteristic ...cc-universality [Micchelli et al., 2006] X:...

On the Relation Between Universality,Characteristic Kernels and RKHS Embedding of

Measures

Bharath K. Sriperumbudur⋆, Kenji Fukumizu† andGert R. G. Lanckriet⋆

⋆UC San Diego †The Institute of Statistical Mathematics

AISTATS 2010

Outline

RKHS embedding of probability measures

Characteristic kernels

Universal kernels

Various notions of universality

Novel characterization of universality

Relation to RKHS embedding of signed measures

RKHS Embedding of Probability Measures

Input space : X

Feature space : H (with reproducing kernel, k)

Feature map : Φ

Φ : X → H x 7→ Φ(x) := k(·, x)

RKHS Embedding of Probability Measures

Input space : X

Feature map : Φ

Φ : X → H x 7→ Φ(x) := k(·, x)

Extension to probability measures:

P 7→ Φ(P) :=

k(·, x) dP(x)

RKHS Embeddings of Probability Measures

Input space : X

Feature map : Φ

Φ : X → H x 7→ Φ(x) := k(·, x)

P 7→ Φ(P) :=

k(·, x) dP(x)

︸︷︷︸EY∼P[Φ(Y )]=EY∼P[k(·,Y )]

RKHS Embeddings of Probability Measures

Input space : X

Feature map : Φ

Φ : X → H x 7→ Φ(x) := k(·, x)

P 7→ Φ(P) :=

k(·, x) dP(x)

Advantage: Φ(P) can distinguish P by high-order moments.

k(y , x) = c0 + c1(xy) + c2(xy)2 + · · · (ci 6= 0) e.g. k(y , x) = exy

Φ(P)(y) = c0 + c1

x dP(x)

)y + c2

x2 dP(x)

)y2 + · · ·

Applications

Two-sample problem:

Given random samples X1, . . . ,Xm and Y1, . . . ,Yn drawn i.i.d.from P and Q, respectively.

Determine: are P and Q different?

Applications

Two-sample problem:

γ(P,Q) = ‖Φ(P)− Φ(Q)‖H : distance metric between P and Q.

H0 : P = Q H0 : γ(P,Q) = 0≡

H1 : P 6= Q H1 : γ(P,Q) > 0

Test: Say H0 if γ(P,Q) < ε. Otherwise say H1.

Applications

Two-sample problem:

γ(P,Q) = ‖Φ(P)− Φ(Q)‖H : distance metric between P and Q.

H0 : P = Q H0 : γ(P,Q) = 0≡

H1 : P 6= Q H1 : γ(P,Q) > 0

Test: Say H0 if γ(P,Q) < ε. Otherwise say H1.

Other applications:

Hypothesis testing : Independence test, Goodness of fit test, etc.

Feature selection, message passing, density estimation, etc.

Characteristic Kernels

Define: k is characteristic if

P 7→

k(·, x) dP(x) is injective.

In other words,∫

k(·, x) dP(x) =

k(·, x) dQ(x) ⇔ P = Q.

When k(·, x) = e√−1〈·,x〉, Φ(P) is the characteristic function of P.

Characteristic Kernels

Define: k is characteristic if

P 7→

k(·, x) dP(x) is injective.

In other words,∫

k(·, x) dP(x) =

k(·, x) dQ(x) ⇔ P = Q.

When k(·, x) = e√−1〈·,x〉, Φ(P) is the characteristic function of P.

Not all kernels are characteristic, e.g., k(x , y) = xT y .

µP = µQ ; P = Q

When is k characteristic? [Gretton et al., 2007,Sriperumbudur et al., 2008, Fukumizu et al., 2008,Fukumizu et al., 2009, Sriperumbudur et al., 2009].

Universal Kernels Regularization approach to supervised learning

minf∈H

ℓ(f (xi ), yi ) + λΩ[f ], (1)

where λ > 0 and (xi , yi )ni=1 is the training data.

minf∈H

ℓ(f (xi ), yi ) + λΩ[f ], (1)

Representer theorem : The solution to (1) is of the form

cik(·, xi ),

where cini=1 ⊂ R are the parameters typically obtained from the

training data.

minf∈H

ℓ(f (xi ), yi ) + λΩ[f ], (1)

cik(·, xi ),

training data.

Question: Can f approximate any target function arbitrarily “well”as n → ∞?

minf∈H

ℓ(f (xi ), yi ) + λΩ[f ], (1)

cik(·, xi ),

training data.

Question: Can f approximate any target function arbitrarily “well”as n → ∞?

We need H to be “dense” in the space of target functions — k isuniversal.

Various Notions of Universality

Prior work

c-universality [Steinwart, 2001]

cc-universality [Micchelli et al., 2006]

Proposed notion: c0-universality

Characterization of c-, cc- and c0-universality : Relation to RKHSembedding of measures

Translation invariant kernels on Rd

Radial kernels on Rd

X : compact metric space

k : continuous on X × X

Target function space : C (X ), continuous functions on X

Define k to be c-universal if H is dense in C (X ) w.r.t. the uniform norm(‖f ‖u := supx∈X |f (x)|).

Sufficient conditions are obtained based on the Stone-Weierstraßtheorem. Not easy to check!

Examples: Gaussian and Laplacian kernels on any compact subset ofRd .

Sufficient conditions are obtained based on the Stone-Weierstraßtheorem. Not easy to check!

Examples: Gaussian and Laplacian kernels on any compact subset ofRd .

Issue: X is compact which excludes many interesting spaces, such as Rd .

X : Hausdorff space

Target function space : C (X )

Define k to be cc-universal if H is dense in C (X ) endowed with thetopology of compact convergence.

X : Hausdorff space

In other words, for any compact set Z ⊂ X , H|Z := f|Z : f ∈ H isdense in C (Z ) w.r.t. ‖ · ‖u.

X : Hausdorff space

Necessary and sufficient conditions are obtained, which are relatedto the injectivity of RKHS embedding of measures.

Examples: Gaussian, Laplacian and Sinc kernels on Rd .

X : Hausdorff space

Necessary and sufficient conditions are obtained, which are relatedto the injectivity of RKHS embedding of measures.

Examples: Gaussian, Laplacian and Sinc kernels on Rd .

Issue: Topology of compact convergence is weaker than the topology ofuniform convergence.

Proposed Notion: c0-universality

X : locally compact Hausdorff (LCH) space

Target function space : C0(X ), the space of bounded continuousfunctions that “vanish at infinity” (for every ǫ > 0,x ∈ X : |f (x)| ≥ ǫ is compact).

k is bounded and k(·, x) ∈ C0(X ) for all x ∈ X .

Proposed Notion: c0-universality

X : locally compact Hausdorff (LCH) space

Target function space : C0(X ), the space of bounded continuousfunctions that “vanish at infinity” (for every ǫ > 0,x ∈ X : |f (x)| ≥ ǫ is compact).

k is bounded and k(·, x) ∈ C0(X ) for all x ∈ X .

Define k to be c0-universal if H is dense in C0(X ) w.r.t. ‖ · ‖u.

Handles non-compact X and ensures uniform convergence overentire X .

Embedding Characterization of UniversalityTheorem

k is c0-universal if and only if

µ 7→

k(·, x) dµ(x), µ ∈ Mb(X ),

is injective. Mb(X ) is the space of finite signed Radon measures onX .

µ 7→

k(·, x) dµ(x), µ ∈ Mb(X ),

k is cc-universal if and only if

µ 7→

k(·, x) dµ(x), µ ∈ Mbc(X ),

is injective. Mbc(X ) = µ ∈ Mb(X ) | supp(µ) is compact.

µ 7→

k(·, x) dµ(x), µ ∈ Mb(X ),

µ 7→

k(·, x) dµ(x), µ ∈ Mbc(X ),

is injective. Mbc(X ) = µ ∈ Mb(X ) | supp(µ) is compact.

k is c-universal if and only if

µ 7→

k(·, x) dµ(x), µ ∈ Mb(X ),

is injective.

Postive Definite Characterization of Universality

Theorem

k is c0-universal (resp. c-universal) if and only if

k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mb(X )\0.

Theorem

k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mb(X )\0.

k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mbc(X )\0.

Theorem

k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mb(X )\0.

k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mbc(X )\0.

If k is c-, cc- or c0-universal, then it is strictly positive definite.

X is an LCH space: Summary

Translation Invariant Kernels on Rd

X = Rd and k(x , y) = ψ(x − y), where

ψ(x) =

e√−1xTω dΛ(ω), x ∈ Rd ,

and Λ is a non-negative finite Borel measure.

Translation Invariant Kernels on Rd

X = Rd and k(x , y) = ψ(x − y), where

ψ(x) =

e√−1xTω dΛ(ω), x ∈ Rd ,

and Λ is a non-negative finite Borel measure.

Theorem

k is c0-universal if and only if supp(Λ) = Rd .

k is c0-universal if and only if it is characteristic.

If supp(Λ) has a non-empty interior, then k is cc-universal.[Micchelli et al., 2006]

Examples

Gaussian kernel: ψ(x) = e−x2/2σ2

; Ψ(ω) = σe−σ2ω2/2; dΛ(ω) = Ψ(ω) dω.

−4 −3 −2 −1 0 1 2 3 40

Laplacian kernel: ψ(x) = e−σ|x|; Ψ(ω) =√

σσ2+ω2 .

−4 −3 −2 −1 0 1 2 3 40

(2/πσ2)1/2

Examples

B1-spline kernel: ψ(x) = (1− |x |)1[−1,1](x); Ψ(ω) = 2√

2√π

sin2(ω2)

−3 −2 −1 0 1 2 30

−8π −6π −4π −2π 0 2π 4π 6π 8π

1/2 Ψ

Sinc kernel: ψ(x) = sin(σx)x

; Ψ(ω) =√

π21[−σ,σ](ω).

−0.2

−6π −5π −4π −3π −2π −π 0 π 2π 3π 4π 5π 6π

−3 −2 −1 0 1 2 30

(π/2)1/2

Translation Invariant Kernels on Rd : Summary

Radial Kernels on Rd

k(x , y) =

[0,∞)

e−t‖x−y‖22 dν(t),

where ν is a finite non-negative Borel measure on [0,∞).

Examples: Gaussian kernel, Inverse multi-quadratic kernel,k(x , y) = (c2 + ‖x − y‖22)

−β , β > d2 , c > 0, etc.

Radial Kernels on Rd

k(x , y) =

[0,∞)

e−t‖x−y‖22 dν(t),

where ν is a finite non-negative Borel measure on [0,∞).

Examples: Gaussian kernel, Inverse multi-quadratic kernel,k(x , y) = (c2 + ‖x − y‖22)

−β , β > d2 , c > 0, etc.

TheoremThe following conditions are equivalent.

supp(ν) 6= 0.

k is c0-universal.

k is cc-universal.

k is characteristic.

k is strictly pd.

Radial Kernels on Rd : Summary

Summary

Characteristic kernel

Injective RKHS embedding of probability measures.

Applications: Hypothesis testing, feature selection, etc.

Universal kernel

Consistency of learning algorithms.

Injective RKHS embedding of finite signed Radon measures.

Clarified the relation between various notions of universality andcharacteristic kernels.

References

Fukumizu, K., Gretton, A., Sun, X., and Scholkopf, B. (2008).Kernel measures of conditional dependence.In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 489–496,Cambridge, MA. MIT Press.

Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Scholkopf, B. (2009).Characteristic kernels on groups and semigroups.In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages473–480.

Gretton, A., Borgwardt, K. M., Rasch, M., Scholkopf, B., and Smola, A. (2007).A kernel method for the two sample problem.In Scholkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 513–520. MITPress.

Micchelli, C. A., Xu, Y., and Zhang, H. (2006).Universal kernels.Journal of Machine Learning Research, 7:2651–2667.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Lanckriet, G. R. G., and Scholkopf, B. (2009).Kernel choice and classifiability for RKHS embeddings of probability distributions.In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A., editors, Advances in Neural Information Processing

Systems 22, pages 1750–1758. MIT Press.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Scholkopf, B. (2008).Injective Hilbert space embeddings of probability measures.In Servedio, R. and Zhang, T., editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122.

Steinwart, I. (2001).On the influence of the kernel on the consistency of support vector machines.Journal of Machine Learning Research, 2:67–93.

On the Relation Between Universality, Characteristic ...cc-universality [Micchelli et al., 2006] X:...

Documents

Transcript of On the Relation Between Universality, Characteristic ...cc-universality [Micchelli et al., 2006] X:...

Pope Universality Papacy W Photos

Sdg universality report_-_may_2015

Leininger2 - Culture Care Universality & Diversity

The Relative Universality of Human Rights Relative Universality of Human Rights Donnelly, Jack. Human Rights Quarterly, ... universality” of internationally recognized human rights.

On the Universality of Hawking Radiationphilsci-archive.pitt.edu/16091/1/On the Universality of...1 Introduction Universality arguments have been developed in the context of black

Universality and Individuality in a Neural Codepapers.nips.cc/paper/1894-universality-and-individuality...Universality and individuality in a neural code Elad Schneidman,1,2 Naama

Universality...Xbox Automata, languages, computability, universality, complexity, logic. 3 Universality Q. Which one of the following does not belong? Cray Dell PC iMac Espresso maker

Universality, Ethics and International Relations

Universality and RG Explanations - PhilSci-Archivephilsci-archive.pitt.edu/13460/1/Universality-rg.pdf · 2017-09-20 · Universality and RG Explanations Robert W. Batterman Department

Universality in Focus

2000 - Contingency, Hegemony, Universality

Universality of Vedanta

Disproportionality & Preschool Universality & Fairness

Universality and RG Explanations - University of Pittsburghrbatterm/Universality-rg-preprint.pdf · Universality and RG Explanations Robert W. Batterman Department of Philosophy University

UNIvERSAlITy AND PARTICUlARITy OF HUMAN RIGHTS: A ...shapesea.com/.../03/01-Universality-and-Particularity-of-human-rights... · 01.03.2018 · Universality vs. Particularity of

Universality of TMD distribution functions

Class 39: Universality

Universality of Religion - Hopkins

100129 Donnelly Universality

Claes Ryn - Universality and History