Transcript of Lecture 3: Approximate Kernel Methods
Source: mlss.tuebingen.mpg.de/2017/speaker_slides/Bharath3.pdf (48 pages)

Page 1:

Lecture 3

Approximate Kernel Methods

Bharath K. Sriperumbudur

Department of Statistics, Pennsylvania State University

Machine Learning Summer School

Tübingen, 2017

Page 2:

Outline

- Motivating example
- Ridge regression
- Approximation methods
- Kernel PCA
- Computational vs. statistical trade-off (joint work with Nicholas Sterge, Pennsylvania State University)

Pages 3–6:

Motivating Example: Ridge regression

- Given: {(x_i, y_i)}_{i=1}^n where x_i ∈ R^d, y_i ∈ R
- Task: Find a linear regressor f = ⟨w, ·⟩_2 such that f(x_i) ≈ y_i:

  \min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \big( \langle w, x_i \rangle_2 - y_i \big)^2 + \lambda \|w\|_2^2 \qquad (\lambda > 0)

- Solution: For X := (x_1, ..., x_n) ∈ R^{d×n} and y := (y_1, ..., y_n)^⊤ ∈ R^n,

  w = \frac{1}{n} \Big( \frac{1}{n} XX^\top + \lambda I_d \Big)^{-1} X y \qquad \text{(primal)}

- Easy: \big( \frac{1}{n} XX^\top + \lambda I_d \big) X = X \big( \frac{1}{n} X^\top X + \lambda I_n \big), hence

  w = \frac{1}{n} X \Big( \frac{1}{n} X^\top X + \lambda I_n \Big)^{-1} y \qquad \text{(dual)}
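A quick numerical check of the primal/dual identity on this slide (a minimal NumPy sketch; the sizes and the regularizer value are illustrative choices, not from the lecture):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 50, 5, 0.1
    X = rng.standard_normal((d, n))       # columns are the samples x_i
    y = rng.standard_normal(n)

    # Primal: w = (1/n) ((1/n) X X^T + lam I_d)^{-1} X y  -- a d x d solve
    w_primal = np.linalg.solve(X @ X.T / n + lam * np.eye(d), X @ y) / n

    # Dual:   w = (1/n) X ((1/n) X^T X + lam I_n)^{-1} y  -- an n x n solve
    w_dual = X @ np.linalg.solve(X.T @ X / n + lam * np.eye(n), y) / n

    assert np.allclose(w_primal, w_dual)  # both forms give the same w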


Pages 7–8:

Ridge regression

- Prediction: Given t ∈ R^d,

  f(t) = \langle w, t \rangle_2 = y^\top X^\top \big( XX^\top + n\lambda I_d \big)^{-1} t = y^\top \big( X^\top X + n\lambda I_n \big)^{-1} X^\top t

- What does X^⊤X look like?

  X^\top X = \begin{pmatrix}
    \langle x_1, x_1 \rangle_2 & \langle x_1, x_2 \rangle_2 & \cdots & \langle x_1, x_n \rangle_2 \\
    \langle x_2, x_1 \rangle_2 & \langle x_2, x_2 \rangle_2 & \cdots & \langle x_2, x_n \rangle_2 \\
    \vdots & & \ddots & \vdots \\
    \langle x_n, x_1 \rangle_2 & \langle x_n, x_2 \rangle_2 & \cdots & \langle x_n, x_n \rangle_2
  \end{pmatrix}

  Matrix of inner products: the Gram matrix.
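Continuing the sketch above, the prediction can be computed purely from inner products via the dual form (again illustrative):

    t = rng.standard_normal(d)
    G = X.T @ X                                # Gram matrix of inner products
    f_dual = y @ np.linalg.solve(G + n * lam * np.eye(n), X.T @ t)
    assert np.allclose(f_dual, w_primal @ t)   # agrees with <w, t>_2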


Pages 9–11:

Kernel Ridge regression: Feature Map and Kernel Trick

- Given: {(x_i, y_i)}_{i=1}^n where x_i ∈ X, y_i ∈ R
- Task: Find a regressor f ∈ H (some feature space) such that f(x_i) ≈ y_i.
- Idea: Map x_i to Φ(x_i) and do linear regression,

  \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \big( \langle f, \Phi(x_i) \rangle_{\mathcal{H}} - y_i \big)^2 + \lambda \|f\|_{\mathcal{H}}^2 \qquad (\lambda > 0)

- Solution: For Φ(X) := (Φ(x_1), ..., Φ(x_n)) ∈ R^{dim(H)×n} and y := (y_1, ..., y_n)^⊤ ∈ R^n,

  f = \frac{1}{n} \Big( \frac{1}{n} \Phi(X)\Phi(X)^\top + \lambda I_{\dim(\mathcal{H})} \Big)^{-1} \Phi(X) y \qquad \text{(primal)}

    = \frac{1}{n} \Phi(X) \Big( \frac{1}{n} \Phi(X)^\top \Phi(X) + \lambda I_n \Big)^{-1} y \qquad \text{(dual)}


Pages 12–13:

Kernel Ridge regression: Feature Map and Kernel Trick

- Prediction: Given t ∈ X,

  f(t) = \langle f, \Phi(t) \rangle_{\mathcal{H}}
       = \frac{1}{n} y^\top \Phi(X)^\top \Big( \frac{1}{n} \Phi(X)\Phi(X)^\top + \lambda I_{\dim(\mathcal{H})} \Big)^{-1} \Phi(t)
       = \frac{1}{n} y^\top \Big( \frac{1}{n} \Phi(X)^\top \Phi(X) + \lambda I_n \Big)^{-1} \Phi(X)^\top \Phi(t)

As before,

  \Phi(X)^\top \Phi(X) = \big[ \langle \Phi(x_i), \Phi(x_j) \rangle_{\mathcal{H}} \big]_{i,j=1}^{n}, \qquad k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle_{\mathcal{H}},

and

  \Phi(X)^\top \Phi(t) = \big[ \langle \Phi(x_1), \Phi(t) \rangle_{\mathcal{H}}, \ldots, \langle \Phi(x_n), \Phi(t) \rangle_{\mathcal{H}} \big]^\top.


Page 14:

Remarks

- The primal formulation requires knowledge of the feature map Φ (and of course H), and these could be infinite dimensional.
- The dual formulation is entirely determined by kernel evaluations: the Gram matrix and (k(x_i, t))_i. But poor scalability: O(n^3) (see the sketch below).
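For concreteness, a minimal kernel ridge regression sketch in the dual form; the Gaussian kernel and all parameter values are illustrative assumptions, not prescribed by the slides:

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        # k(x, y) = exp(-||x - y||_2^2 / (2 sigma^2)) for all pairs of rows
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    rng = np.random.default_rng(1)
    n, d, lam = 100, 3, 0.1
    Xs = rng.standard_normal((n, d))                  # rows are samples here
    y = np.sin(Xs[:, 0]) + 0.1 * rng.standard_normal(n)

    K = gaussian_kernel(Xs, Xs)                       # n x n Gram matrix
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # the O(n^3) step

    t = rng.standard_normal((1, d))
    f_t = gaussian_kernel(t, Xs) @ alpha              # f(t) = sum_i alpha_i k(x_i, t)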

Page 15:

Approximation Schemes

- Incomplete Cholesky factorization (e.g., Fine and Scheinberg, JMLR 2001)
- Sketching (Yang et al., 2015)
- Sparse greedy approximation (Smola and Schölkopf, ICML 2000)
- Nyström method (e.g., Williams and Seeger, NIPS 2001)
- Random Fourier features (e.g., Rahimi and Recht, NIPS 2008), ...

Pages 16–17:

Random Fourier Approximation

- X = R^d; let k be continuous and translation invariant, i.e., k(x, y) = ψ(x − y).
- Bochner's theorem:

  k(x, y) = \int_{\mathbb{R}^d} e^{\sqrt{-1}\,\langle \omega, x - y \rangle_2} \, d\Lambda(\omega),

  where Λ is a finite non-negative Borel measure on R^d.
- k is symmetric and therefore Λ is a "symmetric" measure on R^d.
- Therefore

  k(x, y) = \int_{\mathbb{R}^d} \cos(\langle \omega, x - y \rangle_2) \, d\Lambda(\omega).
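To spell out the step from the complex exponential to the cosine (a short derivation filling in what the slide leaves implicit): by Euler's formula,

  \int_{\mathbb{R}^d} e^{\sqrt{-1}\,\langle \omega, x - y \rangle_2} \, d\Lambda(\omega)
  = \int_{\mathbb{R}^d} \cos(\langle \omega, x - y \rangle_2) \, d\Lambda(\omega)
  + \sqrt{-1} \int_{\mathbb{R}^d} \sin(\langle \omega, x - y \rangle_2) \, d\Lambda(\omega),

and the sine integral vanishes because the integrand is odd in ω while Λ is symmetric under ω ↦ −ω.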


Page 18:

Random Feature Approximation

(Rahimi and Recht, NIPS 2008): Draw (ω_j)_{j=1}^m i.i.d. ∼ Λ and set

  k_m(x, y) = \frac{1}{m} \sum_{j=1}^m \cos(\langle \omega_j, x - y \rangle_2) = \langle \Phi_m(x), \Phi_m(y) \rangle_{\mathbb{R}^{2m}},

where

  \Phi_m(x) = \frac{1}{\sqrt{m}} \big( \cos(\langle \omega_1, x \rangle_2), \ldots, \cos(\langle \omega_m, x \rangle_2), \sin(\langle \omega_1, x \rangle_2), \ldots, \sin(\langle \omega_m, x \rangle_2) \big)^\top

(the factor 1/√m makes the inner-product identity above exact).

- Use the feature map idea with Φ_m, i.e., x ↦ Φ_m(x), and apply your favorite linear method (see the sketch below).
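A minimal sketch of this feature map for the Gaussian kernel k(x, y) = exp(−‖x − y‖₂²/(2σ²)), for which Λ = N(0, σ^{-2} I_d); the kernel choice and all sizes are illustrative assumptions:

    import numpy as np

    def rff_features(X, omegas):
        # Phi_m(x) = (1/sqrt(m)) [cos(<w_j, x>), sin(<w_j, x>)]_{j=1}^m
        m = omegas.shape[0]
        proj = X @ omegas.T                        # n x m matrix of <w_j, x_i>
        return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

    rng = np.random.default_rng(0)
    n, d, m, sigma = 200, 5, 2000, 1.0
    X = rng.standard_normal((n, d))
    omegas = rng.standard_normal((m, d)) / sigma   # draws from Lambda = N(0, sigma^{-2} I_d)

    Phi = rff_features(X, omegas)                  # n x 2m
    K_approx = Phi @ Phi.T                         # k_m(x_i, x_j)

    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K_exact = np.exp(-sq / (2 * sigma**2))         # exact Gaussian kernel
    print(np.abs(K_approx - K_exact).max())        # small for large m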

Page 19:

How good is the approximation?

(S and Szabó, NIPS 2015):

  \sup_{x, y \in S} | k_m(x, y) - k(x, y) | = O_{a.s.}\Bigg( \sqrt{\frac{\log |S|}{m}} \Bigg)

Optimal convergence rate.

- Other results are known, but they are non-optimal (Rahimi and Recht, NIPS 2008; Sutherland and Schneider, UAI 2015).
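Continuing the Gaussian-kernel sketch after page 18, the O(1/√m) decay can be observed empirically (an illustration, not a proof):

    for m in [100, 400, 1600, 6400]:
        omegas = rng.standard_normal((m, d)) / sigma
        Phi = rff_features(X, omegas)
        print(m, np.abs(Phi @ Phi.T - K_exact).max())  # roughly halves as m quadruples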

Pages 20–22:

Ridge Regression: Random Feature Approximation

- Given: {(x_i, y_i)}_{i=1}^n where x_i ∈ X, y_i ∈ R
- Task: Find a regressor w ∈ R^{2m} such that f(x_i) ≈ y_i.
- Idea: Map x_i to Φ_m(x_i) and do linear regression,

  \min_{w \in \mathbb{R}^{2m}} \frac{1}{n} \sum_{i=1}^n \big( \langle w, \Phi_m(x_i) \rangle_{\mathbb{R}^{2m}} - y_i \big)^2 + \lambda \|w\|_{\mathbb{R}^{2m}}^2 \qquad (\lambda > 0)

- Solution: For Φ_m(X) := (Φ_m(x_1), ..., Φ_m(x_n)) ∈ R^{2m×n} and y := (y_1, ..., y_n)^⊤ ∈ R^n,

  w = \frac{1}{n} \Big( \frac{1}{n} \Phi_m(X)\Phi_m(X)^\top + \lambda I_{2m} \Big)^{-1} \Phi_m(X) y \qquad \text{(primal)}

    = \frac{1}{n} \Phi_m(X) \Big( \frac{1}{n} \Phi_m(X)^\top \Phi_m(X) + \lambda I_n \Big)^{-1} y \qquad \text{(dual)}

Computation: O(m^2 n).
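A sketch of the primal solve in the 2m-dimensional feature space, reusing rff_features from the page-18 sketch (data and parameter values are illustrative):

    rng2 = np.random.default_rng(2)
    n, d, m, lam, sigma = 1000, 5, 200, 0.1, 1.0
    X = rng2.standard_normal((n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng2.standard_normal(n)

    omegas = rng2.standard_normal((m, d)) / sigma
    Z = rff_features(X, omegas)            # n x 2m; row i is Phi_m(x_i)^T

    # Primal: a (2m x 2m) system; forming Z^T Z costs O(m^2 n)
    A = Z.T @ Z / n + lam * np.eye(2 * m)
    w = np.linalg.solve(A, Z.T @ y / n)
    f_train = Z @ w                        # fitted values <w, Phi_m(x_i)>_2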


Page 23:

What happens statistically?

- (Rudi and Rosasco, 2016): If m ≥ n^α, where 1/2 ≤ α < 1 with α depending on the properties of the unknown true regressor f^*, then

  \mathcal{R}_{L,P}(f_{m,n}) - \mathcal{R}_{L,P}(f^*)

  achieves the minimax optimal rate obtained in the case with no approximation. Here L is the squared loss.

Computational gain with no statistical loss!!

Pages 24–26:

Principal Component Analysis (PCA)

- Let X ∼ P be a random variable in R^d with mean μ and covariance matrix Σ.
- Find a direction w ∈ {v : ‖v‖_2 = 1} such that

  \mathrm{Var}[\langle w, X \rangle_2]

  is maximized.
- Find a direction w ∈ R^d such that

  \mathbb{E} \big\| (X - \mu) - \langle w, X - \mu \rangle_2 \, w \big\|_2^2

  is minimized.
- The formulations are equivalent, and the solution is the eigenvector of the covariance matrix Σ corresponding to the largest eigenvalue.
- Can be generalized to multiple directions (find a subspace...).
- Applications: dimensionality reduction.
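A minimal empirical PCA sketch (the sizes and the anisotropy are illustrative): the top eigenvector of the sample covariance is the variance-maximizing unit direction.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 500, 4
    X = rng.standard_normal((n, d)) * np.array([3.0, 1.0, 0.5, 0.1])  # anisotropic data

    Xc = X - X.mean(axis=0)                 # center the sample
    Sigma = Xc.T @ Xc / n                   # sample covariance, d x d
    evals, evecs = np.linalg.eigh(Sigma)    # eigenvalues in ascending order
    w = evecs[:, -1]                        # top principal direction, ||w||_2 = 1

    # Var[<w, X>_2] equals the top eigenvalue
    print(evals[-1], (Xc @ w).var())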


Pages 27–30:

Kernel PCA

- Nonlinear generalization of PCA (Schölkopf et al., 1998).
- X ↦ Φ(X) and apply PCA.
- Provides a low-dimensional manifold (curves) in R^d to efficiently represent the data.
- Function space view: Find f ∈ {g ∈ H : ‖g‖_H = 1} such that Var[f(X)] is maximized, i.e.,

  f^* = \arg\sup_{\|f\|_{\mathcal{H}} = 1} \; \mathbb{E}[f^2(X)] - \mathbb{E}^2[f(X)]

- Using the reproducing property f(X) = ⟨f, k(·, X)⟩_H, we obtain

  \mathbb{E}[f^2(X)] = \mathbb{E}\langle f, k(\cdot, X) \rangle_{\mathcal{H}}^2
                    = \mathbb{E}\langle f, (k(\cdot, X) \otimes_{\mathcal{H}} k(\cdot, X)) f \rangle_{\mathcal{H}}
                    = \langle f, \mathbb{E}[k(\cdot, X) \otimes_{\mathcal{H}} k(\cdot, X)] f \rangle_{\mathcal{H}},

  assuming ∫_X k(x, x) dP(x) < ∞.
- E[f(X)] = ⟨f, μ_P⟩_H, where μ_P := ∫_X k(·, x) dP(x).


Pages 31–32:

Kernel PCA

  f^* = \arg\sup_{\|f\|_{\mathcal{H}} = 1} \langle f, \Sigma f \rangle_{\mathcal{H}},

where

  \Sigma := \int_{\mathcal{X}} k(\cdot, x) \otimes_{\mathcal{H}} k(\cdot, x) \, dP(x) - \mu_P \otimes_{\mathcal{H}} \mu_P

is the covariance operator (symmetric, positive and Hilbert–Schmidt) on H.

- Spectral theorem:

  \Sigma = \sum_{i \in I} \lambda_i \, \phi_i \otimes_{\mathcal{H}} \phi_i,

  where I is either countable (λ_i → 0 as i → ∞) or finite.
- As in PCA, the solution is the eigenfunction corresponding to the largest eigenvalue of Σ.


Pages 33–34:

Empirical Kernel PCA

In practice, P is unknown, but we have access to (X_i)_{i=1}^n i.i.d. ∼ P.

  \hat{f}^* = \arg\sup_{\|f\|_{\mathcal{H}} = 1} \langle f, \hat\Sigma f \rangle_{\mathcal{H}},

where

  \hat\Sigma := \frac{1}{n} \sum_{i=1}^n k(\cdot, X_i) \otimes_{\mathcal{H}} k(\cdot, X_i) - \mu_n \otimes_{\mathcal{H}} \mu_n

is the empirical covariance operator (symmetric, positive and Hilbert–Schmidt) on H, and

  \mu_n := \frac{1}{n} \sum_{i=1}^n k(\cdot, X_i).

- \hat\Sigma is an infinite-dimensional operator but with finite rank (what is its rank?)
- Spectral theorem:

  \hat\Sigma = \sum_{i=1}^n \hat\lambda_i \, \hat\phi_i \otimes_{\mathcal{H}} \hat\phi_i.

- \hat{f}^* is the eigenfunction corresponding to the largest eigenvalue of \hat\Sigma.


Pages 35–36:

Empirical Kernel PCA

- Since \hat\Sigma is an infinite-dimensional operator, we are solving an infinite-dimensional eigensystem,

  \hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i.

- How do we find \hat\phi_i in practice?
- Approach 1: (Schölkopf et al., 1998)

  \hat{f}^* = \arg\sup_{\|f\|_{\mathcal{H}} = 1} \frac{1}{n} \sum_{i=1}^n f^2(X_i) - \Bigg( \frac{1}{n} \sum_{i=1}^n f(X_i) \Bigg)^2

            = \arg\inf_{f \in \mathcal{H}} \; -\frac{1}{n} \sum_{i=1}^n f^2(X_i) + \Bigg( \frac{1}{n} \sum_{i=1}^n f(X_i) \Bigg)^2 + \lambda \|f\|_{\mathcal{H}}^2

  for some λ > 0. Use the representer theorem, which says there exist (α_i)_{i=1}^n ⊂ R such that

  \hat{f}^* = \sum_{i=1}^n \alpha_i k(\cdot, X_i).

  Solving for (α_i)_{i=1}^n yields an n × n eigensystem involving the Gram matrix (see the sketch below). (Exercise)
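A sketch of Approach 1 in code: the n × n eigensystem of the (doubly) centered Gram matrix; the Gaussian kernel and the sizes are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma = 200, 2, 1.0
    X = rng.standard_normal((n, d))

    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma**2))            # Gram matrix [k(X_i, X_j)]

    H = np.eye(n) - np.ones((n, n)) / n         # centering matrix H_n
    Kc = H @ K @ H / n                          # same nonzero spectrum as Sigma_hat
    evals, A = np.linalg.eigh(Kc)               # the n x n eigensystem

    # Top kernel principal component: f = sum_i alpha_i k(., X_i), with alpha
    # scaled so that ||f||_H^2 = alpha^T K alpha = 1 (one common normalization).
    alpha = A[:, -1]
    alpha = alpha / np.sqrt(alpha @ K @ alpha)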


Pages 37–39:

Empirical Kernel PCA

Approach 2:

- Sampling operator: S : H → R^n, f ↦ (1/√n)(f(X_1), ..., f(X_n)).
- Reconstruction operator: S^* : R^n → H, α ↦ (1/√n) ∑_{i=1}^n α_i k(·, X_i). (Smale and Zhou, 2007)
- S^* is the adjoint (transpose) of S, and they satisfy

  \langle f, S^*\alpha \rangle_{\mathcal{H}} = \langle Sf, \alpha \rangle_{\mathbb{R}^n}, \qquad \forall\, f \in \mathcal{H},\ \alpha \in \mathbb{R}^n.

- \hat\phi_i = \frac{1}{\hat\lambda_i} S^* H_n \hat\alpha_i, where H_n := I_n - \frac{1}{n} 1_n 1_n^\top and the \hat\alpha_i are the eigenvectors of K H_n with \hat\lambda_i the corresponding eigenvalues.
- It can be shown that \hat\Sigma = S^* H_n S.
- \hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i \;\Rightarrow\; S^* H_n S \hat\phi_i = \hat\lambda_i \hat\phi_i \;\Rightarrow\; S S^* H_n S \hat\phi_i = \hat\lambda_i S \hat\phi_i
- Set \hat\alpha_i := S \hat\phi_i. It can be shown that K = S S^*.
- \hat\alpha_i = S \hat\phi_i \;\Rightarrow\; S^* H_n \hat\alpha_i = S^* H_n S \hat\phi_i = \hat\Sigma \hat\phi_i = \hat\lambda_i \hat\phi_i.
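A small numerical check of this reduction, using an explicit finite-dimensional feature map so that Σ̂, S and S* are plain matrices (the quadratic kernel is an illustrative assumption): the nonzero eigenvalues of Σ̂ = S*H_nS coincide with those of KH_n, where K = SS*.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 3
    X = rng.standard_normal((n, d))

    # Explicit feature map Phi(x) = vec(x x^T) for the kernel k(x, y) = <x, y>^2
    Z = np.stack([np.outer(x, x).ravel() for x in X])   # n x d^2, row i is Phi(X_i)
    S = Z / np.sqrt(n)                                  # sampling operator as a matrix
    H = np.eye(n) - np.ones((n, n)) / n                 # centering matrix H_n

    Sigma_hat = S.T @ H @ S                             # S* H_n S  (d^2 x d^2)
    K = S @ S.T                                         # K = S S*  (n x n)

    e1 = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]   # all d^2 eigenvalues
    e2 = np.sort(np.linalg.eigvals(K @ H).real)[::-1]   # eigenvalues of K H_n
    print(np.allclose(e1, e2[:len(e1)], atol=1e-8))     # nonzero spectra agree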


Pages 40–41:

Empirical Kernel PCA: Random Features

- Empirical kernel PCA solves an n × n linear system: complexity is O(n^3).
- Use the idea of random features: X ↦ Φ_m(X) and apply PCA,

  \hat{w}^* := \arg\sup_{\|w\|=1} \widehat{\mathrm{Var}}[\langle w, \Phi_m(X) \rangle_2] = \arg\sup_{\|w\|=1} \langle w, \hat\Sigma_m w \rangle_2,

  where

  \hat\Sigma_m := \frac{1}{n} \sum_{i=1}^n \Phi_m(X_i) \Phi_m^\top(X_i) - \Bigg( \frac{1}{n} \sum_{i=1}^n \Phi_m(X_i) \Bigg) \Bigg( \frac{1}{n} \sum_{i=1}^n \Phi_m(X_i) \Bigg)^{\!\top}

  is a covariance matrix of size 2m × 2m.
- Eigendecomposition: \hat\Sigma_m = \sum_{i=1}^{2m} \hat\lambda_{i,m} \hat\phi_{i,m} \hat\phi_{i,m}^\top
- \hat{w}^* is obtained by solving a 2m × 2m eigensystem: complexity is O(m^3).

What happens statistically?
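A self-contained sketch of PCA on random features (Gaussian-kernel features as in the page-18 sketch; all sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, m, sigma = 500, 3, 100, 1.0
    X = rng.standard_normal((n, d))

    proj = X @ (rng.standard_normal((m, d)) / sigma).T          # <w_j, X_i>
    Z = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)    # rows Phi_m(X_i)^T

    Zc = Z - Z.mean(axis=0)                      # center the features
    Sigma_m = Zc.T @ Zc / n                      # 2m x 2m covariance matrix
    evals_m, evecs_m = np.linalg.eigh(Sigma_m)   # the O(m^3) eigensolve
    w_star = evecs_m[:, -1]                      # top principal direction in R^{2m}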


Page 42:

Empirical Kernel PCA: Random Features

Two notions:

- Reconstruction error (what does this depend on?)
- Convergence of eigenvectors (more generally, eigenspaces)

Page 43:

Eigenspace Convergence

What do we have?

- Eigenvectors after approximation, (\hat\phi_{i,m})_{i=1}^{2m}: these form a subspace in R^{2m}.
- Population eigenfunctions (φ_i)_{i∈I} of Σ: these form a subspace in H.
- How do we compare them?
- We embed them in a common space before comparing. The common space is L^2(P).

Pages 44–46:

Eigenspace Convergence

- I : H → L^2(P), f ↦ f − ∫_X f(x) dP(x)
- U : R^{2m} → L^2(P), α ↦ ∑_{i=1}^{2m} α_i (Φ_{m,i} − ∫_X Φ_{m,i}(x) dP(x)), where

  \Phi_m = \frac{1}{\sqrt{m}} \big( \cos(\langle \omega_1, \cdot \rangle_2), \ldots, \cos(\langle \omega_m, \cdot \rangle_2), \sin(\langle \omega_1, \cdot \rangle_2), \ldots, \sin(\langle \omega_m, \cdot \rangle_2) \big)^\top.

Main results: (S and Sterge, 2017)

- \sum_{i} \| I \hat\phi_i - I \phi_i \|_{L^2(P)} = O_p\big( \tfrac{1}{\sqrt{n}} \big).

  A similar result holds in ‖·‖_H (Zwald and Blanchard, NIPS 2005).

- \sum_{i} \| U \hat\phi_{i,m} - I \phi_i \|_{L^2(P)} = O_p\big( \tfrac{1}{\sqrt{m}} + \tfrac{1}{\sqrt{n}} \big).

- Random features are not helpful from the point of view of eigenvector convergence (this also holds for eigenspaces).
- They may be useful for reconstruction-error convergence (work in progress).


Page 47:

References I

Fine, S. and Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264.

Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc.

Rudi, A. and Rosasco, L. (2016). Generalization properties of learning with random features. https://arxiv.org/pdf/1602.04474.pdf.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319.

Smale, S. and Zhou, D.-X. (2007). Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172.

Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conference on Machine Learning, pages 911–918. Morgan Kaufmann, San Francisco, CA.

Sriperumbudur, B. K. and Sterge, N. (2017). Statistical consistency of kernel PCA with random features. arXiv:1706.06296.

Sriperumbudur, B. K. and Szabó, Z. (2015). Optimal rates for random Fourier features. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1144–1152. Curran Associates, Inc.

Sutherland, D. and Schneider, J. (2015). On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 862–871.

Page 48:

References II

Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, MA. MIT Press.

Yang, Y., Pilanci, M., and Wainwright, M. J. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. Technical report. https://arxiv.org/pdf/1501.06195.pdf.

Zwald, L. and Blanchard, G. (2006). On the convergence of eigenspaces in kernel principal component analysis. In Weiss, Y., Schölkopf, B., and Platt, J. C., editors, Advances in Neural Information Processing Systems 18, pages 1649–1656. MIT Press.