Fast Kernel Methods (lsong/teaching/8803ML/lecture21.pdf)

Transcript of lecture slides

Page 1:

Fast Kernel Methods

Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Le Song

Page 2:

Kernel low rank approximation

Incomplete Cholesky factorization of the kernel matrix K of size n × n to R of size d × n, with d ≪ n

𝑓 π‘₯ | π‘₯𝑖 , 𝑦𝑖 𝑖=1𝑛 ~𝐺𝑃 π‘šπ‘π‘œπ‘ π‘‘ π‘₯ , π‘˜π‘π‘œπ‘ π‘‘ π‘₯, π‘₯

β€²

π‘šπ‘π‘œπ‘ π‘‘ π‘₯ = 𝑅π‘₯⊀ π‘…π‘…βŠ€ + πœŽπ‘›π‘œπ‘–π‘ π‘’

2 πΌβˆ’1π‘…π‘ŒβŠ€

π‘˜π‘π‘œπ‘ π‘‘ π‘₯, π‘₯β€² = 𝑅π‘₯π‘₯ βˆ’ 𝑅π‘₯

⊀ π‘…π‘…βŠ€ + πœŽπ‘›π‘œπ‘–π‘ π‘’2 𝐼

βˆ’1(π‘…π‘…βŠ€)𝑅π‘₯

2

K ≈ R^⊤ R   (figure: the n × n matrix K approximated by the product of an n × d factor R^⊤ and a d × n factor R)
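A minimal numpy sketch of these posterior formulas, assuming we are given the low-rank factor R (d × n), a vector r_x of the same low-rank representation for the test point x (so k(x, x_i) ≈ r_x^⊤ R[:, i]), the prior variance k(x, x), and the targets y; the function name and arguments are illustrative:

```python
import numpy as np

def gp_posterior_low_rank(R, y, r_x, k_xx, sigma_noise):
    """Posterior mean/variance of a GP using a low-rank kernel factor.

    Assumes K ~= R.T @ R (R is d x n, e.g. from incomplete Cholesky), so all
    linear algebra uses only the small d x d matrix R R^T + sigma^2 I.
    """
    d = R.shape[0]
    A = R @ R.T + sigma_noise**2 * np.eye(d)          # d x d
    # m_post(x) = r_x^T (R R^T + sigma^2 I)^{-1} R y
    mean = r_x @ np.linalg.solve(A, R @ y)
    # k_post(x, x) = k(x, x) - r_x^T (R R^T + sigma^2 I)^{-1} (R R^T) r_x
    var = k_xx - r_x @ np.linalg.solve(A, (R @ R.T) @ r_x)
    return mean, var

# Toy usage with a random low-rank factor (illustrative only).
rng = np.random.default_rng(0)
n, d = 50, 5
R = rng.normal(size=(d, n))
y = rng.normal(size=n)
r_x, k_xx = R[:, 0], float(R[:, 0] @ R[:, 0])
print(gp_posterior_low_rank(R, y, r_x, k_xx, sigma_noise=0.1))
```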

Page 3:

Incomplete Cholesky Decomposition

We have a few things to understand

Gram-Schmidt orthogonalization

Given a set of vectors V = {v_1, v_2, …, v_n}, find an orthonormal basis Q = {u_1, u_2, …, u_n} with u_i^⊤ u_j = 0 for i ≠ j and u_i^⊤ u_i = 1

QR decomposition

Given the orthonormal basis Q, compute the projection of V onto Q: v_i = Σ_j r_ji u_j, where R = (r_ji)

V = QR

Cholesky decomposition with pivots

V ≈ Q(:, 1:k) R(1:k, :)

Kernelization

π‘‰βŠ€π‘‰ = π‘…βŠ€π‘„βŠ€π‘„π‘… = π‘…βŠ€π‘… β‰ˆ 𝑅 1: π‘˜, ∢ ⊀ 𝑅 1: π‘˜, ∢

𝐾 = Φ⊀Φ β‰ˆ 𝑅 1: π‘˜, ∢ ⊀ 𝑅 1: π‘˜, ∢


Page 4:

Gram-Schmidt orthogonalization

Given a set of vectors V = {v_1, v_2, …, v_n}, find an orthonormal basis Q = {u_1, u_2, …, u_n} with u_i^⊤ u_j = 0 for i ≠ j and u_i^⊤ u_i = 1

u_1 can be found by picking an arbitrary v_1 and normalizing it

u_1 = v_1 / ‖v_1‖

u_2 can be found by picking a vector v_2, subtracting out its component along u_1, and then normalizing

a_2 = v_2 − ⟨v_2, u_1⟩ u_1

u_2 = a_2 / ‖a_2‖

In general, a_i = v_i − Σ_{j=1}^{i−1} ⟨v_i, u_j⟩ u_j


[Figure: v_2 is split into its projection onto u_1 and the orthogonal residual a_2; normalizing v_1 gives u_1 and normalizing a_2 gives u_2]

Page 5:

QR decomposition

Essentially Gram-Schmidt orthogonalization, but keeping both the orthonormal basis and the weights of the projections

Given a set of vectors V = {v_1, v_2, …, v_n}, find an orthonormal basis Q = {u_1, u_2, …, u_n} using Gram-Schmidt orthogonalization

The projection of v_i onto basis vector u_j is r_ji = ⟨v_i, u_j⟩

v_1 = u_1 ⟨u_1, v_1⟩

v_2 = u_1 ⟨u_1, v_2⟩ + u_2 ⟨u_2, v_2⟩

v_3 = u_1 ⟨u_1, v_3⟩ + u_2 ⟨u_2, v_3⟩ + u_3 ⟨u_3, v_3⟩

…

v_i = Σ_{j=1}^{i} ⟨v_i, u_j⟩ u_j


Page 6:

QR decomposition

Because the original data points are used in order to form the basis vectors, the representation of v_i in the basis (the i-th column of R) has only i nonzero components

v_i = Σ_{j=1}^{i} ⟨v_i, u_j⟩ u_j = Σ_{j=1}^{i} r_ji u_j

Collecting the terms into matrix form:


V = (v_1, …, v_n), v_i ∈ R^d

V = QR, with Q = (u_1, …, u_d) and R = (r_{:1}, …, r_{:n}) upper triangular (zeros below the diagonal)

Page 7:

QR decomposition with pivots

QR decomposition

If we only choose a few basis vectors, then V = QR becomes an approximation

The basis vectors are formed from the original data points

How do we order/choose among the original data points so that the approximation error is small?

Ordering/choosing among the data points = choosing pivots

[Figure: V = QR with Q = (u_1, …, u_d) and R = (r_{:1}, …, r_{:n}); keeping only the first k basis vectors gives V ≈ Q(:, 1:k) R(1:k, :)]
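A hedged sketch of QR with pivots: the pivoting rule used here (greedily pick the remaining column with the largest residual norm) is one common choice, not necessarily the exact rule from the lecture:

```python
import numpy as np

def qr_with_pivots(V, k):
    """Greedy pivoted QR: pick the column with the largest residual norm as
    the next pivot, orthogonalize, and keep only k basis vectors.

    Returns Q (d x k), R (k x n) and the pivot indices, so V ~= Q @ R.
    """
    d, n = V.shape
    Q, R = np.zeros((d, k)), np.zeros((k, n))
    residual = V.copy()
    pivots = []
    for i in range(k):
        p = int(np.argmax(np.linalg.norm(residual, axis=0)))  # choose pivot
        pivots.append(p)
        u = residual[:, p] / np.linalg.norm(residual[:, p])
        Q[:, i] = u
        R[i, :] = u @ residual              # projections onto the new basis
        residual -= np.outer(u, R[i, :])    # remove the explained part
    return Q, R, pivots

V = np.random.default_rng(1).normal(size=(6, 10))
Q, R, piv = qr_with_pivots(V, k=4)
print(np.linalg.norm(V - Q @ R))            # approximation error
```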

Page 8:

Cholesky decomposition

If K is a symmetric positive definite matrix, then K can be decomposed as

K = R^⊤ R

Since K is a kernel matrix, we can find an implicit feature space

K = Φ^⊤ Φ, where Φ = (φ(x_1), …, φ(x_n))

QR decomposition on Φ: Φ = QR

K = R^⊤ Q^⊤ Q R = R^⊤ R

Incomplete Cholesky decomposition

Use QR decomposition with pivots

K ≈ R(1:d, :)^⊤ R(1:d, :)


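A minimal sketch of pivoted incomplete Cholesky applied directly to a kernel matrix, producing R(1:d, :) with K ≈ R^⊤ R; in practice the required rows K[i, :] would be computed from the kernel function on demand rather than from a stored n × n matrix:

```python
import numpy as np

def incomplete_cholesky(K, d):
    """Pivoted incomplete Cholesky of a kernel matrix K (n x n).

    Each step picks the pivot with the largest remaining diagonal element
    (the residual variance) and fills in one row of R.
    """
    n = K.shape[0]
    R = np.zeros((d, n))
    diag = np.diag(K).astype(float)                  # residual diagonal
    for j in range(d):
        i = int(np.argmax(diag))                     # pivot selection
        R[j, i] = np.sqrt(diag[i])
        R[j, :] = (K[i, :] - R[:j, :].T @ R[:j, i]) / R[j, i]
        diag -= R[j, :] ** 2                         # update residuals
    return R

# Toy usage on a Gaussian RBF kernel matrix (illustrative only).
X = np.random.default_rng(0).normal(size=(100, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
R = incomplete_cholesky(K, d=20)
print(np.linalg.norm(K - R.T @ R))                   # small approximation error
```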

Page 9:

Random features

What basis to use?

π‘’π‘—πœ”β€²(π‘₯βˆ’π‘¦) can be replaced by cos (πœ” π‘₯ βˆ’ 𝑦 ) since both π‘˜ π‘₯ βˆ’ 𝑦

and 𝑝 πœ” real functions

cos πœ” π‘₯ βˆ’ 𝑦 = cos πœ”π‘₯ cos πœ”π‘¦ + sin πœ”π‘₯ sin πœ”π‘¦

For each πœ”, use feature [cos πœ”π‘₯ , sin πœ”π‘₯ ]

What randomness to use?

Randomly draw ω from p(ω)

E.g., for the Gaussian RBF kernel, ω is drawn from a Gaussian distribution
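A small sketch of random Fourier features for the Gaussian RBF kernel k(x, y) = exp(−γ‖x − y‖²), whose spectral density p(ω) is Gaussian; the function name and the γ parameterization are illustrative:

```python
import numpy as np

def random_fourier_features(X, D, gamma, seed=0):
    """Map x -> [cos(w_i^T x), sin(w_i^T x)]_{i=1..D} / sqrt(D) with
    w_i ~ N(0, 2*gamma*I), so that z(x)^T z(y) ~= exp(-gamma ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, X.shape[1]))  # w ~ p(w)
    P = X @ W.T                                                       # n x D
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(D)

X = np.random.default_rng(1).normal(size=(200, 3))
Z = random_fourier_features(X, D=500, gamma=0.5)
K_exact = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))
print(np.abs(Z @ Z.T - K_exact).max())   # small with enough random features
```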


Page 10:

String Kernels

Compare two sequences for similarity

Exact matching kernel

Counting all matching substrings

Flexible weighting scheme

Does not work well for noisy case

Successful applications in bio-informatics

Linear time algorithm using suffix trees

[Illustration: K( ACAAGAT GCCATTG TCCCCCG GCCTCCT GCTGCTG , GCATGAC GCCATTG ACCTGCT GGTCCTA ) = 0.7]

Page 11:

Exact matching string kernels

Bag of Characters

Count single characters: set w_s = 0 for |s| > 1

Bag of Words

s is bounded by whitespace

Limited range correlations

Set w_s = 0 for all |s| > n, given a fixed n

K-spectrum kernel

Account for matching substrings of length exactly k: set w_s = 0 for all |s| ≠ k
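A tiny sketch of the k-spectrum kernel by direct k-mer counting; this is the naive version, whereas the suffix-tree approach on the next slide handles the general weighted case with a linear time algorithm:

```python
from collections import Counter

def k_spectrum_kernel(s, t, k):
    """k-spectrum kernel: count matching substrings of length exactly k
    (i.e. w_s = 1 if |s| = k, else 0), via naive k-mer counting."""
    def kmers(x):
        return Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cs, ct = kmers(s), kmers(t)
    return sum(cs[m] * ct[m] for m in cs if m in ct)

print(k_spectrum_kernel("ACAAGAT", "GCATGAC", k=2))   # number of shared 2-mers
```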


Page 12:

Suffix trees

Definition: a compact tree built from all the suffixes of a string.

E.g., the suffix tree of ababc is denoted S(ababc)

Node Label = unique path from the root

Suffix links are used to speed up parsing of strings: if we are at node ax, then suffix links help us to jump to node x

Represent all the substrings of a given string

Can be constructed in linear time and stored in linear space

Each leaf corresponds to a unique suffix

The number of leaves in a node's subtree gives the number of occurrences of the corresponding substring

Page 13:

Combining classifiers

Average results from several different models

Bagging

Stacking (meta-learning)

Boosting

Why?

Better classification performance than individual classifiers

More resilience to noise

Concerns

Takes more time to obtain the final model

Overfitting


Page 14:

Bagging

Bagging: Bootstrap aggregating

Generate B bootstrap samples of the training data by uniform random sampling with replacement (sketched in code below)

Train a classifier or a regression function using each bootstrap sample

For classification: majority vote on the classification results

For regression: average on the predicted values

Advantage:

Simple

Reduce variance

Improves performance for unstable classifiers, which may vary significantly with small changes in the dataset.
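A compact sketch of the bagging procedure above for classification; `base_fit` is an assumed helper that fits any base model exposing a `.predict` method (e.g. a decision tree), inputs are assumed to be numpy arrays, and labels are assumed to be small non-negative integers so the majority vote can use `bincount`:

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, base_fit, B=25, seed=0):
    """Bagging: train B models on bootstrap samples (uniform sampling with
    replacement) and combine their predictions by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)             # one bootstrap sample
        model = base_fit(X_train[idx], y_train[idx])
        votes.append(model.predict(X_test))
    votes = np.stack(votes)                          # shape: B x n_test
    # majority vote over the B predictions for each test point
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

For regression, the majority vote in the last line would simply be replaced by an average of the B predictions.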


Page 15:

Bagging Example

Sample with replacement


Original        1  2  3  4  5  6  7  8
Training set 1  2  7  8  3  7  6  3  1
Training set 2  7  8  5  6  4  2  7  1
Training set 3  3  6  2  7  5  6  2  2
Training set 4  4  5  1  4  6  4  3  8

Page 16:

Stacking classifiers


Level-0 models are based on different learning models and use original data (level-0 data)

Level-1 models are based on results of level-0 models (level-1 data are outputs of level-0 models) -- also called β€œgeneralizer”

If you have many models, you can stack them into deeper hierarchies
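A two-level stacking sketch using scikit-learn; the specific level-0 and level-1 models are illustrative, binary labels are assumed, and out-of-fold predictions are used as level-1 data, which also helps with the overfitting concern mentioned earlier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def stack_two_levels(X, y, X_test):
    """Level-0 models predict on the original (level-0) data; a level-1
    'generalizer' is trained on their out-of-fold predictions (level-1 data)."""
    level0 = [DecisionTreeClassifier(max_depth=3), SVC(probability=True)]
    # level-1 training data: out-of-fold class probabilities of each level-0 model
    Z = np.column_stack([cross_val_predict(m, X, y, method="predict_proba")[:, 1]
                         for m in level0])
    Z_test = np.column_stack([m.fit(X, y).predict_proba(X_test)[:, 1]
                              for m in level0])
    level1 = LogisticRegression().fit(Z, y)          # the "generalizer"
    return level1.predict(Z_test)
```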

Page 17:

Boosting

Boosting: a general method for converting rough rules of thumb into a highly accurate prediction rule

A family of methods which produce a sequence of classifiers

Each classifier is dependent on the previous one and focuses on the previous one’s errors

Examples that are incorrectly predicted by the previous classifiers are chosen more often or weighted more heavily when estimating a new classifier.

Questions:

How to choose β€œhardest” examples?

How to combine these classifiers?


Page 18:

AdaBoost
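A minimal sketch of the standard discrete AdaBoost updates (labels in {−1, +1}), assuming `fit_weak` is a helper that trains a weak classifier, such as a half-plane decision stump, on weighted data:

```python
import numpy as np

def adaboost(X, y, fit_weak, T=50):
    """Discrete AdaBoost: reweight examples toward previous mistakes and
    combine the weak classifiers with weights alpha_t."""
    n = len(y)
    w = np.full(n, 1.0 / n)                            # start with uniform weights
    hs, alphas = [], []
    for _ in range(T):
        h = fit_weak(X, y, w)                          # weak learner on weighted data
        pred = h(X)
        err = np.sum(w * (pred != y))                  # weighted training error
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # classifier weight
        w *= np.exp(-alpha * y * pred)                 # up-weight misclassified points
        w /= w.sum()
        hs.append(h)
        alphas.append(alpha)
    # final classifier: weighted combination (majority vote) of weak classifiers
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))
```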


Page 19:

Toy Example

Weak classifier (rule of thumb): vertical or horizontal half-planes

Uniform weights on all examples


Page 20:

Boosting round 1

Choose a rule of thumb (weak classifier)

Some data points obtain higher weights because they are classified incorrectly


Page 21:

Boosting round 2

Choose a new rule of thumb

Reweight again: the weights of incorrectly classified examples are increased


Page 22:

Boosting round 3

Repeat the same process

Now we have 3 classifiers


Page 23:

Boosting aggregate classifier

The final classifier is a weighted combination of the weak classifiers
