Transcript of lsong/teaching/8803ML/lecture21.pdf
Fast Kernel Methods
Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Le Song
Kernel low rank approximation
Incomplete Cholesky factorization of the kernel matrix K of size n × n into a factor R of size k × n, with k ≪ n:

K ≈ R^⊤ R

Gaussian process posterior using the low-rank factor:

f(x) | {x_i, y_i}_{i=1}^n ~ GP(μ_post(x), k_post(x, x'))
μ_post(x) = r_x^⊤ (R R^⊤ + σ_noise² I)^{-1} R y
k_post(x, x') = k(x, x') − r_x^⊤ (R R^⊤ + σ_noise² I)^{-1} (R R^⊤) r_{x'}

where r_x ∈ ℝ^k collects the low-rank features of the test point x, so that k(x, x_i) ≈ r_x^⊤ R(:, i)
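The posterior mean above can be sketched in numpy. The function name and the use of the Woodbury identity (to turn the n × n solve into a k × k solve, which is the point of the low-rank factorization) are illustrative assumptions of this example, not notation from the slides:

```python
import numpy as np

def lowrank_gp_posterior_mean(R, y, r_x, noise_var):
    """Posterior mean at a test point using the low-rank factor R (k x n).

    With K ~= R^T R, the n x n solve (K + s2 I)^{-1} y is replaced by a
    k x k solve via the Woodbury identity, costing O(n k^2) instead of O(n^3).
    r_x holds the length-k features of the test point, i.e. k(x, X) ~= r_x^T R.
    """
    k, n = R.shape
    s2 = noise_var
    # Woodbury: (R^T R + s2 I)^{-1} = (I - R^T (R R^T + s2 I)^{-1} R) / s2
    M = R @ R.T + s2 * np.eye(k)                          # k x k system
    alpha = (y - R.T @ np.linalg.solve(M, R @ y)) / s2    # (K + s2 I)^{-1} y
    return r_x @ R @ alpha                                # mu_post(x)
```

The same trick applies to the posterior covariance, which reuses the identical k × k matrix M.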
Incomplete Cholesky Decomposition
We have a few things to understand:

Gram-Schmidt orthogonalization
Given a set of vectors V = {v_1, v_2, …, v_n}, find an orthonormal basis Q = {u_1, u_2, …, u_n} with u_i^⊤ u_j = 0 for i ≠ j and u_i^⊤ u_i = 1

QR decomposition
Given the orthonormal basis Q, compute the projections of V onto Q: v_i = Σ_j R_{ji} u_j with R_{ji} = ⟨u_j, v_i⟩, i.e.
V = QR

Cholesky decomposition with pivots
K ≈ R(1:k, :)^⊤ R(1:k, :)

Kernelization
Φ^⊤Φ = R^⊤ Q^⊤ Q R = R^⊤ R ≈ R(1:k, :)^⊤ R(1:k, :)
K = Φ^⊤Φ ≈ R(1:k, :)^⊤ R(1:k, :)
Gram-Schmidt orthogonalization
Given a set of vectors V = {v_1, v_2, …, v_n}, find an orthonormal basis Q = {u_1, u_2, …, u_n} with u_i^⊤ u_j = 0 for i ≠ j and u_i^⊤ u_i = 1

u_1 can be found by picking an arbitrary v_1 and normalizing:
u_1 = v_1 / ‖v_1‖

u_2 can be found by picking a vector v_2, subtracting out its component along u_1, and then normalizing:
w_2 = v_2 − ⟨v_2, u_1⟩ u_1
u_2 = w_2 / ‖w_2‖

In general:
w_i = v_i − Σ_{j=1}^{i−1} ⟨v_i, u_j⟩ u_j
u_i = w_i / ‖w_i‖
(Figure: v_1 normalized to u_1; v_2 decomposed into its projection on u_1 and the residual w_2, which is normalized to u_2.)
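The recursion above can be sketched as a minimal numpy implementation of classical Gram-Schmidt; it assumes the columns of V are linearly independent:

```python
import numpy as np

def gram_schmidt(V):
    """Orthonormalize the columns of V by classical Gram-Schmidt.

    Each u_i is v_i with its components along the earlier u_j subtracted
    out, then normalized; assumes linearly independent columns.
    """
    d, n = V.shape
    U = np.zeros((d, n))
    for i in range(n):
        w = V[:, i].copy()
        for j in range(i):
            w -= (V[:, i] @ U[:, j]) * U[:, j]   # subtract projection onto u_j
        U[:, i] = w / np.linalg.norm(w)
    return U
```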
QR decomposition
Essentially Gram-Schmidt orthogonalization, but keep both the orthonormal basis and the weights of the projections

Given a set of vectors V = {v_1, v_2, …, v_n}, find an orthonormal basis Q = {u_1, u_2, …, u_n} using Gram-Schmidt orthogonalization

The projection of v_i onto basis vector u_j is R_{ji} = ⟨v_i, u_j⟩:
v_1 = u_1 ⟨u_1, v_1⟩
v_2 = u_1 ⟨u_1, v_2⟩ + u_2 ⟨u_2, v_2⟩
v_3 = u_1 ⟨u_1, v_3⟩ + u_2 ⟨u_2, v_3⟩ + u_3 ⟨u_3, v_3⟩
…
v_i = Σ_{j=1}^{i} ⟨v_i, u_j⟩ u_j
QR decomposition
Because the basis vectors are formed from the original data points in order, vector v_i has only i nonzero coefficients:

v_i = Σ_{j=1}^{i} ⟨v_i, u_j⟩ u_j = Σ_{j=1}^{i} R_{ji} u_j

Collect terms into matrix format:

V = (v_1, …, v_n), v_i ∈ ℝ^d
Q = (u_1, …, u_n), R = (R_{:,1}, …, R_{:,n}) with zeros below the diagonal (upper triangular)
V = QR
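The matrix form can be sketched by extending Gram-Schmidt to also record the coefficients R_{ji} = ⟨u_j, v_i⟩; R comes out upper triangular because v_i is expressed using only u_1, …, u_i (a hypothetical helper, assuming linearly independent columns):

```python
import numpy as np

def gram_schmidt_qr(V):
    """QR factorization of V by Gram-Schmidt: keep the orthonormal basis Q
    and the projection weights R, so that V = Q R with R upper triangular."""
    d, n = V.shape
    Q = np.zeros((d, n))
    R = np.zeros((n, n))
    for i in range(n):
        w = V[:, i].copy()
        for j in range(i):
            R[j, i] = Q[:, j] @ V[:, i]      # R_{ji} = <u_j, v_i>
            w -= R[j, i] * Q[:, j]
        R[i, i] = np.linalg.norm(w)          # remaining length of v_i
        Q[:, i] = w / R[i, i]
    return Q, R
```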
QR decomposition with pivots
QR decomposition: if we choose only a few basis vectors, we get an approximation

The basis vectors are formed from the original data points:
How should we order/choose from the original data points
such that the approximation error is small?

Ordering/choosing from the data points = choosing pivots

V ≈ Q(:, 1:k) R(1:k, :)
Cholesky decomposition
If K is a symmetric and positive definite matrix, then K can be decomposed as

K = R^⊤ R

Since K is a kernel matrix, we can find an implicit feature space:
K = Φ^⊤Φ, where Φ = (φ(x_1), …, φ(x_n))

QR decomposition on Φ: Φ = QR, so
K = R^⊤ Q^⊤ Q R = R^⊤ R

Incomplete Cholesky decomposition
Use QR decomposition with pivots:
K ≈ R(1:k, :)^⊤ R(1:k, :)
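One plausible implementation of the pivoted factorization; the greedy rule of pivoting on the largest remaining diagonal residual is an assumption of this sketch (the slides do not spell out the pivoting rule):

```python
import numpy as np

def incomplete_cholesky(K, k, tol=1e-10):
    """Pivoted incomplete Cholesky of a PSD kernel matrix K, producing a
    k x n factor R with K ~= R^T R.

    Each step pivots on the point with the largest remaining diagonal
    (residual) error; only k rows of K are ever touched, so with a kernel
    function in place of a precomputed K this runs in O(n k^2).
    """
    n = K.shape[0]
    R = np.zeros((k, n))
    d = np.diag(K).astype(float)          # residual diagonal
    for j in range(k):
        p = int(np.argmax(d))             # pivot: largest residual
        if d[p] < tol:                    # remaining error negligible
            return R[:j]
        R[j] = (K[p] - R[:j, p] @ R[:j]) / np.sqrt(d[p])
        d -= R[j] ** 2
    return R
```

For a kernel matrix of exact rank r, r pivot steps reconstruct K exactly; in general k is chosen so the residual diagonal (the trace of the approximation error) falls below a tolerance.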
Random features
What basis to use?
The kernel is the Fourier transform of a density p(ω) (Bochner's theorem), and e^{iω(x−y)} can be replaced by cos(ω(x−y)) since both k(x − y) and p(ω) are real functions

cos(ω(x − y)) = cos(ωx) cos(ωy) + sin(ωx) sin(ωy)

For each ω, use the features [cos(ωx), sin(ωx)]

What randomness to use?
Randomly draw ω from p(ω)
E.g. for the Gaussian RBF kernel, ω is drawn from a Gaussian
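A sketch for the Gaussian RBF case; the parameterization k(x, y) = exp(−γ‖x − y‖²) and the corresponding frequency scale √(2γ) are assumptions of this example:

```python
import numpy as np

def rff_features(X, n_freqs, gamma, seed=0):
    """Random Fourier features for the Gaussian RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2).

    Its spectral density is Gaussian, so draw w ~ N(0, 2*gamma*I) and use
    the features [cos(w^T x), sin(w^T x)]; then z(x)^T z(y) ~= k(x, y).
    """
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    W = rng.randn(n_freqs, d) * np.sqrt(2.0 * gamma)  # w ~ N(0, 2*gamma*I)
    proj = X @ W.T                                    # n x n_freqs
    # 1/sqrt(n_freqs) makes the dot product a Monte Carlo average
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_freqs)
```

A linear model on z(X) then approximates the corresponding kernel machine, with Monte Carlo error shrinking as O(1/√n_freqs).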
String Kernels
Compare two sequences for similarity
Exact matching kernel
Counting all matching substrings
Flexible weighting scheme
Does not work well for noisy case
Successful applications in bio-informatics
Linear time algorithm using suffix trees

Example: K(ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTG, GCATGACGCCATTGACCTGCTGGTCCTA) = 0.7
Exact matching string kernels
Bag of characters
Count single characters; set w_s = 0 for |s| > 1

Bag of words
Substrings s are bounded by whitespace

Limited range correlations
Set w_s = 0 for all |s| > n, for a fixed n

k-spectrum kernel
Account for matching substrings of length k; set w_s = 0 for all |s| ≠ k
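The k-spectrum case can be sketched directly by counting length-k substrings, with all nonzero weights set to 1 (an assumption of this example). Hashing counters gives linear time for a fixed k; the suffix-tree algorithm of the next slide generalizes this to arbitrary weights:

```python
from collections import Counter

def k_spectrum_kernel(s, t, k):
    """k-spectrum kernel: number of matching length-k substring pairs,
    K(s, t) = sum over length-k strings u of count_s(u) * count_t(u)."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[u] * ct[u] for u in cs)
```

For example, "abab" contains "ab" twice and "ba" once, so K("abab", "ab", 2) counts two matching pairs.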
Suffix trees
Definition: a compact tree built from all the suffixes of a string.
E.g. the suffix tree of ababc is denoted by S(ababc)
Node label = unique path from the root
Suffix links are used to speed up parsing of strings: if we are at node ax, the suffix link lets us jump to node x
Represents all the substrings of a given string
Can be constructed in linear time and stored in linear space
Each leaf corresponds to a unique suffix
The leaves of a subtree give the number of occurrences of the corresponding substring
Combining classifiers
Average results from several different models
Bagging
Stacking (meta-learning)
Boosting
Why?
Better classification performance than individual classifiers
More resilience to noise
Concerns
Takes more time to obtain the final model
Overfitting
Bagging
Bagging: Bootstrap aggregating
Generate B bootstrap samples of the training data: uniformly random sampling with replacement
Train a classifier or a regression function using each bootstrap sample
For classification: majority vote on the classification results
For regression: average on the predicted values
Advantages:
Simple
Reduces variance
Improves performance for unstable classifiers, which may vary significantly with small changes in the dataset
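The procedure can be sketched as follows. The base learner `stump_fit` (a best-threshold rule on a single feature) is a hypothetical stand-in; any classifier returning labels in {−1, +1} works:

```python
import numpy as np

def bagging_predict(X, y, X_test, fit, B=25, seed=0):
    """Bagging: train `fit` on B bootstrap samples, then majority-vote.

    `fit(Xb, yb)` must return a function mapping inputs to {-1, +1} labels.
    """
    rng = np.random.RandomState(seed)
    n = len(y)
    votes = np.zeros(len(X_test))
    for _ in range(B):
        idx = rng.randint(0, n, size=n)   # sample n points with replacement
        predict = fit(X[idx], y[idx])
        votes += predict(X_test)
    return np.sign(votes)                 # majority vote over the B models

# Hypothetical base learner: best single-feature threshold (decision stump).
def stump_fit(X, y):
    best = None
    for j in range(X.shape[1]):
        for thr in X[:, j]:
            for sgn in (1, -1):
                err = np.mean(sgn * np.sign(X[:, j] - thr + 1e-12) != y)
                if best is None or err < best[0]:
                    best = (err, j, thr, sgn)
    _, j, thr, sgn = best
    return lambda Z: sgn * np.sign(Z[:, j] - thr + 1e-12)
```

For regression the vote is replaced by an average of the predicted values, as the slide notes.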
Bagging Example
Sample with replacement
Original:       1 2 3 4 5 6 7 8
Training set 1: 2 7 8 3 7 6 3 1
Training set 2: 7 8 5 6 4 2 7 1
Training set 3: 3 6 2 7 5 6 2 2
Training set 4: 4 5 1 4 6 4 3 8
Stacking classifiers
Level-0 models are based on different learning models and use original data (level-0 data)
Level-1 models are based on results of level-0 models (level-1 data are outputs of level-0 models) -- also called βgeneralizerβ
If you have lots of models, you can stack them into deeper hierarchies
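A minimal sketch of a level-1 "generalizer": it treats the level-0 models' outputs as features and fits a linear model on them. The choice of least squares for level 1 is an assumption of this example; any learner can play that role:

```python
import numpy as np

def stack_predict(level0_preds_train, y_train, level0_preds_test):
    """Level-1 model trained on level-1 data (the level-0 outputs).

    Columns of level0_preds_* hold the predictions of the level-0 models;
    the level-1 model here is linear least squares with a bias term.
    """
    A = np.column_stack([level0_preds_train, np.ones(len(y_train))])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)   # fit on level-1 data
    At = np.column_stack([level0_preds_test, np.ones(len(level0_preds_test))])
    return At @ w
```

In practice the level-1 data should come from held-out (e.g. cross-validated) level-0 predictions to avoid overfitting the meta-learner.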
Boosting
Boosting: a general method for converting rough rules of thumb into a highly accurate prediction rule
A family of methods which produce a sequence of classifiers
Each classifier is dependent on the previous one and focuses on the previous oneβs errors
Examples that are incorrectly predicted in the previous classifiers are chosen more often or weighted more heavily when estimating a new classifier.
Questions:
How to choose βhardestβ examples?
How to combine these classifiers?
AdaBoost
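The rounds described on the following slides can be sketched as standard AdaBoost: reweight the examples after each round so the next weak learner focuses on the previous one's mistakes, then combine the learners with weights α. The weighted stump below matches the toy example's vertical/horizontal half-plane rules; details such as the stopping test are assumptions of this sketch:

```python
import numpy as np

def adaboost(X, y, fit_weighted, T=10):
    """AdaBoost: `fit_weighted(X, y, w)` returns a {-1,+1} weak predictor
    trained under example weights w; returns the aggregate classifier."""
    n = len(y)
    w = np.full(n, 1.0 / n)               # round 0: uniform weights
    hs, alphas = [], []
    for _ in range(T):
        h = fit_weighted(X, y, w)
        pred = h(X)
        err = np.sum(w[pred != y])        # weighted training error
        if err >= 0.5 or err == 0:        # no better than chance, or perfect
            if err == 0:
                hs.append(h); alphas.append(1.0)
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)    # up-weight mistakes, down-weight hits
        w /= w.sum()
        hs.append(h); alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))

# Weak learner: weighted decision stump (axis-aligned half-plane).
def stump_fit_weighted(X, y, w):
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in X[:, j]:
            for sgn in (1, -1):
                pred = sgn * np.where(X[:, j] >= thr, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, sgn)
    _, j, thr, sgn = best
    return lambda Z, j=j, thr=thr, sgn=sgn: sgn * np.where(Z[:, j] >= thr, 1, -1)
```

Each α = ½ log((1 − err)/err) is larger for more accurate rounds, which is exactly the weighting used in the final aggregate classifier.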
Toy Example
Weak classifier (rule of thumb): vertical or horizontal half-planes
Uniform weights on all examples
Boosting round 1
Choose a rule of thumb (weak classifier)
Some data points obtain higher weights because they are classified incorrectly
Boosting round 2
Choose a new rule of thumb
Reweight again: the weights of incorrectly classified examples increase
Boosting round 3
Repeat the same process
Now we have 3 classifiers
Boosting aggregate classifier
The final classifier is a weighted combination of the weak classifiers