CptS 570 – Machine Learning
School of EECS
Washington State University
CptS 570 - Machine Learning 1
Or, support vector machine (SVM)
Discriminant-based method
◦ Learn class boundaries
The support vectors are the examples closest to the boundary
Kernel computes similarity between examples
◦ Maps the instance space to a higher-dimensional space where (hopefully) linear models suffice
Choosing the right kernel is crucial
Kernel machines are among the best-performing learners
CptS 570 - Machine Learning 2
Likely to underfit using only hyperplanes
But we can map the data to a nonlinear space and use hyperplanes there
◦ $\Phi: \mathbb{R}^d \rightarrow F$
◦ $\mathbf{x} \mapsto \Phi(\mathbf{x})$
CptS 570 - Machine Learning 3
[Figure: mapping $\Phi$ from the input space to the feature space]
Note we want $\geq +1$, not $\geq 0$
Want instances some distance from the hyperplane
CptS 570 - Machine Learning 4
$$X = \{\mathbf{x}^t, r^t\}, \quad \text{where } r^t = \begin{cases} +1 & \text{if } \mathbf{x}^t \in C_1 \\ -1 & \text{if } \mathbf{x}^t \in C_2 \end{cases}$$
Find $\mathbf{w}$ and $w_0$ such that
$$\mathbf{w}^T\mathbf{x}^t + w_0 \geq +1 \quad \text{for } r^t = +1$$
$$\mathbf{w}^T\mathbf{x}^t + w_0 \leq -1 \quad \text{for } r^t = -1$$
which can be rewritten as
$$r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq +1$$
Distance from instance $\mathbf{x}^t$ to the hyperplane $\mathbf{w}^T\mathbf{x} + w_0 = 0$:
Distance from hyperplane to closest instances is the margin
CptS 570 - Machine Learning 5
$$\frac{|\mathbf{w}^T\mathbf{x}^t + w_0|}{\|\mathbf{w}\|} \quad \text{or, for correctly classified } \mathbf{x}^t, \quad \frac{r^t(\mathbf{w}^T\mathbf{x}^t + w_0)}{\|\mathbf{w}\|}$$
[Figure: separating hyperplane with the margin around it]
Optimal separating hyperplane is the one maximizing the margin
We want to choose w maximizing ρ such that
Infinite number of solutions by scaling $\mathbf{w}$; fix $\rho\|\mathbf{w}\| = 1$
Thus, maximizing the margin $\rho$ means choosing the solution minimizing $\|\mathbf{w}\|$
CptS 570 - Machine Learning 6
$$\frac{r^t(\mathbf{w}^T\mathbf{x}^t + w_0)}{\|\mathbf{w}\|} \geq \rho, \quad \forall t$$
$$\min \ \tfrac{1}{2}\|\mathbf{w}\|^2 \ \text{ subject to } \ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq +1, \ \forall t$$
Quadratic optimization problem with complexity polynomial in d (#features)
Kernel will eventually map d-dimensional space to higher-dimensional space
Prefer complexity that does not depend on the number of dimensions
CptS 570 - Machine Learning 7
$$\min \ \tfrac{1}{2}\|\mathbf{w}\|^2 \ \text{ subject to } \ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq +1, \ \forall t$$
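As a rough illustration of this primal problem, the sketch below (assuming NumPy and SciPy are available, with a tiny made-up 2-D dataset) minimizes $\tfrac{1}{2}\|\mathbf{w}\|^2$ under the constraints $r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq 1$ using a general-purpose solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: rows are x^t, labels r^t in {+1, -1} (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
r = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the variables as v = [w1, w2, w0]; the objective is (1/2)||w||^2.
def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)

# One inequality constraint per example: r^t (w^T x^t + w0) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda v, x=x, rt=rt: rt * (np.dot(v[:2], x) + v[2]) - 1.0}
               for x, rt in zip(X, r)]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, w0 = res.x[:2], res.x[2]
print("w =", w, "w0 =", w0)
print("margins r^t(w^T x^t + w0):", r * (X @ w + w0))  # all >= 1; tight for support vectors
```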
Convert the optimization problem to depend on the number of training examples N (not d)
◦ Still polynomial in N
But the solution will depend only on the closest examples (the support vectors)
◦ Typically ≪ N
CptS 570 - Machine Learning 8
Rewrite quadratic optimization problem using Lagrange multipliers αt, 1 ≤ t ≤ N
Minimize Lp
CptS 570 - Machine Learning 9
$$\min \ \tfrac{1}{2}\|\mathbf{w}\|^2 \ \text{ subject to } \ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq +1, \ \forall t$$
$$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) - 1 \right]$$
$$\phantom{L_p} = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t r^t(\mathbf{w}^T\mathbf{x}^t + w_0) + \sum_{t=1}^{N} \alpha^t$$
Equivalently, we can maximize $L_p$ with respect to $\alpha^t$ subject to the constraints obtained by setting the derivatives with respect to $\mathbf{w}$ and $w_0$ to zero:
Plugging these into Lp …
CptS 570 - Machine Learning 10
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \ \Rightarrow \ \sum_{t=1}^{N} \alpha^t r^t = 0$$
Maximize $L_d$ with respect to $\alpha^t$ only
Complexity is $O(N^3)$
CptS 570 - Machine Learning 11
$$L_d = \tfrac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T\sum_t \alpha^t r^t \mathbf{x}^t - w_0\sum_t \alpha^t r^t + \sum_t \alpha^t$$
$$\phantom{L_d} = -\tfrac{1}{2}\mathbf{w}^T\mathbf{w} + \sum_t \alpha^t$$
$$\phantom{L_d} = -\tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T\mathbf{x}^s + \sum_t \alpha^t$$
$$\text{subject to } \ \sum_t \alpha^t r^t = 0 \ \text{ and } \ \alpha^t \geq 0, \ \forall t$$
Most $\alpha^t = 0$
◦ I.e., $r^t(\mathbf{w}^T\mathbf{x}^t + w_0) > 1$ ($\mathbf{x}^t$ lies outside the margin)
Support vectors: $\mathbf{x}^t$ such that $\alpha^t > 0$
◦ I.e., $r^t(\mathbf{w}^T\mathbf{x}^t + w_0) = 1$ ($\mathbf{x}^t$ lies on the margin)
$\mathbf{w} = \sum_t \alpha^t r^t \mathbf{x}^t$
$w_0 = r^t - \mathbf{w}^T\mathbf{x}^t$ for any support vector $\mathbf{x}^t$
◦ Typically averaged over all support vectors
The resulting discriminant is called the support vector machine (SVM); a numeric sketch follows below
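A minimal numeric sketch of these steps, assuming NumPy and SciPy and reusing a small made-up dataset: it maximizes $L_d$ under $\sum_t \alpha^t r^t = 0$ and $\alpha^t \geq 0$, then recovers $\mathbf{w}$ and $w_0$ from the support vectors.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
r = np.array([1.0, 1.0, -1.0, -1.0])
N = len(r)

# Q_ts = r^t r^s (x^t . x^s), so L_d(alpha) = sum(alpha) - 0.5 * alpha^T Q alpha.
Q = np.outer(r, r) * (X @ X.T)

def neg_dual(alpha):                                    # maximize L_d = minimize -L_d
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ r}]  # sum_t alpha^t r^t = 0
bounds = [(0.0, None)] * N                              # alpha^t >= 0

alpha = minimize(neg_dual, np.zeros(N), bounds=bounds, constraints=constraints).x

sv = alpha > 1e-6                        # support vectors: alpha^t > 0
w = (alpha[sv] * r[sv]) @ X[sv]          # w = sum_t alpha^t r^t x^t
w0 = np.mean(r[sv] - X[sv] @ w)          # average r^t - w^T x^t over support vectors
print("support vectors:", np.where(sv)[0], "w =", w, "w0 =", w0)
```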
CptS 570 - Machine Learning 12
CptS 570 - Machine Learning 13
[Figure: maximum-margin hyperplane; circled points (O) are the support vectors defining the margin]
Data not linearly separable
Find the hyperplane with least error
Define slack variables $\xi^t \geq 0$ storing the deviation from the margin
CptS 570 - Machine Learning 14
$$r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq 1 - \xi^t$$
(a) Correctly classified example far from the margin ($\xi^t = 0$)
(b) Correctly classified example on the margin ($\xi^t = 0$)
(c) Correctly classified example, but inside the margin ($0 < \xi^t < 1$)
(d) Incorrectly classified example ($\xi^t \geq 1$)
Soft error $= \sum_t \xi^t$
CptS 570 - Machine Learning 15
CptS 570 - Machine Learning 16
[Figure: soft-margin hyperplane; circled points (O) are the support vectors, with the margin shown]
Lagrangian equation with slack variables
$C$ is the penalty factor
$\mu^t \geq 0$ are a new set of Lagrange multipliers
Want to minimize $L_p$
CptS 570 - Machine Learning 17
$$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \xi^t - \sum_t \alpha^t\left[ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$
Minimize Lp by setting derivatives to zero
Plugging these into $L_p$ yields the dual $L_d$
Maximize $L_d$ with respect to $\alpha^t$
CptS 570 - Machine Learning 18
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \ \Rightarrow \ \sum_{t=1}^{N} \alpha^t r^t = 0$$
$$\frac{\partial L_p}{\partial \xi^t} = 0 \ \Rightarrow \ C - \alpha^t - \mu^t = 0$$
Quadratic optimization problem
Support vectors have $\alpha^t > 0$
◦ Examples on the margin: $\alpha^t < C$
◦ Examples inside the margin or misclassified: $\alpha^t = C$
CptS 570 - Machine Learning 19
$$L_d = -\tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T\mathbf{x}^s + \sum_t \alpha^t$$
$$\text{subject to } \ \sum_t \alpha^t r^t = 0 \ \text{ and } \ 0 \leq \alpha^t \leq C, \ \forall t$$
$C$ is a regularization parameter
◦ High $C$: high penalty for non-separable examples (may overfit)
◦ Low $C$: less penalty (may underfit)
◦ Determine using a validation set ($C = 1$ is typical); see the sketch below
CptS 570 - Machine Learning 20
$$L_d = -\tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T\mathbf{x}^s + \sum_t \alpha^t$$
$$\text{subject to } \ \sum_t \alpha^t r^t = 0 \ \text{ and } \ 0 \leq \alpha^t \leq C, \ \forall t$$
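To illustrate the role of $C$, a short sketch using scikit-learn's SVC (which solves this soft-margin dual internally) on made-up overlapping Gaussian blobs; the exact counts will vary with the random data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping 2-D classes (made up), so the data are not perfectly separable.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)), rng.normal(1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Higher C penalizes slack more heavily: typically fewer support vectors, tighter fit.
    print(f"C={C:7.2f}  support vectors={clf.n_support_.sum():3d}  "
          f"train accuracy={clf.score(X, y):.2f}")
```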
To use previous approaches, data must be near linearly separable
If not, perhaps a transformation φ(x) will help
The functions $\phi_j(\mathbf{x})$ are basis functions
CptS 570 - Machine Learning 21
[Figure: basis functions $\phi$ mapping the $\mathbf{x}$-space to the $\mathbf{z}$-space]
Transform d-dimensional x space to k-dimensional z space using basis functions φ(x)
$\mathbf{z} = \boldsymbol{\phi}(\mathbf{x})$ where $z_j = \phi_j(\mathbf{x})$, $j = 1, \ldots, k$
Instead of $w_0$, assume $z_1 = \phi_1(\mathbf{x}) \equiv 1$
CptS 570 - Machine Learning 22
$$g(\mathbf{z}) = \mathbf{w}^T\mathbf{z}$$
$$g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = \sum_{j=1}^{k} w_j \phi_j(\mathbf{x})$$
Replace the inner product of basis functions $\boldsymbol{\phi}(\mathbf{x}^t)^T\boldsymbol{\phi}(\mathbf{x}^s)$ with a kernel function $K(\mathbf{x}^t, \mathbf{x}^s)$
CptS 570 - Machine Learning 23
$$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \xi^t - \sum_t \alpha^t\left[ r^t \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^t) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$
$$L_d = -\tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s \boldsymbol{\phi}(\mathbf{x}^t)^T\boldsymbol{\phi}(\mathbf{x}^s) + \sum_t \alpha^t$$
$$\text{subject to } \ \sum_t \alpha^t r^t = 0 \ \text{ and } \ 0 \leq \alpha^t \leq C, \ \forall t$$
$$L_d = -\tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s K(\mathbf{x}^t, \mathbf{x}^s) + \sum_t \alpha^t$$
Kernel $K(\mathbf{x}^t, \mathbf{x}^s)$ computes the $\mathbf{z}$-space inner product $\boldsymbol{\phi}(\mathbf{x}^t)^T\boldsymbol{\phi}(\mathbf{x}^s)$ in the original $\mathbf{x}$-space
The matrix of kernel values $\mathbf{K}$, where $K_{ts} = K(\mathbf{x}^t, \mathbf{x}^s)$, is called the Gram matrix
$\mathbf{K}$ should be symmetric and positive semidefinite (checked numerically in the sketch below)
CptS 570 - Machine Learning 24
$$\mathbf{w} = \sum_t \alpha^t r^t \mathbf{z}^t = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)$$
$$g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)^T\boldsymbol{\phi}(\mathbf{x})$$
$$g(\mathbf{x}) = \sum_t \alpha^t r^t K(\mathbf{x}^t, \mathbf{x})$$
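As a quick check of the Gram-matrix properties above, the sketch below (assuming NumPy, with a made-up linear kernel and random data) builds $K_{ts} = K(\mathbf{x}^t, \mathbf{x}^s)$ and verifies symmetry and non-negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))           # 5 made-up examples with 3 features each

def linear_kernel(x, y):              # K(x, y) = x^T y
    return float(x @ y)

# Gram matrix K_ts = K(x^t, x^s)
K = np.array([[linear_kernel(xt, xs) for xs in X] for xt in X])

print("symmetric:", np.allclose(K, K.T))
print("eigenvalues (should all be >= 0):", np.round(np.linalg.eigvalsh(K), 6))
```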
Polynomial kernel of degree q
If $q = 1$, we use the original features
For example, when $q = 2$ and $d = 2$:
CptS 570 - Machine Learning 25
$$K(\mathbf{x}^t, \mathbf{x}) = (\mathbf{x}^T\mathbf{x}^t + 1)^q$$
$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y} + 1)^2 = (x_1 y_1 + x_2 y_2 + 1)^2$$
$$= 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$
which corresponds to the basis functions
$$\boldsymbol{\phi}(\mathbf{x}) = \left[1, \ \sqrt{2}x_1, \ \sqrt{2}x_2, \ \sqrt{2}x_1 x_2, \ x_1^2, \ x_2^2\right]^T$$
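A small numerical check of this identity, assuming NumPy and using arbitrary 2-D vectors: the kernel computed directly in x-space matches the inner product of the explicit 6-dimensional expansion.

```python
import numpy as np

def phi(x):
    # Explicit basis expansion for q = 2, d = 2: [1, sqrt(2)x1, sqrt(2)x2, sqrt(2)x1x2, x1^2, x2^2]
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

def poly_kernel(x, y, q=2):
    return (np.dot(x, y) + 1.0) ** q

x, y = np.array([0.5, -1.2]), np.array([2.0, 0.3])
print(poly_kernel(x, y))   # kernel evaluated directly in the original x-space
print(phi(x) @ phi(y))     # same value via the explicit feature map
```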
Polynomial kernel of degree 2
CptS 570 - Machine Learning 26
[Figure: decision boundary of a degree-2 polynomial kernel; circled points (O) are the support vectors, with the margin shown]
Radial basis functions (Gaussian kernel)
$\mathbf{x}^t$ is the center
$s$ is the radius
Larger $s$ implies smoother boundaries
CptS 570 - Machine Learning 27
$$K(\mathbf{x}^t, \mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^t\|^2}{2s^2}\right)$$
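A minimal sketch of the Gaussian kernel, assuming NumPy and a few made-up points; it shows that a larger radius $s$ pushes all pairwise similarities toward 1 (smoother boundaries).

```python
import numpy as np

def rbf_gram(X, s):
    # K(x^t, x^s) = exp(-||x^t - x^s||^2 / (2 s^2)) for all pairs of rows of X.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * s ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
for s in (0.5, 2.0, 10.0):
    print(f"s = {s}:")
    print(np.round(rbf_gram(X, s), 3))   # larger s -> entries closer to 1
```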
CptS 570 - Machine Learning 28
Sigmoidal functions
CptS 570 - Machine Learning 29
$$K(\mathbf{x}^t, \mathbf{x}) = \tanh(2\,\mathbf{x}^T\mathbf{x}^t + 1)$$
Kernel K(x,y) increases with similarity between x and y
Prior knowledge can be included in the kernel function
E.g., training examples are documents
◦ $K(D_1, D_2)$ = number of shared words
E.g., training examples are strings (e.g., DNA)
◦ $K(S_1, S_2)$ = 1 / edit distance between $S_1$ and $S_2$ (sketched below)
◦ Edit distance is the number of insertions, deletions, and/or substitutions needed to transform $S_1$ into $S_2$
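A rough sketch of the string-kernel idea above, in plain Python with no libraries; edit_distance and string_kernel are illustrative names, and giving identical strings a similarity of 1.0 is an arbitrary choice to avoid dividing by zero.

```python
def edit_distance(s1, s2):
    # Dynamic-programming Levenshtein distance: minimum number of
    # insertions, deletions, and substitutions turning s1 into s2.
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def string_kernel(s1, s2):
    dist = edit_distance(s1, s2)
    return 1.0 if dist == 0 else 1.0 / dist   # K(S1, S2) = 1 / edit distance

print(string_kernel("GATTACA", "GACTATA"))
```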
CptS 570 - Machine Learning 30
E.g., training examples are nodes in a graph (e.g., a social network)
◦ $K(N_1, N_2)$ = 1 / length of the shortest path connecting the nodes (sketched below)
◦ $K(N_1, N_2)$ = number of paths connecting the nodes
◦ Diffusion kernel
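A small sketch of the shortest-path version, assuming the networkx package is available and using its built-in karate-club social network; node_kernel is an illustrative name, and capping identical nodes at 1.0 is an arbitrary choice.

```python
import networkx as nx

G = nx.karate_club_graph()   # small social network bundled with networkx

def node_kernel(G, n1, n2):
    # K(N1, N2) = 1 / length of the shortest path connecting the nodes.
    length = nx.shortest_path_length(G, source=n1, target=n2)
    return 1.0 if length == 0 else 1.0 / length

print(node_kernel(G, 0, 33))   # farther apart -> smaller similarity
print(node_kernel(G, 0, 1))    # adjacent nodes -> similarity 1.0
```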
CptS 570 - Machine Learning 31
Training examples are graphs, not feature vectors
◦ E.g., carcinogenic vs. non-carcinogenic chemical structures
Compare substructures of graphs
◦ E.g., walks, paths, cycles, trees, subgraphs
K(G1,G2) = number of identical random walks in both graphs
K(G1,G2) = number of subgraphs shared by both graphs
CptS 570 - Machine Learning 32
Training data may come from multiple modalities (e.g., biometrics, social network, audio/visual)
Construct new kernels by combining simpler kernels
If $K_1(\mathbf{x}, \mathbf{y})$ and $K_2(\mathbf{x}, \mathbf{y})$ are valid kernels and $c > 0$ is a constant, then the following are valid kernels:
CptS 570 - Machine Learning 33
$$K(\mathbf{x}, \mathbf{y}) = c\,K_1(\mathbf{x}, \mathbf{y})$$
$$K(\mathbf{x}, \mathbf{y}) = K_1(\mathbf{x}, \mathbf{y}) + K_2(\mathbf{x}, \mathbf{y})$$
$$K(\mathbf{x}, \mathbf{y}) = K_1(\mathbf{x}, \mathbf{y})\,K_2(\mathbf{x}, \mathbf{y})$$
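A quick numerical illustration of these closure properties, assuming NumPy: Gram matrices for a polynomial and a Gaussian kernel are scaled, added, and multiplied elementwise, and each result stays positive semidefinite (minimum eigenvalue ≥ 0 up to rounding).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))   # made-up examples

def gram(kernel, X):
    return np.array([[kernel(x, y) for y in X] for x in X])

K1 = gram(lambda x, y: (x @ y + 1.0) ** 2, X)                    # polynomial kernel
K2 = gram(lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0), X)   # Gaussian kernel

for name, K in (("c*K1", 3.0 * K1), ("K1+K2", K1 + K2), ("K1*K2", K1 * K2)):
    print(f"{name:6s} min eigenvalue = {np.linalg.eigvalsh(K).min():.6f}")
```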
Adaptive kernel combination
Learn both the $\alpha^t$'s and the $\eta_i$'s
CptS 570 - Machine Learning 34
$$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{m} \eta_i K_i(\mathbf{x}, \mathbf{y})$$
$$L_d = \sum_t \alpha^t - \tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s \left(\sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x}^s)\right)$$
$$g(\mathbf{x}) = \sum_t \alpha^t r^t \sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x})$$
Learn $K$ different kernel machines $g_i(\mathbf{x})$
◦ Each uses one class as positive, the remaining classes as negative
◦ Choose class $i$ such that $i = \arg\max_j g_j(\mathbf{x})$
◦ Best approach in practice (see the sketch below)
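A minimal one-vs-all sketch using scikit-learn (OneVsRestClassifier wraps one SVC per class and predicts the argmax of the per-class scores), on the bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One kernel machine per class: that class is positive, the remaining classes negative.
ova = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)
print("training accuracy:", ova.score(X, y))
```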
CptS 570 - Machine Learning 35
Learn $K(K-1)/2$ kernel machines
◦ Each uses one class as positive and another class as negative
◦ Easier (faster) learning per kernel machine
CptS 570 - Machine Learning 36
Learn all margins at once
◦ $z^t$ is the class index of $\mathbf{x}^t$
$K \cdot N$ variables to optimize (expensive)
CptS 570 - Machine Learning 37
$$\min \ \tfrac{1}{2}\sum_{i=1}^{K}\|\mathbf{w}_i\|^2 + C\sum_i\sum_t \xi_i^t$$
$$\text{subject to } \ \mathbf{w}_{z^t}^T\mathbf{x}^t + w_{z^t 0} \geq \mathbf{w}_i^T\mathbf{x}^t + w_{i0} + 2 - \xi_i^t, \quad \forall t, \ i \neq z^t, \qquad \xi_i^t \geq 0$$
Normally, we would use squared error
For SVM, we use the $\epsilon$-sensitive loss
◦ Tolerate errors up to $\epsilon$
◦ Errors beyond $\epsilon$ have only a linear effect
CptS 570 - Machine Learning 38
$$e(r^t, f(\mathbf{x}^t)) = [r^t - f(\mathbf{x}^t)]^2, \qquad f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$$
$$e_\epsilon(r^t, f(\mathbf{x}^t)) = \begin{cases} 0 & \text{if } |r^t - f(\mathbf{x}^t)| < \epsilon \\ |r^t - f(\mathbf{x}^t)| - \epsilon & \text{otherwise} \end{cases}$$
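A one-function sketch of the $\epsilon$-sensitive loss, assuming NumPy; eps_sensitive_loss is an illustrative name and the numbers are made up.

```python
import numpy as np

def eps_sensitive_loss(r, f_x, eps=0.1):
    # 0 inside the epsilon tube; |r - f(x)| - eps (linear) outside it.
    err = np.abs(r - f_x)
    return np.where(err < eps, 0.0, err - eps)

r   = np.array([1.00, 1.05, 2.00])
f_x = np.array([1.00, 1.00, 1.00])
print(eps_sensitive_loss(r, f_x, eps=0.1))   # -> [0.0, 0.0, 0.9]
```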
Use slack variables to account for deviations beyond $\epsilon$
◦ $\xi_+^t$ for positive deviations
◦ $\xi_-^t$ for negative deviations
CptS 570 - Machine Learning 39
$$\min \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \left(\xi_+^t + \xi_-^t\right)$$
subject to
$$r^t - (\mathbf{w}^T\mathbf{x}^t + w_0) \leq \epsilon + \xi_+^t$$
$$(\mathbf{w}^T\mathbf{x}^t + w_0) - r^t \leq \epsilon + \xi_-^t$$
$$\xi_+^t, \ \xi_-^t \geq 0$$
Non-support vectors (inside the margin): $\alpha_+^t = \alpha_-^t = 0$
Support vectors
◦ ⊗ on the margin: $0 < \alpha_+^t < C$ or $0 < \alpha_-^t < C$
◦ ⊠ outside the margin (outlier): $\alpha_+^t = C$ or $\alpha_-^t = C$
CptS 570 - Machine Learning 40
$$L_d = -\tfrac{1}{2}\sum_t\sum_s (\alpha_+^t - \alpha_-^t)(\alpha_+^s - \alpha_-^s)(\mathbf{x}^t)^T\mathbf{x}^s - \epsilon\sum_t(\alpha_+^t + \alpha_-^t) + \sum_t r^t(\alpha_+^t - \alpha_-^t)$$
$$\text{subject to } \ 0 \leq \alpha_+^t \leq C, \quad 0 \leq \alpha_-^t \leq C, \quad \sum_t(\alpha_+^t - \alpha_-^t) = 0$$
CptS 570 - Machine Learning 41
Fitted line f(x) is weighted sum of support vectors
Average w0 over:
CptS 570 - Machine Learning 42
$$r^t = \mathbf{w}^T\mathbf{x}^t + w_0 + \epsilon \quad \text{if } 0 < \alpha_+^t < C$$
$$r^t = \mathbf{w}^T\mathbf{x}^t + w_0 - \epsilon \quad \text{if } 0 < \alpha_-^t < C$$
$$\mathbf{w} = \sum_t (\alpha_+^t - \alpha_-^t)\,\mathbf{x}^t$$
$$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 = \sum_t (\alpha_+^t - \alpha_-^t)(\mathbf{x}^t)^T\mathbf{x} + w_0$$
CptS 570 - Machine Learning 43
$$L_d = -\tfrac{1}{2}\sum_t\sum_s (\alpha_+^t - \alpha_-^t)(\alpha_+^s - \alpha_-^s)\,K(\mathbf{x}^t, \mathbf{x}^s) - \epsilon\sum_t(\alpha_+^t + \alpha_-^t) + \sum_t r^t(\alpha_+^t - \alpha_-^t)$$
$$\text{subject to } \ 0 \leq \alpha_+^t \leq C, \quad 0 \leq \alpha_-^t \leq C, \quad \sum_t(\alpha_+^t - \alpha_-^t) = 0$$
$$f(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) + w_0 = \sum_t (\alpha_+^t - \alpha_-^t)\,K(\mathbf{x}^t, \mathbf{x}) + w_0$$
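For a concrete run of kernelized support vector regression, a sketch using scikit-learn's SVR (which optimizes this $\epsilon$-insensitive dual internally) on made-up noisy sine data; here gamma plays the role of $1/(2s^2)$ for the Gaussian kernel.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=(80, 1)), axis=0)
r = np.sin(X).ravel() + rng.normal(0.0, 0.1, size=80)      # noisy targets

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0).fit(X, r)
print("support vectors:", len(svr.support_))
print("train R^2:", round(svr.score(X, r), 3))
```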
[Figures: kernel regression fits using a polynomial (quadratic) kernel and a Gaussian kernel]
CptS 570 - Machine Learning 44
Sequential Minimal Optimization (SMO)
◦ Classification: SMO
◦ Regression: SMOreg
Kernels
◦ Polynomial
◦ RBF
◦ String
CptS 570 - Machine Learning 45
Seek the optimal separating hyperplane
Support vector machine (SVM) finds the hyperplane using only the closest examples
Kernel function allows the SVM to operate in higher dimensions
Kernel regression
Choosing the correct kernel is crucial
Kernel machines are among the best-performing learners
CptS 570 - Machine Learning 46