Classification (SVMs / Kernel method)


Page 1: Classification  ( SVMs  / Kernel method)

Sp’10 Bafna/Ideker

Classification (SVMs / Kernel method)

Page 2: Classification  ( SVMs  / Kernel method)


LP versus Quadratic programming

LP:   min cᵀx    s.t.  Ax ≤ b,  x ≥ 0

QP:   min xᵀQx + cᵀx    s.t.  Ax ≤ b,  x ≥ 0

• LP: linear constraints, linear objective function

• LP can be solved in polynomial time.

• In QP, the objective function contains a quadratic form.

• For +ve semidefinite Q, the QP can be solved in polynomial time.
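A small worked instance of the LP form above may help; the following sketch uses SciPy's linprog on made-up values for c, A, and b (none of these numbers come from the slides).

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical LP instance: min c^T x  s.t.  Ax <= b, x >= 0
c = np.array([-1.0, -2.0])               # minimize -x1 - 2*x2 (i.e. maximize x1 + 2*x2)
A = np.array([[1.0, 1.0], [1.0, 3.0]])   # inequality constraint matrix
b = np.array([4.0, 6.0])                 # right-hand side

# linprog's default bounds are (0, None), i.e. x >= 0
res = linprog(c, A_ub=A, b_ub=b)
print(res.x, res.fun)                    # optimum at x = (3, 1), objective -5
```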

Page 3: Classification  ( SVMs  / Kernel method)


Margin of separation

• Suppose we find a separating hyperplane (β, β₀) s.t.
  – For all +ve points x:  βᵀx − β₀ ≥ 1
  – For all −ve points x:  βᵀx − β₀ ≤ −1

• What is the margin of separation?

βᵀx − β₀ = 0,   βᵀx − β₀ = 1,   βᵀx − β₀ = −1

(The distance between the two margin hyperplanes is 2/‖β‖.)

Page 4: Classification  ( SVMs  / Kernel method)


Separating by a wider margin

• Solutions with a wider margin are better.

Maximize 2/‖β‖, or equivalently minimize ‖β‖²/2.

Page 5: Classification  ( SVMs  / Kernel method)


Separating via misclassification

• In general, data is not linearly separable.
• What if we also wanted to minimize misclassified points?
• Recall that each sample xᵢ in our training set has a label yᵢ ∈ {−1, 1}.
• For each point i, yᵢ(βᵀxᵢ − β₀) should be positive.
• Define ξᵢ ≥ max{0, 1 − yᵢ(βᵀxᵢ − β₀)}.
• If i is correctly classified (yᵢ(βᵀxᵢ − β₀) ≥ 1), then ξᵢ = 0.
• If i is incorrectly classified, or close to the boundary, then ξᵢ > 0.
• We must minimize Σᵢ ξᵢ (see the sketch below).
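A minimal numpy sketch of the slack computation just described; the data (X, y) and the hyperplane (beta, beta0) are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical training data and hyperplane
X = np.array([[2.0, 1.0], [0.5, 0.5], [-1.0, -2.0], [0.2, -0.1]])
y = np.array([1, 1, -1, -1])
beta, beta0 = np.array([1.0, 1.0]), 0.0

# xi_i = max(0, 1 - y_i * (beta^T x_i - beta0))  -- the hinge slack
margins = y * (X @ beta - beta0)
xi = np.maximum(0.0, 1.0 - margins)
print(xi)          # 0 for points beyond the margin, > 0 otherwise
print(xi.sum())    # the quantity we want to keep small
```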

Page 6: Classification  ( SVMs  / Kernel method)


Support Vector machines (wide margin and misclassification)

• Maximize margin while minimizing misclassification.

• Solved using non-linear optimization techniques

• The problem can be reformulated so that it uses only dot products of the data points, which allows us to employ the kernel method.

• This gives a lot of power to the method.

min  ‖β‖²/2 + C Σᵢ ξᵢ

Page 7: Classification  ( SVMs  / Kernel method)


Reformulating the optimization

min  ‖β‖²/2 + C Σᵢ ξᵢ

s.t.  ξᵢ ≥ 0
      ξᵢ ≥ 1 − yᵢ(βᵀxᵢ − β₀)
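This primal problem can be handed directly to a generic convex solver. Below is a hedged sketch using the cvxpy library (an assumption; the slides do not mention any particular solver), with made-up X, y, and C.

```python
import numpy as np
import cvxpy as cp

# Made-up data: rows of X are samples, y in {-1, +1}
X = np.array([[2.0, 1.0], [0.5, 0.5], [-1.0, -2.0], [0.2, -0.1]])
y = np.array([1, 1, -1, -1])
n, p = X.shape
C = 1.0

beta = cp.Variable(p)
beta0 = cp.Variable()
xi = cp.Variable(n)

# min ||beta||^2 / 2 + C * sum_i xi_i, with the two slack constraints
objective = cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi))
constraints = [xi >= 0,
               xi >= 1 - cp.multiply(y, X @ beta - beta0)]
cp.Problem(objective, constraints).solve()
print(beta.value, beta0.value)
```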

Page 8: Classification  ( SVMs  / Kernel method)


Lagrangian relaxation

L = ‖β‖²/2 + C Σᵢ ξᵢ − Σᵢ αᵢ (ξᵢ − 1 + yᵢ(βᵀxᵢ − β₀)) − Σᵢ λᵢ ξᵢ

• Goal:   min  ‖β‖²/2 + C Σᵢ ξᵢ
• s.t.    ξᵢ ≥ 0,   ξᵢ ≥ 1 − yᵢ(βᵀxᵢ − β₀)
• We minimize the Lagrangian L.

Page 9: Classification  ( SVMs  / Kernel method)


Simplifying

L = ‖β‖²/2 + C Σᵢ ξᵢ − Σᵢ αᵢ (ξᵢ − 1 + yᵢ(βᵀxᵢ − β₀)) − Σᵢ λᵢ ξᵢ

  = βᵀ( β/2 − Σᵢ αᵢ yᵢ xᵢ ) + Σᵢ (C − αᵢ − λᵢ) ξᵢ + β₀ Σᵢ αᵢ yᵢ + Σᵢ αᵢ

• For fixed α ≥ 0, λ ≥ 0, we minimize the Lagrangian:

∂L/∂β  = β − Σᵢ yᵢ αᵢ xᵢ = 0      (1)

∂L/∂β₀ = Σᵢ yᵢ αᵢ = 0      (2)

∂L/∂ξᵢ = C − αᵢ − λᵢ = 0      (3)

Page 10: Classification  ( SVMs  / Kernel method)


Substituting

• Substituting (1):

L = ‖β‖²/2 + C Σᵢ ξᵢ − Σᵢ αᵢ (ξᵢ − 1 + yᵢ(βᵀxᵢ − β₀)) − Σᵢ λᵢ ξᵢ

  = βᵀ( β/2 − Σᵢ αᵢ yᵢ xᵢ ) + Σᵢ (C − αᵢ − λᵢ) ξᵢ + β₀ Σᵢ αᵢ yᵢ + Σᵢ αᵢ

L = −(1/2) Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ (C − αᵢ − λᵢ) ξᵢ + β₀ Σᵢ αᵢ yᵢ + Σᵢ αᵢ

Page 11: Classification  ( SVMs  / Kernel method)


• Substituting (2) and (3), the ξ and β₀ terms vanish, leaving the dual problem in α:

L = −(1/2) Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ αᵢ

max   −(1/2) Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ αᵢ
s.t.   Σᵢ yᵢ αᵢ = 0
       0 ≤ αᵢ ≤ C

Page 12: Classification  ( SVMs  / Kernel method)


Classification using SVMs

• Under these conditions, the problem is a quadratic programming problem and can be solved using known techniques

• Quiz: When we have solved this QP, how do we classify a point x?

f(x) = βᵀx − β₀ = Σᵢ yᵢ αᵢ xᵢᵀx − β₀

(Classify x as +1 if f(x) > 0, and as −1 otherwise.)
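A small numpy sketch of the decision function above, assuming the dual QP has already been solved and the optimal αᵢ are in hand; all numeric values are made up.

```python
import numpy as np

# Assume these came from solving the dual QP (values are illustrative only)
X = np.array([[2.0, 1.0], [0.5, 0.5], [-1.0, -2.0], [0.2, -0.1]])  # training samples
y = np.array([1, 1, -1, -1])
alpha = np.array([0.0, 0.7, 0.3, 0.4])       # dual variables, 0 <= alpha_i <= C
beta0 = 0.1

def f(x_new):
    # f(x) = sum_i y_i * alpha_i * x_i^T x  -  beta0
    return np.sum(y * alpha * (X @ x_new)) - beta0

x_new = np.array([1.0, 0.5])
print(f(x_new), np.sign(f(x_new)))   # the sign gives the predicted class
```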

Page 13: Classification  ( SVMs  / Kernel method)


The kernel method

• The SVM formulation can be solved using QP on dot-products.

• As these are wide-margin classifiers, they provide a more robust solution.

• However, the true power of the SVM approach comes from using 'the kernel method', which allows us to go to higher-dimensional (and non-linear) spaces.

Page 14: Classification  ( SVMs  / Kernel method)


kernel

• Let X be the set of objects.
  – Ex: X = the set of samples in micro-arrays.
  – Each object x ∈ X is a vector of gene expression values.
• k: X × X → R is a positive semidefinite kernel if
  – k is symmetric:   k(x, x') = k(x', x)
  – k is +ve semidefinite: for any p objects, the p × p matrix K with entries Kᵢⱼ = k(xᵢ, xⱼ) satisfies cᵀKc ≥ 0 for all c ∈ Rᵖ

Page 15: Classification  ( SVMs  / Kernel method)


Kernels as dot-product

• Quiz: Suppose the objects x are all real vectors (as in gene expression)

• Define

• Is k_L a kernel? It is symmetric, but is it +ve semidefinite?

k_L(x, x') = xᵀx'

Page 16: Classification  ( SVMs  / Kernel method)


Linear kernel is +ve semidefinite

• Recall X as a matrix in which each column is a sample: X = [x₁ x₂ …]
• By definition, the Gram matrix of the linear kernel is K_L = XᵀX.
• For any c:   cᵀK_L c = cᵀXᵀXc = ‖Xc‖² ≥ 0
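A quick numerical illustration of this argument on made-up data: the quadratic form cᵀ(XᵀX)c equals ‖Xc‖² and is therefore never negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))      # 5 "genes" x 8 "samples"; columns are samples
K = X.T @ X                      # Gram matrix of the linear kernel

c = rng.normal(size=8)
lhs = c @ K @ c                  # c^T K c
rhs = np.linalg.norm(X @ c) ** 2 # ||Xc||^2
print(lhs, rhs)                  # equal up to floating point, and >= 0
```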

Page 17: Classification  ( SVMs  / Kernel method)


Generalizing kernels

• Any object can be represented by a feature vector in real space.

φ : X → Rᵖ

k(x, x') = φ(x)ᵀφ(x')

Page 18: Classification  ( SVMs  / Kernel method)


Generalizing

• Note that the feature mapping could actually be non-linear.

• On the flip side, every such kernel can be represented as a dot product in a high-dimensional space.

• Sometimes the kernel is easier to define than the mapping φ.

Page 19: Classification  ( SVMs  / Kernel method)


The kernel trick

• If an algorithm for vectorial data is expressed exclusively in the form of dot products, it can be changed into an algorithm over an arbitrary kernel:
  – Simply replace each dot product by the kernel.

Page 20: Classification  ( SVMs  / Kernel method)


Kernel trick example

• Consider a kernel k defined via a mapping φ:
  – k(x, x') = φ(x)ᵀφ(x')

• It could be that φ is very difficult to compute explicitly, but k is easy to compute.

• Suppose we define a distance function between two objects as

• How do we compute this distance?

d(x, x') = ‖φ(x) − φ(x')‖

d(x, x')² = ‖φ(x) − φ(x')‖² = φ(x)ᵀφ(x) + φ(x')ᵀφ(x') − 2φ(x)ᵀφ(x')
          = k(x, x) + k(x', x') − 2k(x, x')

so  d(x, x') = √( k(x, x) + k(x', x') − 2k(x, x') )
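A tiny Python sketch of this computation: given any kernel function k, the feature-space distance comes out of three kernel evaluations, without ever forming φ. The RBF kernel used to exercise it is just an example.

```python
import numpy as np

def kernel_distance(k, x, xp):
    """Distance ||phi(x) - phi(x')|| computed from kernel evaluations only."""
    return np.sqrt(k(x, x) + k(xp, xp) - 2.0 * k(x, xp))

# Example kernel (Gaussian RBF); any positive semidefinite kernel works here
def rbf(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(kernel_distance(rbf, x, xp))
```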

Page 21: Classification  ( SVMs  / Kernel method)


Kernels and SVMs

• Recall that SVM-based classification is obtained by solving the dual problem:

max   −(1/2) Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ αᵢ
s.t.   Σᵢ yᵢ αᵢ = 0
       0 ≤ αᵢ ≤ C

Page 22: Classification  ( SVMs  / Kernel method)


Kernels and SVMs

• Applying the kernel trick

• We can try kernels that are biologically relevant.

max   −(1/2) Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ k(xᵢ, xⱼ) + Σᵢ αᵢ
s.t.   Σᵢ yᵢ αᵢ = 0
       0 ≤ αᵢ ≤ C

Page 23: Classification  ( SVMs  / Kernel method)


Examples of kernels for vectors

linear kernel:         k_L(x, x') = xᵀx'

poly kernel:           k_p(x, x') = (xᵀx' + c)ᵈ

Gaussian RBF kernel:   k_G(x, x') = exp( −‖x − x'‖² / (2σ²) )
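Minimal numpy versions of these three kernels (a sketch; the parameter defaults c, d, and σ are arbitrary).

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def poly_kernel(x, xp, c=1.0, d=3):
    return (x @ xp + c) ** d

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), poly_kernel(x, xp), rbf_kernel(x, xp))
```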

Page 24: Classification  ( SVMs  / Kernel method)


String kernel

• Consider a string s = s1, s2,…

• Define an index set I as a subset of positions (listed in increasing order).

• s[I] is the subsequence of s restricted to those positions.

• l(I) = span of I (from its first position to its last).
• W(I) = c^l(I), with c < 1
  – Weight decreases as span increases.

• For any string u of length k:

φ_u(s) = Σ_{I : s[I] = u} c^l(I)

Page 25: Classification  ( SVMs  / Kernel method)


String Kernel

• Map every string s to a vector φ(s) in a |Σ|ⁿ-dimensional space (Σ = the alphabet), indexed by all strings u of length up to n.

• The mapping is expensive, but given two strings s and t, the dot-product kernel k(s, t) = φ(s)ᵀφ(t) can be computed in O(n·|s|·|t|) time. (A brute-force sketch of the definition follows below.)

(Figure: the vector φ(s), with one coordinate φ_u(s) for each string u.)
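Below is a brute-force Python sketch of the definitions on these two slides, enumerating index sets directly. It is exponential in the string length and only meant to make φ_u and k(s, t) concrete; the efficient O(n·|s|·|t|) dynamic program mentioned above is not shown, and the weight c = 0.5 is an arbitrary choice.

```python
from itertools import combinations

def phi_u(s, u, c=0.5):
    """phi_u(s): sum of c**span(I) over index sets I with s[I] == u."""
    total = 0.0
    for I in combinations(range(len(s)), len(u)):
        if all(s[i] == ch for i, ch in zip(I, u)):
            total += c ** (I[-1] - I[0] + 1)   # span l(I)
    return total

def string_kernel(s, t, n=3, c=0.5):
    """k(s, t) = sum over strings u of length <= n of phi_u(s) * phi_u(t)."""
    total, seen = 0.0, set()
    for k in range(1, n + 1):
        for I in combinations(range(len(s)), k):
            u = ''.join(s[i] for i in I)
            if u not in seen:
                seen.add(u)
                total += phi_u(s, u, c) * phi_u(t, u, c)
    return total

print(phi_u("cat", "ct"))            # 0.5**3: one occurrence with span 3
print(string_kernel("cat", "cart"))  # shared gapped subsequences, weighted by span
```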

Page 26: Classification  ( SVMs  / Kernel method)


SVM conclusion

• SVMs are a generic scheme for classifying data with wide margins and few misclassifications.

• For data that is not easily represented as vectors, the kernel trick provides a standard recipe for classification:
  – Define a meaningful kernel, and solve using an SVM.
• Many standard kernels are available (linear, poly., RBF, string).

Page 27: Classification  ( SVMs  / Kernel method)


Classification review

• We started out by treating the classification problem as one of separating points in high-dimensional space.
• This is obvious for gene expression data, but applicable to any kind of data.
• Questions of separability and linear separation.
• Algorithms for classification:
  – Perceptron
  – Linear discriminant
  – Maximum likelihood
  – Linear programming
  – SVMs
  – Kernel methods & SVMs

Page 28: Classification  ( SVMs  / Kernel method)


Classification review

• Recall that we considered 3 problems:
  – Group samples together in an unsupervised fashion (clustering).
  – Classify based on training data (often by learning a hyperplane that separates the classes).
  – Select marker genes that are diagnostic for the class; all other genes can be discarded, leading to lower dimensionality.

Page 29: Classification  ( SVMs  / Kernel method)


Dimensionality reduction

• Many genes have highly correlated expression profiles.

• By discarding some of the genes, we can greatly reduce the dimensionality of the problem.

• There are other, more principled ways to do such dimensionality reduction.

Page 30: Classification  ( SVMs  / Kernel method)


Why is high dimensionality bad?

• With a high enough dimensionality, all points can be linearly separated.

• Recall that a point xᵢ is misclassified if
  – it is +ve, but βᵀxᵢ − β₀ ≤ 0
  – it is −ve, but βᵀxᵢ − β₀ > 0
• In the first case, add a new coordinate δᵢ for xᵢ and choose it s.t.
  – βᵀxᵢ − β₀ + δᵢ ≥ 0
• By adding a dimension for each misclassified point, we create a higher-dimensional hyperplane that perfectly separates all of the points!

Page 31: Classification  ( SVMs  / Kernel method)


Principal Components Analysis

• We get the intrinsic dimensionality of a data-set.

Page 32: Classification  ( SVMs  / Kernel method)


Principal Components Analysis

• Consider the expression values of 2 genes over 6 samples.

• Clearly, the expression of the two genes is highly correlated.

• Projecting all the genes on a single line could explain most of the data.

• This is a generalization of “discarding the gene”.

Page 33: Classification  ( SVMs  / Kernel method)


Projecting

• Consider the mean m of all points, and a unit vector β emanating from the mean.

• Algebraically, this projection onto β means that each sample x can be represented by a single value βᵀ(x − m).

(Figure: a point x, the mean m, the vector x − m, and its projection βᵀ(x − m) onto β.)

Page 34: Classification  ( SVMs  / Kernel method)


Higher dimensions

• Consider a set of 2 (in general, k) orthonormal vectors β₁, β₂, …
• Once projected, each sample x can be represented by a 2-dimensional (k-dimensional) vector:
  – ( β₁ᵀ(x − m), β₂ᵀ(x − m), … )

(Figure: the point x projected onto two orthonormal directions β₁ and β₂ through the mean m, giving coordinates β₁ᵀ(x − m) and β₂ᵀ(x − m).)

Page 35: Classification  ( SVMs  / Kernel method)


How to project

• The generic scheme allows us to project an m dimensional surface into a k dimensional one.

• How do we select the k ‘best’ dimensions?

• The strategy used by PCA is one that maximizes the variance of the projected points around the mean

Page 36: Classification  ( SVMs  / Kernel method)


PCA

• Suppose all of the data were to be reduced by projecting to a single line from the mean.

• How do we select the line β?

Page 37: Classification  ( SVMs  / Kernel method)


PCA cont’d

• Let each point xₖ map to x'ₖ = m + aₖβ. We want to minimize the error

Σₖ ‖xₖ − x'ₖ‖²

• Observation 1: Each point xₖ maps to x'ₖ = m + (βᵀ(xₖ − m)) β
  – i.e., aₖ = βᵀ(xₖ − m)

(Figure: a point xₖ and its projection x'ₖ onto the line through m.)

Page 38: Classification  ( SVMs  / Kernel method)


Proof of Observation 1

min_aₖ ‖xₖ − x'ₖ‖²
  = min_aₖ ‖xₖ − m + m − x'ₖ‖²
  = min_aₖ ‖xₖ − m‖² + ‖m − x'ₖ‖² − 2(x'ₖ − m)ᵀ(xₖ − m)
  = min_aₖ ‖xₖ − m‖² + aₖ² βᵀβ − 2 aₖ βᵀ(xₖ − m)
  = min_aₖ ‖xₖ − m‖² + aₖ² − 2 aₖ βᵀ(xₖ − m)        (using βᵀβ = 1)

Differentiating w.r.t. aₖ and setting the derivative to 0:

2aₖ − 2βᵀ(xₖ − m) = 0
aₖ = βᵀ(xₖ − m)
⇒ aₖ² = aₖ βᵀ(xₖ − m)
⇒ ‖xₖ − x'ₖ‖² = ‖xₖ − m‖² − βᵀ(xₖ − m)(xₖ − m)ᵀβ

Page 39: Classification  ( SVMs  / Kernel method)


Minimizing PCA Error

• To minimize the error, we must maximize βᵀSβ over unit vectors β.
• At the maximum, Sβ = λβ, i.e. λ is an eigenvalue of S and β the corresponding eigenvector, with βᵀSβ = λ.
• Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.

Σₖ ‖xₖ − x'ₖ‖² = C − Σₖ βᵀ(xₖ − m)(xₖ − m)ᵀβ = C − βᵀSβ

(Here C = Σₖ ‖xₖ − m‖² does not depend on β.)
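A small numpy check of this claim on made-up data: among unit vectors, the top eigenvector of S attains the largest value of βᵀSβ.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 50)) * np.array([[3.0], [0.5]])  # 2 genes x 50 samples, anisotropic
m = X.mean(axis=1, keepdims=True)
S = (X - m) @ (X - m).T                                  # scatter matrix

evals, evecs = np.linalg.eigh(S)                         # eigenvalues in ascending order
beta_best = evecs[:, -1]                                 # eigenvector of the largest eigenvalue

# Compare beta^T S beta for the top eigenvector vs. many random unit vectors
random_dirs = rng.normal(size=(2, 1000))
random_dirs /= np.linalg.norm(random_dirs, axis=0)
print(beta_best @ S @ beta_best,
      (random_dirs * (S @ random_dirs)).sum(axis=0).max())
```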

Page 40: Classification  ( SVMs  / Kernel method)


PCA steps

• X = starting matrix with n columns (the samples xⱼ) and m rows (the genes).

1. m = (1/n) Σ_{j=1..n} xⱼ
2. hᵀ = [1 1 … 1]
3. M = X − m hᵀ
4. S = M Mᵀ = Σ_{j=1..n} (xⱼ − m)(xⱼ − m)ᵀ
5. Diagonalize:  BᵀSB = Λ   (the columns of B are the eigenvectors of S)
6. Return BᵀM
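A numpy sketch that mirrors steps 1-6 above on a made-up gene-by-sample matrix; variable names follow the slide.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 10))          # 4 genes (rows) x 10 samples (columns)
n = X.shape[1]

m = X.mean(axis=1, keepdims=True)     # step 1: mean over samples
h = np.ones((n, 1))                   # step 2: all-ones vector
M = X - m @ h.T                       # step 3: center every column
S = M @ M.T                           # step 4: scatter matrix
evals, B = np.linalg.eigh(S)          # step 5: B^T S B = Lambda (ascending eigenvalues)
Y = B.T @ M                           # step 6: samples expressed in the eigenbasis

# Keep only the top k components (largest eigenvalues come last with eigh)
k = 2
Y_k = Y[-k:][::-1]
print(Y_k.shape)                      # (2, 10): each sample reduced to 2 dimensions
```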

Page 41: Classification  ( SVMs  / Kernel method)

End of Lecture



Page 43: Classification  ( SVMs  / Kernel method)


ALL-AML classification

• The two leukemias require different therapeutic regimens.

• Usually distinguished through hematopathology

• Can gene expression be used for a more definitive test?
  – 38 bone marrow samples
  – Total mRNA was hybridized against probes for 6817 genes
  – Q: Are these classes separable?

Page 44: Classification  ( SVMs  / Kernel method)


Neighborhood analysis (cont’d)

• Each gene is represented by an expression vector v(g) = (e1,e2,…,en)

• Choose an idealized expression vector as center.

• Discriminating genes will be ‘closer’ to the center (any distance measure can be used).

(Figure: expression vectors, with a discriminating gene lying close to the idealized center.)

Page 45: Classification  ( SVMs  / Kernel method)


Neighborhood analysis

• Q: Are there genes whose expression correlates with one of the two classes?

• A: For each class, create an idealized vector c.
  – Compute the number of genes N_c whose expression 'matches' the idealized expression vector.
  – Is N_c significantly larger than N_c* for a random c*?

Page 46: Classification  ( SVMs  / Kernel method)


Neighborhood test

• Distance measure used:
  – For any binary vector c, let the 1 entries denote class 1, and the 0 entries denote class 2.
  – For each gene g, compute the mean and std. dev. [μ₁(g), σ₁(g)] of its expression in class 1, and likewise [μ₂(g), σ₂(g)] in class 2.
  – P(g, c) = [μ₁(g) − μ₂(g)] / [σ₁(g) + σ₂(g)]
  – N₁(c, r) = {g | P(g, c) = r}
  – A high density of genes at some r is indicative of correlation with the class distinction.
  – The neighborhood is significant if a random center does not produce the same density.
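A numpy sketch of the P(g, c) score above on a made-up expression matrix (genes × samples) and a made-up binary class vector c.

```python
import numpy as np

rng = np.random.default_rng(3)
expr = rng.normal(size=(100, 38))            # 100 genes x 38 samples (made-up values)
c = np.array([1] * 27 + [0] * 11)            # idealized class vector: 1 = class 1, 0 = class 2

mu1, mu2 = expr[:, c == 1].mean(axis=1), expr[:, c == 0].mean(axis=1)
sd1, sd2 = expr[:, c == 1].std(axis=1), expr[:, c == 0].std(axis=1)

P = (mu1 - mu2) / (sd1 + sd2)                # signal-to-noise score per gene
print((P > 0.3).sum())                       # how many genes land in the r > 0.3 neighborhood
```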

Page 47: Classification  ( SVMs  / Kernel method)


Neighborhood analysis

• #{g | P(g, c) > 0.3} > 709 (ALL) vs. 173 by chance.
• Class prediction should be possible using micro-array expression values.

Page 48: Classification  ( SVMs  / Kernel method)


Class prediction

• Choose a fixed set of informative genes (based on their correlation with the class distinction).
  – The predictor is uniquely defined by the samples and the subset of informative genes.
• For each informative gene g, define (w_g, b_g):
  – w_g = P(g, c)   (When is this +ve?)
  – b_g = [μ₁(g) + μ₂(g)] / 2
• Given a new sample X:
  – x_g is the normalized expression value at g.
  – Vote of gene g = w_g (x_g − b_g)   (a +ve value is a vote for class 1, a negative value for class 2; see the sketch below).
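A self-contained numpy sketch of the weighted-voting predictor just described; it also computes the prediction strength defined on the next slide. All data are made up, and the vote uses the form w_g(x_g − b_g) as written above.

```python
import numpy as np

rng = np.random.default_rng(3)
expr = rng.normal(size=(100, 38))             # made-up genes x samples training matrix
c = np.array([1] * 27 + [0] * 11)             # training labels (1 = class 1, 0 = class 2)

mu1, mu2 = expr[:, c == 1].mean(axis=1), expr[:, c == 0].mean(axis=1)
sd1, sd2 = expr[:, c == 1].std(axis=1), expr[:, c == 0].std(axis=1)
P = (mu1 - mu2) / (sd1 + sd2)

informative = np.argsort(-np.abs(P))[:50]     # the 50 most informative genes
w = P[informative]                            # w_g = P(g, c)
b = (mu1[informative] + mu2[informative]) / 2 # b_g = (mu1(g) + mu2(g)) / 2

x_new = rng.normal(size=100)                  # a new sample's (made-up) expression values
votes = w * (x_new[informative] - b)          # per-gene votes

V1, V2 = votes[votes > 0].sum(), -votes[votes < 0].sum()
PS = (max(V1, V2) - min(V1, V2)) / (V1 + V2)  # prediction strength = margin of victory
print("class 1" if V1 > V2 else "class 2", PS)
```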

Page 49: Classification  ( SVMs  / Kernel method)


Prediction Strength

• PS = [V_win − V_lose] / [V_win + V_lose]
  – Reflects the margin of victory.

• A 50 gene predictor is correct 36/38 (cross-validation)

• Prediction accuracy on other samples: 100% (predictions made for 29/34 samples).

• Median PS = 0.73
• Other predictors with between 10 and 200 genes all worked well.

Page 50: Classification  ( SVMs  / Kernel method)


Performance

Page 51: Classification  ( SVMs  / Kernel method)


Differentially expressed genes?

• Do the predictive genes reveal any biology?

• The initial expectation was that most genes would be of hematopoietic lineage.

• However, many genes encode:
  – Cell-cycle progression genes
  – Chromatin remodelling
  – Transcription
  – Known oncogenes
  – Leukemia drug targets (etoposide)

Page 52: Classification  ( SVMs  / Kernel method)


Relationship between ML and the Golub predictor

• Maximum likelihood (ML) classification, when the covariance matrix is diagonal with identical variances across classes, is similar to Golub's classifier.

p(x | ωᵢ) = 1 / ( (2π)^(d/2) |Σ|^(1/2) ) · exp( −(1/2)(x − μᵢ)ᵀ Σ⁻¹ (x − μᵢ) )

gᵢ(x) = ln p(x | ωᵢ) + ln P(ωᵢ);   compute argmaxᵢ gᵢ(x)

With Σ = diag(σ₁², …, σₚ²) and equal priors, up to additive constants:

gᵢ(x) = −Σ_{j=1..p} (xⱼ − μᵢⱼ)² / (2σⱼ²)

g₁(x) − g₂(x) = Σⱼ [ (μ₁ⱼ − μ₂ⱼ) / σⱼ² ] ( xⱼ − (μ₁ⱼ + μ₂ⱼ)/2 )

Note the resemblance to the Golub vote Σ_g w_g(x_g − b_g), with w_g playing the role of (μ₁ − μ₂)/σ² and b_g = (μ₁ + μ₂)/2.
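A quick numpy check, with made-up numbers, that the difference of the two discriminants reduces to the closed form above.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 5
mu1, mu2 = rng.normal(size=p), rng.normal(size=p)   # class means (made up)
sigma2 = rng.uniform(0.5, 2.0, size=p)              # shared per-gene variances
x = rng.normal(size=p)

# Discriminants, each up to the same additive constant
g1 = -np.sum((x - mu1) ** 2 / (2 * sigma2))
g2 = -np.sum((x - mu2) ** 2 / (2 * sigma2))

# Closed form from the slide
diff = np.sum((mu1 - mu2) / sigma2 * (x - (mu1 + mu2) / 2))
print(g1 - g2, diff)                                 # the two values agree
```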

Page 53: Classification  ( SVMs  / Kernel method)


Automatic class discovery

• The classification of different cancers has emerged over years of hypothesis-driven research.

• Suppose you were given unlabeled samples of ALL/AML. Would you be able to distinguish the two classes?

Page 54: Classification  ( SVMs  / Kernel method)


Self Organizing Maps

• SOMs were applied to group the 38 samples.

• Class A1 contained 24/25 ALL and 3/13 AML samples.

• How can we validate this?
• Use the labels to do supervised classification via cross-validation.
• A 20-gene predictor gave 34 accurate predictions, 1 error, and 2 of 3 uncertain.

Page 55: Classification  ( SVMs  / Kernel method)


Comparing various error models

Page 56: Classification  ( SVMs  / Kernel method)


Conclusion