Page 1

Nonlinear Knowledge in Kernel Machines

Olvi MangasarianUW Madison & UCSD La Jolla

Edward WildUW Madison

Data Mining and Mathematical Programming WorkshopCentre de Recherches Mathématiques

Université de Montréal, Québec October 10-13, 2006

Page 2

Objectives

Primary objective: incorporate prior knowledge over completely arbitrary sets into function approximation and classification, without transforming (kernelizing) the knowledge

Secondary objective: achieve transparency of the prior knowledge for practical applications

Page 3

Graphical Example of Knowledge Incorporation

[Figure: a classifier separating + and • points, with knowledge regions defined by g(x) ≤ 0, h1(x) ≤ 0, h2(x) ≤ 0 and the separating surface K(x′, B′)u = γ. A similar approach works for approximation.]

Page 4

Outline

• Kernels in classification and function approximation
• Incorporation of prior knowledge
  – Previous approaches: require transformation of the knowledge
  – New approach: does not require any transformation of the knowledge
  – Knowledge given over completely arbitrary sets
• Fundamental tool used: theorem of the alternative for convex functions
• Experimental results
  – Four synthetic examples and two examples related to breast cancer prognosis
  – Classifiers and approximations with prior knowledge are more accurate than those without prior knowledge

Page 5

Theorem of the Alternative for Convex Functions: Generalization of the Farkas Theorem

Farkas: for A ∈ R^(k×n), b ∈ R^n, exactly one of the following must hold:
I. Ax ≤ 0, b′x > 0 has a solution x ∈ R^n, or
II. A′v = b, v ≥ 0 has a solution v ∈ R^k.

Convex generalization: let Γ ⊂ R^n be a convex set, h: Γ → R^k and θ: Γ → R convex on Γ, and h(x) < 0 for some x ∈ Γ. Then exactly one of the following must hold:
I. h(x) ≤ 0, θ(x) < 0 has a solution x ∈ Γ, or
II. ∃ v ∈ R^k, v ≥ 0: v′h(x) + θ(x) ≥ 0 ∀ x ∈ Γ.

To recover II of Farkas, take h(x) = Ax and θ(x) = −b′x:
v ≥ 0, v′Ax − b′x ≥ 0 ∀ x ∈ R^n ⟺ v ≥ 0, A′v − b = 0.

Page 6

Classification and Function Approximation

• Given a set of m points in n-dimensional real space R^n with corresponding labels
  – Labels in {+1, −1} for classification problems
  – Labels in R for approximation problems
• Points are represented by the rows of a matrix A ∈ R^(m×n)
• Corresponding labels or function values are given by a vector y
  – Classification: y ∈ {+1, −1}^m
  – Approximation: y ∈ R^m
• Find a function f(Ai) = yi based on the given data points Ai
  – f: R^n → {+1, −1} for classification
  – f: R^n → R for approximation

Page 7

Classification and Function Approximation

Problem: utilizing only the given data may result in a poor classifier or approximation
• Points may be noisy
• Sampling may be costly

Solution: use prior knowledge to improve the classifier or approximation

Page 8

Adding Prior Knowledge

• Standard approximation and classification: fit the function at the given data points, without knowledge
• Constrained approximation: satisfy inequalities at given points
• Previous approaches (FMS 2001, FMS 2003, and MSW 2004): satisfy linear inequalities over polyhedral regions
• Proposed new approach: satisfy nonlinear inequalities over arbitrary regions without kernelizing (transforming) the knowledge

Page 9

Kernel Machines

• Approximate f by a nonlinear kernel function K using parameters u ∈ R^k and γ ∈ R:
  f(x) ≈ K(x′, B′)u − γ, x ∈ R^n, K: R^n × R^(n×k) → R^k
• A kernel function is a nonlinear generalization of the scalar product
• Gaussian kernel: K(x′, B′)i = exp(−μ‖x − Bi′‖²), i = 1, …, k
• B ∈ R^(k×n) is a basis matrix
  – Usually B = A ∈ R^(m×n), the input data matrix
  – In Reduced Support Vector Machines, B is a small subset of the rows of A
  – B may be any matrix with n columns
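The kernel machine above can be sketched in a few lines of NumPy. This is our illustrative sketch, not the authors' code; the function names and the bandwidth μ = 0.5 are assumptions.

```python
import numpy as np

def gaussian_kernel(X, B, mu=0.5):
    """K(X, B')_{ij} = exp(-mu * ||X_i - B_j||^2) for rows X_i of X
    and rows B_j of the basis matrix B (mu is an assumed bandwidth)."""
    d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

def kernel_machine(x, B, u, gamma, mu=0.5):
    """Evaluate f(x) = K(x', B')u - gamma at one or more points x."""
    return gaussian_kernel(np.atleast_2d(x), B, mu) @ u - gamma
```

For example, with B a single data row and x equal to that row, the kernel value is exp(0) = 1, so f(x) = u[0] − γ.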

Page 10

Kernel Machines

• Introduce a slack variable s to measure the error in classification or approximation
• Error s in kernel approximation of the given data:
  −s ≤ K(A, B′)u − eγ − y ≤ s, where e is a vector of ones in R^m
  Function approximation: f(x) ≈ K(x′, B′)u − γ
• Error s in kernel classification of the given data:
  K(A+, B′)u − eγ + s+ ≥ e, s+ ≥ 0
  K(A−, B′)u − eγ − s− ≤ −e, s− ≥ 0
  More succinctly, let D = diag(y), the m×m matrix with the diagonal y of ±1's; then:
  D(K(A, B′)u − eγ) + s ≥ e, s ≥ 0
  Classifier: f(x) = sign(K(x′, B′)u − γ)

Page 11

Kernel Machines in Approximation or Classification

• Trade off solution complexity (‖u‖₁) against data fitting (‖s‖₁)
• At the solution, with −a ≤ u ≤ a in the linear program: e′a = ‖u‖₁ and e′s = ‖s‖₁
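The trade-off can be posed as a linear program min νe′s + e′a subject to the classification constraints, with −a ≤ u ≤ a so that e′a = ‖u‖₁ at the solution. The sketch below only assembles the LP data; the variable ordering and the function name are our assumptions, and the matrices can then be handed to any LP solver.

```python
import numpy as np

def build_classifier_lp(K, y, nu):
    """Build c, G, h for  min c'z  s.t.  G z >= h  (s, a further nonnegative),
    with variables z = [u, a, gamma, s]:
      objective     nu * e's + e'a
      data fitting  D(K u - e*gamma) + s >= e,  D = diag(y)
      1-norm bound  a - u >= 0 and a + u >= 0, so e'a = ||u||_1 at optimum."""
    m, k = K.shape
    D = np.diag(y)
    c = np.concatenate([np.zeros(k), np.ones(k), [0.0], nu * np.ones(m)])
    rows_data = np.hstack([D @ K, np.zeros((m, k)),
                           -np.asarray(y, float).reshape(-1, 1), np.eye(m)])
    rows_abs1 = np.hstack([-np.eye(k), np.eye(k), np.zeros((k, 1 + m))])
    rows_abs2 = np.hstack([np.eye(k), np.eye(k), np.zeros((k, 1 + m))])
    G = np.vstack([rows_data, rows_abs1, rows_abs2])
    h = np.concatenate([np.ones(m), np.zeros(2 * k)])
    return c, G, h
```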

Page 12

Incorporating Nonlinear Prior Knowledge: Previous Approaches

• Linear knowledge implication: Cx ≤ d ⇒ w′x − γ ≥ h′x + β
• Need to "kernelize" the knowledge from the input space to the transformed (feature) space of the kernel
• Requires the change of variables x = A′t, w = A′u:
  CA′t ≤ d ⇒ u′AA′t − γ ≥ h′A′t + β
  K(C, A′)t ≤ d ⇒ u′K(A, A′)t − γ ≥ h′A′t + β
• Use a linear theorem of the alternative in the t-space
• Connection with the original knowledge is lost
• Achieves good numerical results, but is not readily interpretable in the original space

Page 13

Nonlinear Prior Knowledge in Function Approximation: New Approach

• Start with an arbitrary nonlinear knowledge implication:
  g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ h(x), ∀ x ∈ Γ ⊂ R^n
• g and h are arbitrary functions on Γ: g: Γ → R^k, h: Γ → R
• Sufficient condition, linear in v and u:
  ∃ v ≥ 0: v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0, ∀ x ∈ Γ

Page 14

Theorem of the Alternative for Convex Functions

Assume that g(x), K(x′, B′)u − γ, and −h(x) are convex functions of x, that Γ is convex, and that ∃ x ∈ Γ: g(x) < 0. Then either:
I. g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0 has a solution x ∈ Γ, or
II. ∃ v ∈ R^k, v ≥ 0: K(x′, B′)u − γ − h(x) + v′g(x) ≥ 0 ∀ x ∈ Γ,
but never both.

If we can find v ≥ 0: K(x′, B′)u − γ − h(x) + v′g(x) ≥ 0 ∀ x ∈ Γ, then by the above theorem
g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0 has no solution x ∈ Γ, or equivalently:
g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ h(x), ∀ x ∈ Γ

Page 15

Proof

I ⇒ II (not needed for the present application): follows from a fundamental theorem of Fan, Glicksberg, and Hoffman for convex functions [1957] and the existence of an x ∈ Γ such that g(x) < 0.

II ⇒ (not I) (requires no assumptions on g, h, K, or u whatsoever): suppose not, that is, there exist x ∈ Γ and v ∈ R^k, v ≥ 0, such that:
g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0, (I)
v ≥ 0, v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0, ∀ x ∈ Γ. (II)
Then we have the contradiction:
0 ≤ v′g(x) + K(x′, B′)u − γ − h(x) < v′g(x) ≤ 0.

Page 16

Incorporating Prior Knowledge

• Linear semi-infinite program: an infinite number of constraints
• Discretize to obtain a finite linear program:
  g(xi) ≤ 0 ⇒ K(xi′, B′)u − γ ≥ h(xi), i = 1, …, k
• Slacks allow the knowledge to be satisfied inexactly
• A term added to the objective function drives the slacks to zero
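The discretized knowledge constraints can be assembled as additional LP rows. A sketch under our own variable ordering and names: at each discretization point x_i in the knowledge region, with v ≥ 0 and a nonnegative slack z_i penalized in the objective, require v′g(x_i) + K(x_i′, B′)u − γ − h(x_i) + z_i ≥ 0.

```python
import numpy as np

def knowledge_rows(K_disc, g_disc, h_disc):
    """Rows of  G z >= h  for z = [u, gamma, v, z_slack]:
    v'g(x_i) + K(x_i', B')u - gamma - h(x_i) + z_i >= 0.
    K_disc: p x k kernel rows at the p discretization points,
    g_disc: p x r values of g at those points,
    h_disc: length-p values of h (v >= 0, z >= 0 imposed separately)."""
    p, k = K_disc.shape
    G = np.hstack([K_disc,              # coefficients of u
                   -np.ones((p, 1)),    # coefficient of gamma
                   g_disc,              # coefficients of v
                   np.eye(p)])          # knowledge slacks z
    return G, np.asarray(h_disc, float)
```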

Page 17

Numerical Experience: Approximation

• Evaluate on three datasets
  – Two synthetic datasets
  – Wisconsin Prognostic Breast Cancer Database (WPBC): 194 patients × 2 histological features (tumor size and number of metastasized lymph nodes)
• Compare approximation with prior knowledge to approximation without prior knowledge
  – Prior knowledge leads to improved accuracy
  – The general prior knowledge used cannot be handled exactly by previous work (MSW 2004)
  – No kernelization needed on the knowledge set

Page 18

Two-Dimensional Hyperboloid

f(x1, x2) = x1x2

Page 19

Two-Dimensional Hyperboloid

Given exact values only at 11 points along the line x1 = x2, at x1 ∈ {−5, …, 5}

[Figure: the 11 data points plotted in the (x1, x2) plane]

Page 20

Two-Dimensional Hyperboloid Approximation without Prior Knowledge

Page 21

Two-Dimensional Hyperboloid

• Add prior (inexact) knowledge: x1x2 ≤ −1 ⇒ f(x1, x2) ≤ x1x2
• The nonlinear term x1x2 cannot be handled exactly by any previous approach
• Discretization used only 11 points along the line x1 = −x2, x1 ∈ {−5, −4, …, 4, 5}

Page 22

Two-Dimensional Hyperboloid Approximation with Prior Knowledge

Page 23

Two-Dimensional Tower Function

Page 24

Two-Dimensional Tower Function: (Misleading) Data

Given 400 points on the grid [−4, 4] × [−4, 4]; the values are min{g(x), 2}, where g(x) is the exact tower function

Page 25

Two-Dimensional Tower Function Approximation without Prior Knowledge

Page 26

Two-Dimensional Tower Function: Prior Knowledge

• Add prior knowledge: (x1, x2) ∈ [−4, 4] × [−4, 4] ⇒ f(x) = g(x)
• The prior knowledge is the exact function value, enforced at 2500 points on the grid [−4, 4] × [−4, 4] through the above implication
• The principal objective of the prior knowledge here is to overcome the poor given data

Page 27

Two-Dimensional Tower Function Approximation with Prior Knowledge

Page 28

Breast Cancer Application: Predicting Lymph Node Metastasis as a Function of Tumor Size

• The number of metastasized lymph nodes is an important prognostic indicator for breast cancer recurrence
• It is determined by surgery, in addition to the removal of the tumor, an optional procedure especially if the tumor size is small
• Wisconsin Prognostic Breast Cancer (WPBC) data: lymph node metastasis and tumor size for 194 patients
• Task: predict the number of metastasized lymph nodes given tumor size alone

Page 29

Predicting Lymph Node Metastasis

• Split the data into two portions
  – Past data: 20%, used to find prior knowledge
  – Present data: 80%, used to evaluate performance
• Past data simulates prior knowledge obtained from an expert

Page 30

Prior Knowledge for Lymph Node Metastasis as a Function of Tumor Size

• Generate prior knowledge by fitting the past data: h(x) := K(x′, B′)u − γ, where B is the matrix of the past data points
• Use density estimation to decide where to enforce the knowledge: p(x) is the empirical density of the past data
• Prior knowledge utilized on the approximating function f(x): the number of metastasized lymph nodes is greater than the value predicted on the past data, with a tolerance of 0.01:
  p(x) ≥ 0.1 ⇒ f(x) ≥ h(x) − 0.01
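The density test p(x) ≥ 0.1 can be implemented with a simple one-dimensional kernel density estimate. This is our sketch of the idea only; the Gaussian KDE form, the bandwidth, and the helper names are assumptions, not the authors' procedure.

```python
import numpy as np

def kde(x, past, bandwidth=1.0):
    """Gaussian kernel density estimate p(x) of one-dimensional past data."""
    z = (np.asarray(x, float)[None, :] - np.asarray(past, float)[:, None]) / bandwidth
    return np.exp(-0.5 * z * z).mean(axis=0) / (bandwidth * np.sqrt(2.0 * np.pi))

def knowledge_points(grid, past, threshold=0.1):
    """Keep only the grid points where p(x) >= threshold, i.e. where the
    past data is dense enough to enforce the knowledge implication."""
    grid = np.asarray(grid, float)
    return grid[kde(grid, past) >= threshold]
```

For example, with past data concentrated near one tumor size, only grid points near that size pass the density threshold and receive knowledge constraints.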

Page 31

Predicting Lymph Node Metastasis: Results

RMSE: root-mean-squared error; LOO: leave-one-out error
Improvement due to knowledge: 14.9%

Approximation                                        | Error
Prior knowledge h(x), based on past data (20%)       | 6.12 RMSE
f(x) without knowledge, based on present data (80%)  | 5.92 LOO
f(x) with knowledge, based on present data (80%)     | 5.04 LOO

Page 32

Incorporating Prior Knowledge in Classification (Very Similar)

• Implication for a positive region:
  g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ 1, ∀ x ∈ Γ ⊂ R^n
• Sufficient linear condition:
  K(x′, B′)u − γ − 1 + v′g(x) ≥ 0, v ≥ 0, ∀ x ∈ Γ
• Similar implication for negative regions
• Add the discretized constraints to the linear program

Page 33

Incorporating Prior Knowledge: Classification

Page 34

Numerical Experience: Classification

• Evaluate on three datasets
  – Two synthetic datasets
  – Wisconsin Prognostic Breast Cancer Database
• Compare the classifier with prior knowledge to one without prior knowledge
  – Prior knowledge leads to improved accuracy
  – The general prior knowledge used cannot be handled exactly by previous work (FMS 2001, FMS 2003)
  – No kernelization needed on the knowledge set

Page 35

Checkerboard Dataset: Black & White Points in R²

Page 36

Checkerboard Classifier Without Knowledge, Using 16 Center Points

Prior Knowledge for the 16-Point Checkerboard Classifier

Checkerboard Classifier With Knowledge, Using 16 Center Points

Page 37

Spiral Dataset: 194 Points in R²

Page 38

Spiral Classifier Without Knowledge, Based on 100 Labeled Points
• Labels given only at the 100 correctly classified circled points
• Note the many incorrectly classified +'s

Prior Knowledge Function for the Spiral
• White ⇒ +, Gray ⇒ •
• Prior knowledge imposed at 291 points in each region

Spiral Classifier With Knowledge
• No misclassified points

Page 39

Predicting Breast Cancer Recurrence Within 24 Months

• Wisconsin Prognostic Breast Cancer (WPBC) dataset
  – 155 patients monitored for recurrence within 24 months
  – 30 cytological features
  – 2 histological features: number of metastasized lymph nodes and tumor size
• Predict whether or not a patient remains cancer free after 24 months
  – 82% of patients remain disease free
  – 86% accuracy (Bennett, 1992) was the best previously attained
• Prior knowledge allows us to incorporate additional information to improve accuracy

Page 40

Generating WPBC Prior Knowledge

• Gray regions indicate areas where g(x) ≤ 0
• Simulate an oncological surgeon's advice about recurrence
• Knowledge imposed at the dataset points inside the given regions

[Figure: tumor size in centimeters (horizontal axis) vs. number of metastasized lymph nodes (vertical axis); + = recur within 24 months, • = cancer free within 24 months]

Page 41

WPBC Results

49.7% improvement due to knowledge; 35.7% improvement over the best previous predictor

Classifier         | Misclassification Rate
Without Knowledge  | 18.1%
With Knowledge     | 9.0%

Page 42

Conclusion

• General nonlinear prior knowledge incorporated into kernel classification and approximation
  – Implemented as linear inequalities in a linear programming problem
  – Knowledge appears transparently
• Demonstrated effectiveness
  – Four synthetic examples
  – Two real-world problems from breast cancer prognosis
• Future work
  – Prior knowledge with more general implications
  – User-friendly interface for knowledge specification
  – Generate prior knowledge for real-world datasets

Page 43

Website Links to Talk & Papers

http://www.cs.wisc.edu/~olvi
http://www.cs.wisc.edu/~wildt