Page 1

Nonlinear Knowledge in Kernel Machines

Olvi MangasarianUW Madison & UCSD La Jolla

Edward WildUW Madison

Data Mining and Mathematical Programming WorkshopCentre de Recherches Mathématiques

Université de Montréal, Québec October 10-13, 2006

Page 2

Objectives

Primary objective: incorporate prior knowledge over completely arbitrary sets into function approximation and classification, without transforming (kernelizing) the knowledge

Secondary objective: achieve transparency of the prior knowledge for practical applications

Page 3

Graphical Example of Knowledge Incorporation

[Figure: a classifier separating + and • points, with knowledge regions defined by g(x) ≤ 0, h1(x) ≤ 0, h2(x) ≤ 0 and the separating surface K(x′, B′)u = γ. A similar approach works for approximation.]

Page 4

Outline

• Kernels in classification and function approximation
• Incorporation of prior knowledge
  – Previous approaches: require transformation of the knowledge
  – New approach: does not require any transformation of the knowledge
  – Knowledge given over completely arbitrary sets
• Fundamental tool used: theorem of the alternative for convex functions
• Experimental results
  – Four synthetic examples and two examples related to breast cancer prognosis
  – Classifiers and approximations with prior knowledge are more accurate than those without prior knowledge

Page 5

Theorem of the Alternative for Convex Functions: Generalization of the Farkas Theorem

Farkas: for A ∈ R^(k×n), b ∈ R^n, exactly one of the following must hold:
I. Ax ≤ 0, b′x > 0 has a solution x ∈ R^n, or
II. A′v = b, v ≥ 0 has a solution v ∈ R^k.

Convex generalization: let Γ ⊂ R^n be a convex set, h: Γ → R^k and θ: Γ → R convex on Γ, and h(x) < 0 for some x ∈ Γ. Then exactly one of the following must hold:
I. h(x) ≤ 0, θ(x) < 0 has a solution x ∈ Γ, or
II. ∃ v ∈ R^k, v ≥ 0: v′h(x) + θ(x) ≥ 0 ∀ x ∈ Γ.

To recover II of Farkas, take h(x) = Ax and θ(x) = −b′x:
v ≥ 0, v′Ax − b′x ≥ 0 ∀ x ∈ R^n ⟺ v ≥ 0, A′v − b = 0.

Page 6

Classification and Function Approximation

• Given a set of m points in n-dimensional real space R^n with corresponding labels
  – Labels in {+1, −1} for classification problems
  – Labels in R for approximation problems
• Points are represented by the rows of a matrix A ∈ R^(m×n)
• Corresponding labels or function values are given by a vector y
  – Classification: y ∈ {+1, −1}^m
  – Approximation: y ∈ R^m
• Find a function f(Ai) = yi based on the given data points Ai
  – f: R^n → {+1, −1} for classification
  – f: R^n → R for approximation

Page 7

Classification and Function Approximation

Problem: utilizing only the given data may result in a poor classifier or approximation
• Points may be noisy
• Sampling may be costly

Solution: use prior knowledge to improve the classifier or approximation

Page 8

Adding Prior Knowledge

• Standard approximation and classification: fit the function at the given data points, without knowledge
• Constrained approximation: satisfy inequalities at given points
• Previous approaches (FMS 2001, FMS 2003, and MSW 2004): satisfy linear inequalities over polyhedral regions
• Proposed new approach: satisfy nonlinear inequalities over arbitrary regions without kernelizing (transforming) the knowledge

Page 9

Kernel Machines

• Approximate f by a nonlinear kernel function K using parameters u ∈ R^k and γ ∈ R:
  f(x) ≈ K(x′, B′)u − γ, x ∈ R^n, K: R^n × R^(n×k) → R^k
• A kernel function is a nonlinear generalization of the scalar product
• Gaussian kernel: K(x′, B′)i = exp(−μ‖x − Bi′‖²), i = 1, …, k
• B ∈ R^(k×n) is a basis matrix
  – Usually B = A ∈ R^(m×n), the input data matrix
  – In Reduced Support Vector Machines, B is a small subset of the rows of A
  – B may be any matrix with n columns
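The kernel machine above can be sketched in a few lines of NumPy. This is our illustrative sketch, not the authors' code; the function names and the bandwidth μ = 0.5 are assumptions.

```python
import numpy as np

def gaussian_kernel(X, B, mu=0.5):
    """K(X, B')_{ij} = exp(-mu * ||X_i - B_j||^2) for rows X_i of X
    and rows B_j of the basis matrix B (mu is an assumed bandwidth)."""
    d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

def kernel_machine(x, B, u, gamma, mu=0.5):
    """Evaluate f(x) = K(x', B')u - gamma at one or more points x."""
    return gaussian_kernel(np.atleast_2d(x), B, mu) @ u - gamma
```

For example, with B a single data row and x equal to that row, the kernel value is exp(0) = 1, so f(x) = u[0] − γ.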

Page 10

Kernel Machines

• Introduce a slack variable s to measure the error in classification or approximation
• Error s in kernel approximation of the given data:
  −s ≤ K(A, B′)u − eγ − y ≤ s, where e is a vector of ones in R^m
  Function approximation: f(x) ≈ K(x′, B′)u − γ
• Error s in kernel classification of the given data:
  K(A+, B′)u − eγ + s+ ≥ e, s+ ≥ 0
  K(A−, B′)u − eγ − s− ≤ −e, s− ≥ 0
  More succinctly, let D = diag(y), the m×m matrix with the diagonal y of ±1's; then:
  D(K(A, B′)u − eγ) + s ≥ e, s ≥ 0
  Classifier: f(x) = sign(K(x′, B′)u − γ)

Page 11

Kernel Machines in Approximation or Classification

• Trade off solution complexity (‖u‖₁) against data fitting (‖s‖₁)
• At the solution, with −a ≤ u ≤ a in the linear program: e′a = ‖u‖₁ and e′s = ‖s‖₁
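The trade-off can be posed as a linear program min νe′s + e′a subject to the classification constraints, with −a ≤ u ≤ a so that e′a = ‖u‖₁ at the solution. The sketch below only assembles the LP data; the variable ordering and the function name are our assumptions, and the matrices can then be handed to any LP solver.

```python
import numpy as np

def build_classifier_lp(K, y, nu):
    """Build c, G, h for  min c'z  s.t.  G z >= h  (s, a further nonnegative),
    with variables z = [u, a, gamma, s]:
      objective     nu * e's + e'a
      data fitting  D(K u - e*gamma) + s >= e,  D = diag(y)
      1-norm bound  a - u >= 0 and a + u >= 0, so e'a = ||u||_1 at optimum."""
    m, k = K.shape
    D = np.diag(y)
    c = np.concatenate([np.zeros(k), np.ones(k), [0.0], nu * np.ones(m)])
    rows_data = np.hstack([D @ K, np.zeros((m, k)),
                           -np.asarray(y, float).reshape(-1, 1), np.eye(m)])
    rows_abs1 = np.hstack([-np.eye(k), np.eye(k), np.zeros((k, 1 + m))])
    rows_abs2 = np.hstack([np.eye(k), np.eye(k), np.zeros((k, 1 + m))])
    G = np.vstack([rows_data, rows_abs1, rows_abs2])
    h = np.concatenate([np.ones(m), np.zeros(2 * k)])
    return c, G, h
```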

Page 12

Incorporating Nonlinear Prior Knowledge: Previous Approaches

• Linear knowledge implication: Cx ≤ d ⇒ w′x − γ ≥ h′x + β
• Need to "kernelize" the knowledge from the input space to the transformed (feature) space of the kernel
• Requires the change of variables x = A′t, w = A′u:
  CA′t ≤ d ⇒ u′AA′t − γ ≥ h′A′t + β
  K(C, A′)t ≤ d ⇒ u′K(A, A′)t − γ ≥ h′A′t + β
• Use a linear theorem of the alternative in the t-space
• Connection with the original knowledge is lost
• Achieves good numerical results, but is not readily interpretable in the original space

Page 13

Nonlinear Prior Knowledge in Function Approximation: New Approach

• Start with an arbitrary nonlinear knowledge implication:
  g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ h(x), ∀ x ∈ Γ ⊂ R^n
• g and h are arbitrary functions on Γ: g: Γ → R^k, h: Γ → R
• Sufficient condition, linear in v and u:
  ∃ v ≥ 0: v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0, ∀ x ∈ Γ

Page 14

Theorem of the Alternative for Convex Functions

Assume that g(x), K(x′, B′)u − γ, and −h(x) are convex functions of x, that Γ is convex, and that ∃ x ∈ Γ: g(x) < 0. Then either:
I. g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0 has a solution x ∈ Γ, or
II. ∃ v ∈ R^k, v ≥ 0: K(x′, B′)u − γ − h(x) + v′g(x) ≥ 0 ∀ x ∈ Γ,
but never both.

If we can find v ≥ 0: K(x′, B′)u − γ − h(x) + v′g(x) ≥ 0 ∀ x ∈ Γ, then by the above theorem
g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0 has no solution x ∈ Γ, or equivalently:
g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ h(x), ∀ x ∈ Γ

Page 15

Proof

I ⇒ II (not needed for the present application): follows from a fundamental theorem of Fan, Glicksberg, and Hoffman for convex functions [1957] and the existence of an x ∈ Γ such that g(x) < 0.

II ⇒ (not I) (requires no assumptions on g, h, K, or u whatsoever): suppose not, that is, there exist x ∈ Γ and v ∈ R^k, v ≥ 0, such that:
g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0, (I)
v ≥ 0, v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0, ∀ x ∈ Γ. (II)
Then we have the contradiction:
0 ≤ v′g(x) + K(x′, B′)u − γ − h(x) < v′g(x) ≤ 0.

Page 16

Incorporating Prior Knowledge

• Linear semi-infinite program: an infinite number of constraints
• Discretize to obtain a finite linear program:
  g(xi) ≤ 0 ⇒ K(xi′, B′)u − γ ≥ h(xi), i = 1, …, k
• Slacks allow the knowledge to be satisfied inexactly
• A term added to the objective function drives the slacks to zero
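The discretized knowledge constraints can be assembled as additional LP rows. A sketch under our own variable ordering and names: at each discretization point x_i in the knowledge region, with v ≥ 0 and a nonnegative slack z_i penalized in the objective, require v′g(x_i) + K(x_i′, B′)u − γ − h(x_i) + z_i ≥ 0.

```python
import numpy as np

def knowledge_rows(K_disc, g_disc, h_disc):
    """Rows of  G z >= h  for z = [u, gamma, v, z_slack]:
    v'g(x_i) + K(x_i', B')u - gamma - h(x_i) + z_i >= 0.
    K_disc: p x k kernel rows at the p discretization points,
    g_disc: p x r values of g at those points,
    h_disc: length-p values of h (v >= 0, z >= 0 imposed separately)."""
    p, k = K_disc.shape
    G = np.hstack([K_disc,              # coefficients of u
                   -np.ones((p, 1)),    # coefficient of gamma
                   g_disc,              # coefficients of v
                   np.eye(p)])          # knowledge slacks z
    return G, np.asarray(h_disc, float)
```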

Page 17

Numerical Experience: Approximation

• Evaluate on three datasets
  – Two synthetic datasets
  – Wisconsin Prognostic Breast Cancer Database (WPBC): 194 patients × 2 histological features (tumor size and number of metastasized lymph nodes)
• Compare approximation with prior knowledge to approximation without prior knowledge
  – Prior knowledge leads to improved accuracy
  – The general prior knowledge used cannot be handled exactly by previous work (MSW 2004)
  – No kernelization needed on the knowledge set

Page 18

Two-Dimensional Hyperboloid

f(x1, x2) = x1x2

Page 19

Two-Dimensional Hyperboloid

Given exact values only at 11 points along the line x1 = x2, at x1 ∈ {−5, …, 5}

[Figure: the 11 data points plotted in the (x1, x2) plane]

Page 20

Two-Dimensional Hyperboloid Approximation without Prior Knowledge

Page 21

Two-Dimensional Hyperboloid

• Add prior (inexact) knowledge: x1x2 ≤ −1 ⇒ f(x1, x2) ≤ x1x2
• The nonlinear term x1x2 cannot be handled exactly by any previous approach
• Discretization used only 11 points along the line x1 = −x2, x1 ∈ {−5, −4, …, 4, 5}

Page 22

Two-Dimensional Hyperboloid Approximation with Prior Knowledge

Page 23

Two-Dimensional Tower Function

Page 24

Two-Dimensional Tower Function: (Misleading) Data

Given 400 points on the grid [−4, 4] × [−4, 4]; the values are min{g(x), 2}, where g(x) is the exact tower function

Page 25

Two-Dimensional Tower Function Approximation without Prior Knowledge

Page 26

Two-Dimensional Tower Function: Prior Knowledge

• Add prior knowledge: (x1, x2) ∈ [−4, 4] × [−4, 4] ⇒ f(x) = g(x)
• The prior knowledge is the exact function value, enforced at 2500 points on the grid [−4, 4] × [−4, 4] through the above implication
• The principal objective of the prior knowledge here is to overcome the poor given data

Page 27

Two-Dimensional Tower Function Approximation with Prior Knowledge

Page 28

Breast Cancer Application: Predicting Lymph Node Metastasis as a Function of Tumor Size

• The number of metastasized lymph nodes is an important prognostic indicator for breast cancer recurrence
• It is determined by surgery, in addition to the removal of the tumor, an optional procedure especially if the tumor size is small
• Wisconsin Prognostic Breast Cancer (WPBC) data: lymph node metastasis and tumor size for 194 patients
• Task: predict the number of metastasized lymph nodes given tumor size alone

Page 29

Predicting Lymph Node Metastasis

• Split the data into two portions
  – Past data: 20%, used to find prior knowledge
  – Present data: 80%, used to evaluate performance
• Past data simulates prior knowledge obtained from an expert

Page 30

Prior Knowledge for Lymph Node Metastasis as a Function of Tumor Size

• Generate prior knowledge by fitting the past data: h(x) := K(x′, B′)u − γ, where B is the matrix of the past data points
• Use density estimation to decide where to enforce the knowledge: p(x) is the empirical density of the past data
• Prior knowledge utilized on the approximating function f(x): the number of metastasized lymph nodes is greater than the value predicted on the past data, with a tolerance of 0.01:
  p(x) ≥ 0.1 ⇒ f(x) ≥ h(x) − 0.01
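The density test p(x) ≥ 0.1 can be implemented with a simple one-dimensional kernel density estimate. This is our sketch of the idea only; the Gaussian KDE form, the bandwidth, and the helper names are assumptions, not the authors' procedure.

```python
import numpy as np

def kde(x, past, bandwidth=1.0):
    """Gaussian kernel density estimate p(x) of one-dimensional past data."""
    z = (np.asarray(x, float)[None, :] - np.asarray(past, float)[:, None]) / bandwidth
    return np.exp(-0.5 * z * z).mean(axis=0) / (bandwidth * np.sqrt(2.0 * np.pi))

def knowledge_points(grid, past, threshold=0.1):
    """Keep only the grid points where p(x) >= threshold, i.e. where the
    past data is dense enough to enforce the knowledge implication."""
    grid = np.asarray(grid, float)
    return grid[kde(grid, past) >= threshold]
```

For example, with past data concentrated near one tumor size, only grid points near that size pass the density threshold and receive knowledge constraints.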

Page 31

Predicting Lymph Node Metastasis: Results

RMSE: root-mean-squared error; LOO: leave-one-out error
Improvement due to knowledge: 14.9%

Approximation                                        | Error
Prior knowledge h(x), based on past data (20%)       | 6.12 RMSE
f(x) without knowledge, based on present data (80%)  | 5.92 LOO
f(x) with knowledge, based on present data (80%)     | 5.04 LOO

Page 32

Incorporating Prior Knowledge in Classification (Very Similar)

• Implication for a positive region:
  g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ 1, ∀ x ∈ Γ ⊂ R^n
• Sufficient linear condition:
  K(x′, B′)u − γ − 1 + v′g(x) ≥ 0, v ≥ 0, ∀ x ∈ Γ
• Similar implication for negative regions
• Add the discretized constraints to the linear program

Page 33

Incorporating Prior Knowledge: Classification

Page 34

Numerical Experience: Classification

• Evaluate on three datasets
  – Two synthetic datasets
  – Wisconsin Prognostic Breast Cancer Database
• Compare the classifier with prior knowledge to one without prior knowledge
  – Prior knowledge leads to improved accuracy
  – The general prior knowledge used cannot be handled exactly by previous work (FMS 2001, FMS 2003)
  – No kernelization needed on the knowledge set

Page 35

Checkerboard Dataset: Black & White Points in R²

Page 36

Checkerboard Classifier Without Knowledge, Using 16 Center Points

Prior Knowledge for the 16-Point Checkerboard Classifier

Checkerboard Classifier With Knowledge, Using 16 Center Points

Page 37

Spiral Dataset: 194 Points in R²

Page 38

Spiral Classifier Without Knowledge, Based on 100 Labeled Points
• Labels given only at the 100 correctly classified circled points
• Note the many incorrectly classified +'s

Prior Knowledge Function for the Spiral
• White ⇒ +, Gray ⇒ •
• Prior knowledge imposed at 291 points in each region

Spiral Classifier With Knowledge
• No misclassified points

Page 39

Predicting Breast Cancer Recurrence Within 24 Months

• Wisconsin Prognostic Breast Cancer (WPBC) dataset
  – 155 patients monitored for recurrence within 24 months
  – 30 cytological features
  – 2 histological features: number of metastasized lymph nodes and tumor size
• Predict whether or not a patient remains cancer free after 24 months
  – 82% of patients remain disease free
  – 86% accuracy (Bennett, 1992) was the best previously attained
• Prior knowledge allows us to incorporate additional information to improve accuracy

Page 40

Generating WPBC Prior Knowledge

• Gray regions indicate areas where g(x) ≤ 0
• Simulate an oncological surgeon's advice about recurrence
• Knowledge imposed at the dataset points inside the given regions

[Figure: tumor size in centimeters (horizontal axis) vs. number of metastasized lymph nodes (vertical axis); + = recur within 24 months, • = cancer free within 24 months]

Page 41

WPBC Results

49.7% improvement due to knowledge; 35.7% improvement over the best previous predictor

Classifier         | Misclassification Rate
Without Knowledge  | 18.1%
With Knowledge     | 9.0%

Page 42

Conclusion

• General nonlinear prior knowledge incorporated into kernel classification and approximation
  – Implemented as linear inequalities in a linear programming problem
  – Knowledge appears transparently
• Demonstrated effectiveness
  – Four synthetic examples
  – Two real-world problems from breast cancer prognosis
• Future work
  – Prior knowledge with more general implications
  – User-friendly interface for knowledge specification
  – Generate prior knowledge for real-world datasets

Page 43

Website Links to Talk & Papers

http://www.cs.wisc.edu/~olvi
http://www.cs.wisc.edu/~wildt