Support Vector classifiers for Land Cover Classification

Mahesh Pal, National Institute of Technology, Kurukshetra, India

Paul M. Mather, School of Geography, University of Nottingham, UK

1) What is a support vector classifier?

2) Data used

3) Results and comparison with NN and ML classifiers

4) Conclusions

Support vector classifiers (SVC)

• Based on statistical learning theory. Minimises the probability of misclassifying unseen data drawn randomly from the same distribution (structural risk minimisation), rather than minimising the misclassification error on the training data (empirical risk minimisation).

• In addition to the classifier parameters, this classifier provides a set of data points (called support vectors) that contain all the information about the classification problem.

• Nonparametric in nature.

Empirical and structural risk minimisation

In the case of two-class pattern recognition, the task of learning from examples can be formulated in the following way: given a set of decision functions

$$\{f_\alpha(\mathbf{x}) : \alpha \in \Lambda\}, \qquad f_\alpha : \mathbb{R}^N \rightarrow \{-1, 1\}$$

where $\Lambda$ is a set of abstract parameters (Osuna et al., 1997), and a set of examples

$$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_k, y_k), \qquad \mathbf{x}_i \in \mathbb{R}^N,\ y_i \in \{-1, 1\}$$

the aim is to find a function $f_\alpha$ that provides the smallest possible value for the average error committed on independent examples randomly drawn from the same distribution P(x, y), called the expected risk:

$$R(\alpha) = \int \left| f_\alpha(\mathbf{x}) - y \right| P(\mathbf{x}, y)\, d\mathbf{x}\, dy$$

As P(x, y) is unknown, it is not possible to calculate R(α); instead we calculate the empirical risk, which is defined as:

$$R_{emp}(\alpha) = \frac{1}{k} \sum_{i=1}^{k} \left| f_\alpha(\mathbf{x}_i) - y_i \right|$$

The functions $f_\alpha$ are usually called hypotheses, and the set $\{f_\alpha(\mathbf{x}) : \alpha \in \Lambda\}$ is called the hypothesis space, denoted by H (where $f_\alpha$ can be a radial basis network, a polynomial function, etc.).
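The empirical risk is simply an average over the training examples, which a few lines of Python can illustrate. This is a minimal sketch: the threshold rule `decision` and the four labelled samples are hypothetical and are not taken from the study.

```python
# Empirical risk R_emp = (1/k) * sum_i |f(x_i) - y_i| for labels in {-1, +1}.
# The decision function and the samples are hypothetical illustrations.

def decision(x, threshold=0.0):
    """Toy hypothesis: +1 if the first feature exceeds the threshold, else -1."""
    return 1 if x[0] > threshold else -1

def empirical_risk(f, samples):
    """Average of |f(x_i) - y_i| over the k training examples."""
    return sum(abs(f(x) - y) for x, y in samples) / len(samples)

samples = [
    ((0.5, 1.0), 1),
    ((1.5, -0.2), 1),
    ((-0.7, 0.3), -1),
    ((0.2, 0.9), -1),   # this pattern is misclassified by the toy rule
]

risk = empirical_risk(decision, samples)
print(risk)  # every misclassified pattern contributes |f - y| = 2 to the sum
```

With labels coded as ±1, a correct decision contributes 0 and a wrong one contributes 2, so the value is twice the misclassification rate.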

If the number of training patterns (k) used to train the classifier is limited, a low error value on the training set does not necessarily imply that the classifier has a high generalisation ability, and the empirical risk minimisation principle will be non-consistent.

Vapnik and Chervonenkis (1971, 1991) showed that a necessary and sufficient condition for consistency of the empirical risk minimisation principle is the finiteness of the VC-dimension h of the hypothesis space H. Vapnik and Chervonenkis (1971) provide a bound on the deviation of the empirical risk from the expected risk, which holds with probability 1 − η:

$$R(\alpha) \le R_{emp}(\alpha) + \Phi\left(\frac{h}{k}, \frac{\log \eta}{k}\right)$$

where the second term is called the VC confidence.

In order to implement the structural risk minimisation (SRM) principle, a nested structure of hypothesis spaces is introduced by dividing the entire class of functions into nested subsets

$$H_1 \subset H_2 \subset \ldots \subset H_n \subset \ldots$$

with the property that h(n) ≤ h(n + 1), where h(n) is the VC-dimension of the set $H_n$. This can be achieved by training a set of machines, one for each subset, and choosing the trained machine whose sum of empirical risk and VC confidence is minimal (Osuna et al., 1997).
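The selection step of structural risk minimisation can be sketched in Python. The slides do not give the explicit form of the VC confidence, so the sketch assumes the commonly quoted expression sqrt((h(ln(2k/h) + 1) − ln(η/4)) / k), which is stated for risk measured as a misclassification rate. The three candidate subsets and their empirical risks are made-up numbers; k = 2700 is used only as a sample size.

```python
import math

def vc_confidence(h, k, eta=0.05):
    """Assumed form of the VC confidence: sqrt((h*(ln(2k/h)+1) - ln(eta/4)) / k)."""
    return math.sqrt((h * (math.log(2 * k / h) + 1) - math.log(eta / 4)) / k)

def select_structure(candidates, k, eta=0.05):
    """Pick the nested subset minimising empirical risk + VC confidence."""
    return min(candidates, key=lambda c: c["emp_risk"] + vc_confidence(c["h"], k, eta))

# Hypothetical nested subsets H1, H2, H3: the VC-dimension h grows while
# the empirical risk (misclassification rate) shrinks.
candidates = [
    {"name": "H1", "h": 3, "emp_risk": 0.30},
    {"name": "H2", "h": 10, "emp_risk": 0.12},
    {"name": "H3", "h": 200, "emp_risk": 0.02},
]

best = select_structure(candidates, k=2700)
print(best["name"])
```

The richest subset has the lowest training error but the largest confidence term, so the minimum of the sum falls on an intermediate subset in this toy setting.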

Linearly separable class

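For binary classification with data $\mathbf{x}_i$ (i = 1, …, k) and labels $y_i = \pm 1$, the training patterns are linearly separable if:

$$\mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \quad \text{for all } y_i = 1$$

$$\mathbf{w} \cdot \mathbf{x}_i + b \le -1 \quad \text{for all } y_i = -1$$

where w determines the orientation of the discriminating plane and b determines its offset from the origin (the weight and bias in NN terminology).

The classification function (hypothesis space) for this case is:

$$f_{\mathbf{w}, b}(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$$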
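The two separability conditions combine into the single test y_i (w · x_i + b) ≥ 1, which is easy to check numerically. The following is a minimal Python sketch; the four 2-D training patterns and the values of w and b are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def classify(w, b, x):
    """Decision function sign(w . x + b), returned as +1 or -1."""
    return 1 if dot(w, x) + b >= 0 else -1

def satisfies_constraints(w, b, data):
    """True if y_i * (w . x_i + b) >= 1 for every training pattern."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in data)

# Hypothetical 2-D training patterns with labels +1 / -1.
data = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((-2.0, -1.0), -1), ((-1.0, -3.0), -1)]

w, b = (0.5, 0.5), 0.0
print(satisfies_constraints(w, b, data))           # both conditions hold
print(satisfies_constraints((0.1, 0.1), b, data))  # same orientation, scaled too small
print(classify(w, b, (1.0, 0.5)))
```

The second call shows that the conditions fix the scale of w as well as its orientation: a plane with the same direction but a smaller norm still separates the classes, yet fails the ≥ 1 requirement.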

• The approach to designing a support vector classifier is to maximise the margin between two supporting planes.

• A plane supports a class if all points in that class are on one side of that plane.

• These two parallel planes are pushed apart until they bump into a small number of data points from each class.

• These data points are called the support vectors.

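SVCs are designed to maximise the margin between the supporting planes. The margin is defined as $2 / \|\mathbf{w}\|$.

Maximising the margin is equivalent to solving the following quadratic program: minimise

$$\frac{\|\mathbf{w}\|^2}{2}$$

subject to

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0, \qquad i = 1, \ldots, k$$

This is solved by quadratic programming optimisation techniques using Lagrange multipliers $\lambda_i \ge 0$, and the optimisation problem becomes

$$L(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{k} \lambda_i y_i (\mathbf{w} \cdot \mathbf{x}_i + b) + \sum_{i=1}^{k} \lambda_i \qquad (1)$$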

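Cont.

• Eq. (1) can be minimised with respect to w and b, which gives $\mathbf{w} = \sum_i \lambda_i y_i \mathbf{x}_i$ and $\sum_i \lambda_i y_i = 0$, and the optimisation problem becomes the maximisation of

$$L(\boldsymbol{\lambda}) = \sum_{i} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \qquad (2)$$

for $\lambda_i \ge 0$, and the decision rule for two classes can be written as:

$$f(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=1}^{k} \lambda_i y_i (\mathbf{x} \cdot \mathbf{x}_i) + b\right)$$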
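A two-point problem is small enough to solve the dual by hand and check the pieces against each other. The sketch below uses hypothetical toy data, one pattern per class. For this symmetric case the multipliers work out to λ₁ = λ₂ = 1/4, and the code recovers w, b, the margin and the decision rule from them.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Two training patterns, one per class (hypothetical toy data).
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]

# Hand-derived dual optimum for this symmetric problem; both points
# are support vectors, and sum_i lambda_i * y_i = 0 holds.
lam = [0.25, 0.25]

# Minimising the Lagrangian over w gives w = sum_i lambda_i * y_i * x_i.
w = tuple(sum(l * yi * xi[d] for l, yi, xi in zip(lam, y, X)) for d in range(2))

# b follows from any support vector s: y_s * (w . x_s + b) = 1.
b = y[0] - dot(w, X[0])

def dual_objective(lam):
    """sum_i lambda_i - 1/2 * sum_ij lambda_i lambda_j y_i y_j (x_i . x_j)."""
    n = len(lam)
    quad = sum(lam[i] * lam[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(lam) - 0.5 * quad

def decide(x):
    """Decision rule written purely in terms of the support vectors."""
    s = sum(l * yi * dot(xi, x) for l, yi, xi in zip(lam, y, X)) + b
    return 1 if s >= 0 else -1

margin = 2 / math.sqrt(dot(w, w))
print(w, b, margin, dual_objective(lam))
```

At the optimum the dual objective equals ‖w‖²/2, and the margin equals the distance between the two points, as expected when each supporting plane touches exactly one pattern.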

Non-separable data

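• Cortes and Vapnik (1995) suggested relaxing the restriction that the training vectors of one class must lie on one side of the optimal hyperplane, by introducing positive slack variables $\xi_i$ and writing the equation of the separating planes as:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \ge 0 \quad \text{for } \xi_i \ge 0$$

and writing the optimisation problem as:

$$\min_{\mathbf{w}, b, \xi_1, \ldots, \xi_k} \left[ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{k} \xi_i \right]$$

with $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \ge 0$ and $\xi_i \ge 0$.

C is a positive constant such that $0 < C < \infty$.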

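Cont.

• C is chosen by the user; a large value of C means a higher penalty on errors.

The final equation for non-separable data is given below, where $\mu_i$ are Lagrange multipliers that enforce positivity of the $\xi_i$, and equation (1) becomes

$$L(\mathbf{w}, b, \boldsymbol{\lambda}, \boldsymbol{\xi}, \boldsymbol{\mu}) = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i} \xi_i - \sum_{i} \lambda_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \right] - \sum_{i} \mu_i \xi_i$$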
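The role of the slack variables and of C can be seen by evaluating the soft-margin objective for a fixed plane. This is a minimal Python sketch with hypothetical data, in which one pattern falls inside the margin and one lies on the wrong side of the plane.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, data):
    """Smallest xi_i >= 0 satisfying y_i * (w . x_i + b) - 1 + xi_i >= 0."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in data]

def soft_margin_objective(w, b, data, C):
    """1/2 * ||w||^2 + C * sum_i xi_i for a given plane (w, b)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, data))

# Hypothetical patterns: the second lies inside the margin and the
# fourth lies on the wrong side of the plane.
data = [((2.0, 2.0), 1), ((0.5, 0.0), 1), ((-2.0, -2.0), -1), ((1.0, 0.0), -1)]
w, b = (0.5, 0.5), 0.0

xi = slacks(w, b, data)
print(xi)
print(soft_margin_objective(w, b, data, C=1.0))
print(soft_margin_objective(w, b, data, C=5000.0))
```

A slack between 0 and 1 marks a pattern inside the margin but still correctly classified, while a slack above 1 marks a misclassified pattern. With a large C the penalty term dominates the objective, so the optimiser is pushed towards planes with fewer violations.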

Nonlinear SVC

• A set of linear hyperplanes is not flexible enough to provide low empirical risk for many real-life problems (Minsky and Papert, 1969).

• Two different ways to increase the flexibility of the set of functions:

1. Use a set of functions which are superpositions of linear indicator functions (like sigmoid functions in NN).

2. Map the input vectors into a high-dimensional space and construct a separating hyperplane in that space.

• If it is not possible to have a decision surface defined by a linear equation, a technique proposed by Boser et al. (1992) is used.

• Feature vectors are mapped into a very high-dimensional feature space via a nonlinear mapping.

• In the higher-dimensional feature space the data are spread out, so that a linear hyperplane can be used as the discriminating surface.

• The concept of a kernel function is used to reduce the computational demand in feature space.

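Cont.

• For this case equation (2) can be written as:

$$L(\boldsymbol{\lambda}) = \sum_{i} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$

where a kernel K is defined as

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$$

with $\phi$ the nonlinear mapping into feature space.

A number of kernels can be used:

Polynomial kernel: $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^d$

Radial basis function: $K(\mathbf{x}, \mathbf{y}) = e^{-\gamma \|\mathbf{x} - \mathbf{y}\|^2}$

where d and γ are user defined.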
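Both kernels are short functions, and the polynomial case makes the computational saving concrete: for 2-D inputs and d = 2 the kernel value equals an inner product in a six-dimensional feature space that never has to be built. The sketch below is illustrative only; the explicit map `phi_degree2` and the test vectors are assumptions chosen for the demonstration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, y, d=2):
    """K(x, y) = (x . y + 1)^d"""
    return (dot(x, y) + 1) ** d

def rbf_kernel(x, y, gamma=2.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def phi_degree2(x):
    """Explicit feature map whose inner product reproduces the degree-2
    polynomial kernel for 2-D inputs (written out only for comparison)."""
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x[0], r2 * x[1], x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2)

def kernel_decision(x, support_vectors, labels, lambdas, b, kernel):
    """sign(sum_i lambda_i * y_i * K(x_i, x) + b), returned as +1 or -1."""
    s = sum(l * yi * kernel(sv, x)
            for l, yi, sv in zip(lambdas, labels, support_vectors)) + b
    return 1 if s >= 0 else -1

x, y = (1.0, 2.0), (3.0, -1.0)
print(polynomial_kernel(x, y), dot(phi_degree2(x), phi_degree2(y)))
print(rbf_kernel(x, x), rbf_kernel(x, y))
```

The kernel needs one inner product in the input space, whereas the explicit route needs the mapped vectors first. For the radial basis function the implied feature space is infinite-dimensional, so the kernel form is the only practical one.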

Advantages/disadvantages

• Uses quadratic programming (QP) optimisation, so there is no risk of local minima, unlike NN.

• Uses the data points closest to the class boundary, so it relies on a small number of training data (called support vectors).

• Basically a two-class classifier, so different methods exist to create a multi-class classifier, and the choice affects performance.

• The choice of kernel and kernel-specific user-defined parameters may affect the final classification accuracy.

• The choice of the parameter C affects the classification accuracy.

Data used

• ETM+ data (study area: Littleport, Cambridgeshire, UK, 2000).

• Hyperspectral (DAIS) data (Spain).

Analysis

• Random sampling was used to select training and test data.

• Separate data sets were used for training and testing the classifiers.

• 2700 training and 2037 test pixels with 7 classes were used with the ETM+ data.

• 1600 training and 3800 test pixels with 8 classes were used with the DAIS data.

• A total of 65 features (spectral bands) were used with the DAIS data, as seven features with severe striping were discarded. The initial number of features used was five, and the experiment was repeated with 10, 15, …, 65 features, giving a total of 13 experiments.

Cont.

• A standard back-propagation neural classifier (NN) was used. All user-defined parameters were set as recommended by Kavzoglu (2001), with one hidden layer of 26 nodes.

• Maximum likelihood (ML) was also used.

• Classification accuracy and the Kappa value were computed for the ETM+ data, while only classification accuracy was computed for the DAIS data.

• Like neural network classifiers, the performance of a support vector classifier depends on some user-defined parameters, such as the kernel type, kernel-specific parameters, the multi-class method and the parameter C.

• For this study, the “one against one” multi-class method, C = 5000 and a radial basis kernel with γ (the kernel-specific parameter) set to 2 were used.
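The "one against one" strategy trains one two-class classifier for every pair of classes and lets them vote. The following Python sketch shows only the voting logic. The pairwise classifiers are hypothetical stand-ins (a nearest-centre rule on a 1-D feature), not trained SVCs, and the integer class labels are assumptions.

```python
from collections import Counter
from itertools import combinations

def make_pairwise_classifiers(classes, train_binary):
    """Build one binary classifier per unordered pair of classes."""
    return {pair: train_binary(*pair) for pair in combinations(classes, 2)}

def predict_one_against_one(x, classifiers):
    """Each pairwise classifier votes for one class; the majority wins."""
    votes = Counter(clf(x) for clf in classifiers.values())
    return votes.most_common(1)[0][0]

def nearest_centre_classifier(a, b):
    """Stand-in for a trained two-class SVC (hypothetical): picks the closer
    of two 1-D class centres located at the class labels themselves."""
    return lambda x: a if abs(x - a) <= abs(x - b) else b

classes = list(range(7))  # seven classes, as in the ETM+ experiment
classifiers = make_pairwise_classifiers(classes, nearest_centre_classifier)

print(len(classifiers))   # n * (n - 1) / 2 pairwise classifiers
print(predict_one_against_one(3.2, classifiers))
```

The number of binary problems grows quadratically with the number of classes, but each one is trained only on the pixels of its two classes.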

Results with ETM+

Classification accuracies achieved with different classifiers:

Classifier used        Accuracy (%)    Kappa value
Maximum likelihood     82.9            0.80
Neural network         85.1            0.83
Support vector         87.9            0.87

Classifiers                    Z value
SVM vs. neural network         2.46
SVM vs. maximum likelihood     5.45
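Overall accuracy and the Kappa value are both derived from a confusion matrix of test pixels. The sketch below shows the standard calculation on a hypothetical three-class matrix; the counts are invented and do not come from the study. The slides do not state how the Z values were derived, so that step is not reproduced here.

```python
def overall_accuracy(cm):
    """Fraction of test pixels on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def kappa(cm):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), with p_e the chance agreement
    computed from the row and column totals."""
    n = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(len(cm))) / n
    p_e = sum(sum(cm[i]) * sum(row[i] for row in cm)
              for i in range(len(cm))) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical confusion matrix (rows: reference class, columns: predicted).
cm = [
    [45, 5, 0],
    [4, 40, 6],
    [1, 5, 44],
]

print(overall_accuracy(cm))
print(kappa(cm))
```

Kappa discounts the agreement expected by chance, which is why it is lower than the overall accuracy in the table of results.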

Results with DAIS data

[Figure: classification accuracy (%) against number of bands for the maximum likelihood, neural network and support vector classifiers]

Classification accuracy (%) by number of bands:

Number of bands    Maximum likelihood    Neural network    Support vector
5                  66                    47.8              67.6
10                 76.7                  69.8              76.1
15                 82.4                  76.4              84.3
20                 83.7                  82.2              86.2
25                 87.7                  84.7              90.3
30                 89.9                  90.2              93.4
40                 92.7                  91.6              94.5
45                 93.8                  89.9              94.5
50                 94                    93.6              95
55                 93.8                  92.4              96.1
60                 93.6                  93.4              96.1
65                 85.8                  93                95.1
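The DAIS accuracies can be summarised with a few lines of Python. The sketch below uses the reported values as they appear in the results; the helper names `peak`, `drop_after_peak` and `best_classifier_counts` are illustrative.

```python
bands = [5, 10, 15, 20, 25, 30, 40, 45, 50, 55, 60, 65]
accuracy = {
    "Maximum likelihood": [66, 76.7, 82.4, 83.7, 87.7, 89.9, 92.7, 93.8, 94, 93.8, 93.6, 85.8],
    "Neural network": [47.8, 69.8, 76.4, 82.2, 84.7, 90.2, 91.6, 89.9, 93.6, 92.4, 93.4, 93],
    "Support vector": [67.6, 76.1, 84.3, 86.2, 90.3, 93.4, 94.5, 94.5, 95, 96.1, 96.1, 95.1],
}

def peak(name):
    """Smallest number of bands at which a classifier reaches its best accuracy."""
    values = accuracy[name]
    best = max(values)
    return bands[values.index(best)], best

def drop_after_peak(name):
    """Accuracy lost between the peak and the largest feature set (65 bands)."""
    values = accuracy[name]
    return round(max(values) - values[-1], 1)

def best_classifier_counts():
    """How often each classifier has the highest accuracy across band counts."""
    counts = {name: 0 for name in accuracy}
    for i in range(len(bands)):
        winner = max(accuracy, key=lambda name: accuracy[name][i])
        counts[winner] += 1
    return counts

for name in accuracy:
    print(name, peak(name), drop_after_peak(name))
print(best_classifier_counts())
```

The drop between the peak and 65 bands is the quantity of interest for the Hughes phenomenon: it shows how much each classifier loses when features are added while the training set stays fixed.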

Conclusions

• The performance of SVC is better than that of the NN and ML classifiers.

• Like NN, SVC is also affected by the choice of some user-defined parameters. This study concludes that these parameters are easier to set for SVC than for NN.

• Unlike NN classifiers, SVC has no problem of local minima.

• The training time of SVC is quite small compared with the NN classifier (0.30 minutes for SVC against 58 minutes for the NN classifier on a SUN machine).

• SVC performs very well with a small number of training data, irrespective of the number of features used.

• SV classifiers are almost unaffected by the Hughes (1968) phenomenon.
