Dbm630 lecture07


Page 1: Dbm630 lecture07

DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University, Semester 2/2011

Lecture 7: Classification and Prediction (Naïve Bayes, Regression and SVM)

by Kritsada Sriphaew (sriphaew.k AT gmail.com)

Page 2: Dbm630 lecture07

Topics

Statistical Modeling: Naïve Bayes Classification
  - the sparseness problem
  - missing values
  - numeric attributes

Regression
  - Linear Regression
  - Regression Tree

Support Vector Machine

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 3: Dbm630 lecture07

Statistical Modeling

The "opposite" of 1R: use all the attributes.

Two assumptions: attributes are
  - equally important
  - statistically independent (given the class value)

This means that knowledge about the value of a particular attribute doesn’t tell us anything about the value of another attribute (if the class is known)

Although based on assumptions that are almost never correct, this scheme works well in practice!

Classification – Naïve Bayes

Page 4: Dbm630 lecture07

An Example: Evaluating the Weather Attributes (Revised)

Outlook   Temp.  Humidity  Windy  Play
sunny     hot    high      false  no
sunny     hot    high      true   no
overcast  hot    high      false  yes
rainy     mild   high      false  yes
rainy     cool   normal    false  yes
rainy     cool   normal    true   no
overcast  cool   normal    true   yes
sunny     mild   high      false  no
sunny     cool   normal    false  yes
rainy     mild   normal    false  yes
sunny     mild   normal    true   yes
overcast  mild   high      true   yes
overcast  hot    normal    false  yes
rainy     mild   high      true   no

1R chooses the attribute that produces rules with the smallest number of errors, i.e., rule 1 (Outlook) or rule 3 (Humidity):

Attribute  Rule               Error  Total Error
Outlook    sunny -> no        2/5
           overcast -> yes    0/4
           rainy -> yes       2/5    4/14
Temp.      hot -> no*         2/4
           mild -> yes        2/6
           cool -> yes        1/4    5/14
Humidity   high -> no         3/7
           normal -> yes      1/7    4/14
Windy      false -> yes       2/8
           true -> no*        3/6    5/14

Classification – Naïve Bayes

Page 5: Dbm630 lecture07

Probabilities for the Weather Data

From the table above we build a probabilistic model: for each attribute, count how often each value occurs with each class and turn the counts into relative frequencies (a sketch of this counting step is shown below).
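A minimal Python sketch of this counting step (my own illustration, not from the slides), using the 14 training days listed above:

```python
from collections import Counter, defaultdict

# The 14 training days from the weather table (outlook, temp, humidity, windy, play).
data = [
    ("sunny","hot","high",False,"no"),      ("sunny","hot","high",True,"no"),
    ("overcast","hot","high",False,"yes"),  ("rainy","mild","high",False,"yes"),
    ("rainy","cool","normal",False,"yes"),  ("rainy","cool","normal",True,"no"),
    ("overcast","cool","normal",True,"yes"),("sunny","mild","high",False,"no"),
    ("sunny","cool","normal",False,"yes"),  ("rainy","mild","normal",False,"yes"),
    ("sunny","mild","normal",True,"yes"),   ("overcast","mild","high",True,"yes"),
    ("overcast","hot","normal",False,"yes"),("rainy","mild","high",True,"no"),
]
attrs = ["outlook", "temp", "humidity", "windy"]

# Class priors, e.g. p(play=yes) = 9/14.
class_counts = Counter(row[-1] for row in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

# Conditional relative frequencies, e.g. p(outlook=sunny | play=yes) = 2/9.
cond = defaultdict(lambda: defaultdict(Counter))
for *values, play in data:
    for attr, value in zip(attrs, values):
        cond[attr][play][value] += 1

print(priors)                                                # {'no': 0.357..., 'yes': 0.642...}
print(cond["outlook"]["yes"]["sunny"] / class_counts["yes"]) # 2/9 = 0.222...
```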

Classification – Naïve Bayes

Page 6: Dbm630 lecture07

Bayes's Rule

Probability of event H given evidence E:

$$p(H \mid E) = \frac{p(E \mid H)\, p(H)}{p(E)}$$

A priori probability of H, p(H): the probability of the event before evidence has been seen.

A posteriori probability of H, p(H|E): the probability of the event after evidence has been seen.

Classification – Naïve Bayes

Page 7: Dbm630 lecture07

Naïve Bayes for Classification

Classification learning: what is the probability of the class given an instance?
  - Evidence E = the instance
  - Event H = the class value for the instance

Naïve Bayes assumption: the "independent feature model", i.e., the presence (or absence) of a particular attribute (feature) value is unrelated to the presence (or absence) of any other attribute value, given the class. Therefore:

$$p(H \mid E_1, E_2, \ldots, E_n) = \frac{p(E_1 \mid H)\, p(E_2 \mid H) \cdots p(E_n \mid H)\, p(H)}{p(E)}$$

which follows from Bayes's rule

$$p(H \mid E) = \frac{p(E \mid H)\, p(H)}{p(E)}$$

Classification – Naïve Bayes

Page 8: Dbm630 lecture07

Naïve Bayes for Classification

For a new day E = (outlook = sunny, temp = cool, humidity = high, windy = true):

$$p(\text{play}=\text{yes} \mid E) = \frac{p(\text{sunny} \mid \text{yes})\; p(\text{cool} \mid \text{yes})\; p(\text{high} \mid \text{yes})\; p(\text{true} \mid \text{yes})\; p(\text{yes})}{p(E)} = \frac{\tfrac{2}{9} \cdot \tfrac{3}{9} \cdot \tfrac{3}{9} \cdot \tfrac{3}{9} \cdot \tfrac{9}{14}}{p(E)}$$
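A worked completion of this calculation (not shown on the slide; the counts follow from the weather table above):

$$p(\text{yes} \mid E) \propto \tfrac{2}{9}\cdot\tfrac{3}{9}\cdot\tfrac{3}{9}\cdot\tfrac{3}{9}\cdot\tfrac{9}{14} \approx 0.0053, \qquad p(\text{no} \mid E) \propto \tfrac{3}{5}\cdot\tfrac{1}{5}\cdot\tfrac{4}{5}\cdot\tfrac{3}{5}\cdot\tfrac{5}{14} \approx 0.0206$$

Normalizing, p(no | E) ≈ 0.0206 / (0.0206 + 0.0053) ≈ 0.795, so the model predicts play = no for this day.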

Classification – Naïve Bayes

Page 9: Dbm630 lecture07

The Sparseness Problem (the "zero-frequency problem")

What if an attribute value doesn't occur with every class value (e.g. "outlook = overcast" for class "no")?
  - The conditional probability will be zero: p(outlook=overcast | play=no) = 0
  - The a posteriori probability will also be zero, no matter how likely the other values are:
    p(play=no | outlook=overcast, temp=cool, humidity=high, windy=true) = 0

Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator).

Result: probabilities will never be zero! (also: stabilizes probability estimates)
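A minimal sketch of the Laplace estimator applied to the overcast/no counts (my own illustration, not from the slides):

```python
# Counts of outlook values for class play=no, taken from the weather table.
counts_no = {"sunny": 3, "overcast": 0, "rainy": 2}   # total = 5

# Unsmoothed estimate: p(overcast | no) = 0/5 = 0, which wipes out the whole product.
p_unsmoothed = counts_no["overcast"] / sum(counts_no.values())

# Laplace estimator: add 1 to every value-class count; the denominator grows by the
# number of distinct values (here 3), so no probability can be exactly zero.
k = len(counts_no)
p_smoothed = {v: (c + 1) / (sum(counts_no.values()) + k) for v, c in counts_no.items()}

print(p_unsmoothed)   # 0.0
print(p_smoothed)     # {'sunny': 0.5, 'overcast': 0.125, 'rainy': 0.375}
```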

Classification – Naïve Bayes

Page 10: Dbm630 lecture07

Modified Probability Estimates

In some cases, adding a constant different from 1 might be more appropriate.

Example: attribute outlook for class yes (counts 2 sunny, 4 overcast, 3 rainy out of 9). With a constant m divided equally among the three values:

$$p(\text{sunny} \mid \text{yes}) = \frac{2 + m/3}{9 + m}, \qquad p(\text{overcast} \mid \text{yes}) = \frac{4 + m/3}{9 + m}, \qquad p(\text{rainy} \mid \text{yes}) = \frac{3 + m/3}{9 + m}$$

The weights don't need to be equal, as long as they sum to 1 (p1 + p2 + p3 = 1):

$$p(\text{sunny} \mid \text{yes}) = \frac{2 + m p_1}{9 + m}, \qquad p(\text{overcast} \mid \text{yes}) = \frac{4 + m p_2}{9 + m}, \qquad p(\text{rainy} \mid \text{yes}) = \frac{3 + m p_3}{9 + m}$$

Classification – Naïve Bayes

Page 11: Dbm630 lecture07

Missing Value Problem

Training: the instance is not included in the frequency count for that attribute value-class combination.

Classification: the attribute is simply omitted from the calculation.

Classification – Naïve Bayes

Page 12: Dbm630 lecture07

Dealing with Numeric Attributes

Common assumption: attributes have a normal (Gaussian) probability distribution, given the class. The density is defined by the sample mean and standard deviation:

The sample mean: $$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The standard deviation: $$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2}$$

The density function: $$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Classification – Naïve Bayes

Page 13: Dbm630 lecture07

An Example: Evaluating the Weather Attributes (Numeric)

Outlook   Temp.  Humidity  Windy  Play
sunny     85     85        false  no
sunny     80     90        true   no
overcast  83     86        false  yes
rainy     70     96        false  yes
rainy     68     80        false  yes
rainy     65     70        true   no
overcast  64     65        true   yes
sunny     72     95        false  no
sunny     69     70        false  yes
rainy     75     80        false  yes
sunny     75     70        true   yes
overcast  72     90        true   yes
overcast  81     75        false  yes
rainy     71     91        true   no

Classification – Naïve Bayes

Page 14: Dbm630 lecture07

Statistics for the Weather Data

For each class, a numeric attribute is summarized by its mean and standard deviation (e.g. temperature: mean 73, std 6.2 for play=yes; humidity: mean 86.2, std 9.7 for play=no).

Example density values:

$$f(\text{temperature}=66 \mid \text{yes}) = \frac{1}{\sqrt{2\pi}\,(6.2)}\, e^{-\frac{(66-73)^2}{2(6.2)^2}} = 0.0340$$

$$f(\text{humidity}=90 \mid \text{no}) = \frac{1}{\sqrt{2\pi}\,(9.7)}\, e^{-\frac{(90-86.2)^2}{2(9.7)^2}} = 0.0380$$
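A quick numeric check of these density values (a sketch I added; the helper function is not from the slides):

```python
import math

def gaussian_density(x, mean, std):
    """Normal probability density f(x) with the given mean and standard deviation."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

print(gaussian_density(66, 73.0, 6.2))    # ~0.0340  -> f(temperature=66 | yes)
print(gaussian_density(90, 86.2, 9.7))    # ~0.0381  -> f(humidity=90 | no)
```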

Classification – Naïve Bayes

Page 15: Dbm630 lecture07

Classify a New Case

To classify a new case, compute the class probabilities as above; if any values are missing (in either the training data or the case being classified), simply omit them from the calculation.

(Figure: the new case we would like to predict.)

Page 16: Dbm630 lecture07

Probability Densities

Relationship between probability and density: for a small interval of width ε around value c,

$$p\!\left(c - \tfrac{\varepsilon}{2} \le x \le c + \tfrac{\varepsilon}{2}\right) \approx \varepsilon \, f(c)$$

But this doesn't change the calculation of a posteriori probabilities, because ε cancels out.

Exact relationship:

$$p(a \le x \le b) = \int_a^b f(t)\, dt$$

Classification – Naïve Bayes

Page 17: Dbm630 lecture07

Discussion of Naïve Bayes

Naïve Bayes works surprisingly well, even when the independence assumption is clearly violated.

Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class.

However, adding too many redundant attributes will cause problems (e.g. identical attributes).

Note also: many numeric attributes are not normally distributed (use kernel density estimators instead).

Classification – Naïve Bayes

Page 18: Dbm630 lecture07

General Bayesian Classification

Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.

Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.

Classification – Naïve Bayes

Page 19: Dbm630 lecture07

Bayesian Theorem

Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

MAP (maximum a posteriori) hypothesis:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\, P(h)$$

Difficulty: requires initial knowledge of many probabilities, and significant computational cost.

If we assume P(h_i) = P(h_j) for all i, j, the method simplifies further: choose the maximum likelihood (ML) hypothesis

$$h_{ML} = \arg\max_{h_i \in H} P(h_i \mid D) = \arg\max_{h_i \in H} P(D \mid h_i)$$

Classification – Naïve Bayes

Page 20: Dbm630 lecture07

Naïve Bayes Classifiers

Assumption: attributes are conditionally independent given the class:

$$c_{MAP} = \arg\max_{c_i \in C} P(c_i \mid v_1, v_2, \ldots, v_J) = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{J} P(v_j \mid c_i)$$

This greatly reduces the computation cost: only the class distribution and per-class value counts are needed. However, the assumption is seldom satisfied in practice, as attributes (variables) are often correlated.

Attempts to overcome this limitation:
  - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  - Decision trees, which reason on one attribute at a time, considering the most important attributes first
  - Association rules, which predict a class from several attributes

Classification – Naïve Bayes

Page 21: Dbm630 lecture07

Bayesian Belief Network (An Example)

Also called Bayes Nets.

Variables (nodes): Storm, Lightning, Thunder, BusTourGroup, Campfire, ForestFire.

The conditional probability table (CPT) for the variable Campfire, given its parents Storm (S) and BusTourGroup (B):

        (S,B)   (S,~B)   (~S,B)   (~S,~B)
  C      0.4     0.1      0.8      0.2
  ~C     0.6     0.9      0.2      0.8

  - The network represents a set of conditional independence assertions.
  - It is a directed acyclic graph.

Classification – Naïve Bayes

Attributes (variables) are often correlated. Each variable is conditionally independent given its predecessors

Page 22: Dbm630 lecture07

Bayesian Belief Network (Dependence and Independence)

(Same network and Campfire CPT as on the previous slide.)

The network represents the joint probability distribution over all variables, e.g. P(Storm, BusTourGroup, ..., ForestFire).

In general,

$$P(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$$

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph.

So the joint distribution is fully defined by the graph plus the tables P(y_i | Parents(Y_i)).
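A small sketch of how the factored joint is evaluated, using the Campfire CPT above; the priors for Storm and BusTourGroup are made-up placeholders (not values from the slides):

```python
# P(Campfire=true | Storm, BusTourGroup) from the CPT above, keyed by (storm, bus_tour_group).
p_campfire = {(True, True): 0.4, (True, False): 0.1, (False, True): 0.8, (False, False): 0.2}

# Hypothetical root priors (NOT from the slides), just to make the example runnable.
p_storm = 0.3
p_bus_tour_group = 0.5

def joint(storm, bus, campfire):
    """Product of the per-node terms for Storm, BusTourGroup and Campfire."""
    p = p_storm if storm else 1 - p_storm
    p *= p_bus_tour_group if bus else 1 - p_bus_tour_group
    pc = p_campfire[(storm, bus)]
    p *= pc if campfire else 1 - pc
    return p

print(joint(storm=True, bus=False, campfire=True))   # 0.3 * 0.5 * 0.1 = 0.015
```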

Classification – Naïve Bayes

Page 23: Dbm630 lecture07

Bayesian Belief Network (Inference in Bayes Nets)

Infer the values of one or more network variables given observed values of the others; the Bayes net contains all the information needed for this inference.

If only one variable has an unknown value, it is easy to infer it. In the general case, the problem is NP-hard.

There are three types of inference:
  - Top-down inference: p(Campfire | Storm)
  - Bottom-up inference: p(Storm | Campfire)
  - Hybrid inference: p(BusTourGroup | Storm, Campfire)

Classification – Naïve Bayes

Page 24: Dbm630 lecture07

Bayesian Belief Network (Training Bayesian Belief Networks)

There are several variants of this learning task:
  - The network structure might be known or unknown.
  - The training examples might provide values of all network variables, or just some.

If the structure is known and all variables are observed, training is as easy as training a Naïve Bayes classifier.

If the structure is known but only some variables are observed (e.g. we observe ForestFire, Storm, BusTourGroup and Thunder, but not Lightning and Campfire), use gradient ascent: converge to the network h that maximizes P(D|h).

Classification – Naïve Bayes

Page 25: Dbm630 lecture07

Numerical Modeling: Regression

Numerical models are used for prediction.

Counterparts exist for all the schemes we previously discussed: decision trees, statistical models, etc.

All classification schemes can be applied to regression problems by discretizing the target; the prediction is then the weighted average of the intervals' midpoints (weighted according to the class probabilities).

Regression is more difficult than classification (evaluated by mean squared error rather than percent correct).

Prediction – Regression

Page 26: Dbm630 lecture07

Linear Regression

Works most naturally with numeric attributes; it is the standard technique for numeric prediction.

The outcome is a linear combination of the attributes, with weights calculated from the training data:

$$Y = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_k x_k = \sum_{j=0}^{k} w_j x_j$$

The predicted value for the first training instance x^(1) (with x_0 = 1 by convention) is:

$$Y^{(1)} = w_0 x_0^{(1)} + w_1 x_1^{(1)} + \cdots + w_k x_k^{(1)} = \sum_{j=0}^{k} w_j x_j^{(1)}$$

Prediction – Regression

Page 27: Dbm630 lecture07

Minimize the Squared Error (I)

The k+1 coefficients are chosen so that the squared error on the training data is minimized.

Squared error:

$$\sum_{i=1}^{n} \left( y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)} \right)^2$$

The coefficients can be derived using standard matrix operations. This can be done if there are more instances than attributes (roughly speaking); with fewer instances there are many solutions. Minimization of the absolute error is more difficult!

Prediction – Regression

Page 28: Dbm630 lecture07

Minimize the Squared Error (II)

In matrix form, with X the matrix whose i-th row is instance x^(i) (one column per coefficient, including x_0), w the column vector of weights, and Y the column vector of target values:

$$\min_{w} \sum_{i=1}^{n} \left( y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)} \right)^2 = \min_{w} \| Y - Xw \|^2$$
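A minimal numpy sketch of solving this least-squares problem (my own illustration; the data values are arbitrary):

```python
import numpy as np

# Toy data: 4 instances, 2 attributes. The first column is x0 = 1 for the intercept w0.
X = np.array([[1.0,  3.0, 2.0],
              [1.0,  8.0, 1.0],
              [1.0,  9.0, 4.0],
              [1.0, 13.0, 3.0]])
Y = np.array([30.0, 57.0, 64.0, 72.0])

# Solve min_w ||Y - Xw||^2 with numpy's least-squares routine.
w, residuals, rank, _ = np.linalg.lstsq(X, Y, rcond=None)

print(w)          # fitted coefficients w0, w1, w2
print(X @ w)      # predictions for the training instances
```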

Prediction – Regression

Page 29: Dbm630 lecture07

Example: Find the linear regression of the salary data

Salary data (x1 = years of experience, Y = salary in $1000s):

  x1 (years)   Y (salary)
  3            30
  8            57
  9            64
  13           72
  3            36
  6            43
  11           59
  21           90
  1            20
  16           83

For simplicity, x_0 = 1 and X = {x_1}, so Y = w_0 + w_1 x_1.

With the method of least squared error, using the s = 10 training instances, the sample means are x̄ = 9.1 and ȳ = 55.4, and

$$w_1 = \frac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2} = 3.5, \qquad w_0 = \bar{y} - w_1 \bar{x} = 23.55$$

The predicted line is estimated by Y = 23.55 + 3.5 x_1.

Prediction for x_1 = 10: Y = 23.55 + 3.5(10) = 58.55
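A quick check of this fit with numpy (my own sketch). The exact least-squares slope is about 3.54; the slide rounds it to 3.5 before computing the intercept, which is how it arrives at 23.55:

```python
import numpy as np

x = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1, 16], dtype=float)
y = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20, 83], dtype=float)

w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(x.mean(), y.mean())   # 9.1 55.4
print(w1, w0)               # ~3.54, ~23.2 (the slide rounds w1 to 3.5, giving w0 = 23.55)
print(w0 + w1 * 10)         # prediction for 10 years of experience (~58.6)
```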

Prediction – Regression

Page 30: Dbm630 lecture07

Classification using Linear Regression (One with the others)

Any regression technique can be used for classification:
  - Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that do not.
  - Prediction: predict the class corresponding to the model with the largest output value (membership value).

For linear regression, this is known as multi-response linear regression (see the sketch after the example below).

Prediction – Regression

For example, the data has three classes {A, B, C}.

Linear Regression Model 1: predict 1 for class A and 0 for not A

Linear Regression Model 2: predict 1 for class B and 0 for not B

Linear Regression Model 3: predict 1 for class C and 0 for not C
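A minimal sketch of this multi-response (one-versus-rest) scheme for the three-class case described above; the data and labels are made up purely for illustration:

```python
import numpy as np

# Toy training data: 6 instances, 2 attributes, classes A/B/C.
X = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 0.9], [0.5, 0.5], [0.6, 0.4]])
y = np.array(["A", "A", "B", "B", "C", "C"])
classes = ["A", "B", "C"]

Xb = np.hstack([np.ones((len(X), 1)), X])   # add x0 = 1 for the intercept

# One linear regression per class: the target is 1 for that class, 0 otherwise.
weights = {c: np.linalg.lstsq(Xb, (y == c).astype(float), rcond=None)[0] for c in classes}

def predict(x):
    xb = np.concatenate(([1.0], x))
    scores = {c: float(xb @ w) for c, w in weights.items()}   # membership values
    return max(scores, key=scores.get)                        # class with largest output

print(predict(np.array([0.15, 0.95])))   # expected to come out as "B"
```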

Page 31: Dbm630 lecture07

Classification using Linear Regression (Pairwise Regression)

Another way of using regression for classification:
  - Build a regression function for every pair of classes, using only the instances from those two classes.
  - An output of +1 is assigned to one member of the pair, an output of -1 to the other.
  - Prediction is done by voting: the class that receives the most votes is predicted. Alternatively, answer "don't know" if there is no agreement.
  - This is more likely to be accurate, but more expensive.

Prediction – Regression

For example, the data has three classes {A, B, C}.

Linear Regression Model 1: predict +1 for class A and -1 for class B

Linear Regression Model 2: predict +1 for class A and -1 for class C

Linear Regression Model 3: predict +1 for class B and -1 for class C

Page 32: Dbm630 lecture07

Regression Tree and Model Tree

A regression tree is a decision tree with averaged numeric values at the leaves. A model tree is a tree whose leaves contain linear regressions.

Example: predicting CPU performance (PRP) from cycle time (MYCT), main memory (MMIN, MMAX), cache (CACH) and channels (CHMIN, CHMAX).

A single global linear regression model:

  PRP = -55.9 + 0.0489 MYCT + 0.153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX

(Figures: the CPU performance data table, a regression tree with averaged PRP values at its leaves, and a model tree whose leaves hold the linear models LM1-LM6 below. The regression tree splits on CHMIN, CACH, MMAX, MMIN, MYCT and CHMAX; the smaller model tree splits on CHMIN, CACH and MMAX.)

Linear models at the leaves of the model tree:

  LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
  LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
  LM3: PRP = 38.1 + 0.12 MMIN
  LM4: PRP = 19.5 + 0.02 MMAX + 0.698 CACH + 0.969 CHMAX
  LM5: PRP = 285 + 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
  LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX

Prediction – Regression
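A small sketch of fitting a regression tree with scikit-learn (my own example; the rows are illustrative CPU-like values, not the full dataset used for the trees above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Columns: MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX (illustrative rows).
X = np.array([
    [125,  256,  6000, 16, 16, 128],
    [ 29, 8000, 32000, 32,  8,  32],
    [ 29, 8000, 32000, 32,  8,  32],
    [125, 2000,  8000,  0,  2,  14],
    [480,  512,  8000, 32,  0,   0],
    [480, 1000,  4000,  0,  0,   0],
])
y = np.array([198, 269, 220, 52, 67, 45], dtype=float)   # PRP (performance)

# A shallow regression tree: each leaf predicts the average PRP of its training instances.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(tree.predict(X[:2]))
```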

Page 33: Dbm630 lecture07

Support Vector Machine (SVM)

SVM is related to statistical learning theory.

SVM was first introduced in 1992 [1] by Vladimir Vapnik, a researcher originally from the Soviet Union.

SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate, the same as the error rate of a carefully constructed neural network, LeNet 4.

SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning, and is widely used for classification tasks.

Support Vector Machines

Page 34: Dbm630 lecture07

What is a good Decision Boundary?

Consider a two-class, linearly separable classification problem (Class 1 vs. Class 2).

There are many possible decision boundaries! The Perceptron algorithm can be used to find such a boundary, and different algorithms have been proposed.

Are all decision boundaries equally good?

Support Vector Machines

Page 35: Dbm630 lecture07

Examples of Bad Decision Boundaries

(Figure: decision boundaries for Class 1 vs. Class 2 that pass very close to the training points; the boundary marked BEST stays far from both classes.)

Support Vector Machines

Page 36: Dbm630 lecture07

Large-margin Decision Boundary

The decision boundary should be as far away from the data of both classes as possible; we should maximize the margin m.

For the hyperplanes w^T x + b = 1 and w^T x + b = -1 on either side of the decision boundary w^T x + b = 0, the distance from the origin to the line w^T x = k is k/||w||, and the margin is

$$m = \frac{2}{\|w\|}$$

Page 37: Dbm630 lecture07

Example

(Figure: a small 2-D example. The support vectors lying on the hyperplanes w^T x + b = +1 and w^T x + b = -1 give equations that determine w and b, and the margin is the distance m = 2/||w|| between the two hyperplanes.)

Page 38: Dbm630 lecture07

Example (continued)

The best boundary maximizes the margin m = 2/||w||, i.e. minimizes ||w||.

As we also want to prevent data points from falling into the margin, we add the following constraint for each point i:

$$w^T x_i + b \ge 1 \quad \text{for } x_i \text{ of the first class}$$

and

$$w^T x_i + b \le -1 \quad \text{for } x_i \text{ of the second class}$$

For n points, this can be rewritten as:

$$y_i (w^T x_i + b) \ge 1 \quad \text{for all } 1 \le i \le n$$

Page 39: Dbm630 lecture07

Primal Form

The problem as stated is difficult to solve because it depends on ||w||, the norm of w, which involves a square root. We alter the objective by substituting ||w|| with (1/2)||w||^2 (the factor of 1/2 being used for mathematical convenience) without changing the solution. This is a quadratic programming (QP) optimization problem:

Minimize in (w, b):

$$\frac{1}{2}\|w\|^2$$

subject to (for any i = 1, ..., n):

$$y_i (w^T x_i + b) \ge 1$$

How to solve this optimization, and more information on SVM (e.g. the dual form and kernels), can be found in ref [1].

[1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor & Nello Cristianini ,

Cambridge University Press, 2000. http://www.support-vector.net
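For intuition, a minimal sketch of training a linear maximum-margin classifier with scikit-learn (my own example; the slides do not prescribe a particular library). A large C approximates the hard-margin formulation above:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2-D (made-up points).
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# kernel='linear' solves a soft-margin version of the QP above; a large C ~ hard margin.
clf = SVC(kernel="linear", C=1e3).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin =", 2 / np.linalg.norm(w))     # m = 2 / ||w||
print(clf.support_vectors_)                  # the points that define the boundary
```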

Page 40: Dbm630 lecture07

Extension to Non-linear Decision Boundaries

So far, we have only considered large-margin classifiers with a linear decision boundary. How can we generalize to a nonlinear boundary?

Key idea: transform x_i to a higher-dimensional space to "make life easier".
  - Input space: the space where the points x_i are located.
  - Feature space: the space of f(x_i) after the transformation.

Why transform? A linear operation in the feature space is equivalent to a non-linear operation in the input space, and classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1·x2 makes the problem linearly separable (see the sketch below).
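A tiny illustration of the XOR remark (my own sketch): in the original two features the classes are not linearly separable, but after adding the product feature x1*x2 a linear classifier separates them perfectly:

```python
import numpy as np
from sklearn.svm import SVC

# XOR data: class +1 when exactly one of x1, x2 is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

linear = SVC(kernel="linear", C=1e3).fit(X, y)
print(linear.score(X, y))        # < 1.0: no linear boundary separates XOR in input space

# Add the product feature x1*x2; the same linear classifier becomes perfect.
X_feat = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
mapped = SVC(kernel="linear", C=1e3).fit(X_feat, y)
print(mapped.score(X_feat, y))   # 1.0
```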

Support Vector Machines

Page 41: Dbm630 lecture07

Transforming the Data

Computation in the feature space can be costly because it is high dimensional; the feature space is typically infinite-dimensional! The kernel trick can help (more info in ref [1]).

(Figure: points in the input space are mapped by f(.) into the feature space; in practice the feature space is of higher dimension than the input space.)

[1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor & Nello Cristianini ,

Cambridge University Press, 2000. http://www.support-vector.net

Page 42: Dbm630 lecture07

Why does SVM Work?

The feature space is often very high dimensional. Why don't we suffer from the curse of dimensionality?

A classifier in a high-dimensional space has many parameters and is hard to estimate. Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of the classifier.

Typically a classifier with many parameters is very flexible, but there are also exceptions. For example, let x_i = 10^i for i from 1 to n. A classifier of the form sign(sin(a·x)), with a single parameter a, can classify all x_i correctly for every possible combination of class labels on the x_i. This one-parameter classifier is very flexible.

Support Vector Machines

Page 43: Dbm630 lecture07

Why does SVM Work? (continued)

Vapnik argues that the flexibility of a classifier should not be characterized by its number of parameters, but by its capacity. This is formalized by the "VC-dimension" of a classifier.

Consider a linear classifier in two-dimensional space: if we have three training data points, no matter how those points are labeled, we can classify them perfectly.

Support Vector Machines

Page 44: Dbm630 lecture07

VC-dimension

However, if we have four points, we can find a labeling such that a linear classifier fails to be perfect. So 3 is the critical number: the VC-dimension of a linear classifier in a 2-D space is 3, because with 3 training points perfect classification is always possible irrespective of the labeling, whereas for 4 points perfect classification can be impossible.

Support Vector Machines

Page 45: Dbm630 lecture07

Other Aspects of SVM

How to use SVM for multi-class classification?
  - The original SVM is for binary classification.
  - One can change the QP formulation to handle multiple classes directly.
  - More often, multiple binary classifiers are combined: train multiple one-versus-the-rest classifiers, or combine multiple pairwise classifiers "intelligently".

How to interpret the SVM discriminant function value as a probability?
  - By performing logistic regression on the SVM outputs for a set of data (a validation set) not used for training.
  - Some SVM software (like LIBSVM) has these features built in.

A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
  - Some implementations (such as LIBSVM) can handle multi-class classification.
  - SVMlight is among the earliest implementations of SVM.
  - Several Matlab toolboxes for SVM are also available.

Support Vector Machines

Page 46: Dbm630 lecture07

Strengths and Weaknesses of SVM

Strengths:
  - Training is relatively easy: there are no local optima, unlike in neural networks.
  - It scales relatively well to high-dimensional data.
  - The tradeoff between classifier complexity and error can be controlled explicitly.
  - Non-traditional data such as strings and trees can be used as input to SVM, instead of feature vectors.

Weaknesses:
  - The need to choose a "good" kernel function.

Support Vector Machines

Page 47: Dbm630 lecture07

Example: Predicting a class label using naïve Bayesian classification

RID  age    income  student  credit_rating  Class: buys_computer
1    <=30   high    no       fair           no
2    <=30   high    no       excellent      no
3    31…40  high    no       fair           yes
4    >40    medium  no       fair           yes
5    >40    low     yes      fair           yes
6    >40    low     yes      excellent      no
7    31…40  low     yes      excellent      yes
8    <=30   medium  no       fair           no
9    <=30   low     yes      fair           yes
10   >40    medium  yes      fair           yes
11   <=30   medium  yes      excellent      yes
12   31…40  medium  no       excellent      yes
13   31…40  high    yes      fair           yes
14   >40    medium  no       excellent      no
15   <=30   medium  yes      fair           ?

Unknown sample: RID 15 (class to be predicted; a sketch of the computation follows).
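A sketch of the naïve Bayes computation for the unknown sample (my own code; the counts it produces, e.g. P(age<=30 | yes) = 2/9, follow from the table above):

```python
from collections import Counter, defaultdict

# Training tuples 1-14 from the table: (age, income, student, credit_rating, buys_computer).
rows = [
    ("<=30","high","no","fair","no"),         ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"),      (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"),         (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"),        (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"),("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"),     (">40","medium","no","excellent","no"),
]
unknown = ("<=30", "medium", "yes", "fair")   # RID 15

class_counts = Counter(r[-1] for r in rows)
cond = defaultdict(Counter)                    # counts keyed by (attribute index, class)
for *values, cls in rows:
    for j, v in enumerate(values):
        cond[(j, cls)][v] += 1

scores = {}
for cls, n_cls in class_counts.items():
    p = n_cls / len(rows)                      # prior, e.g. P(yes) = 9/14
    for j, v in enumerate(unknown):
        p *= cond[(j, cls)][v] / n_cls         # e.g. P(age<=30 | yes) = 2/9
    scores[cls] = p

print(scores)                                  # yes ~ 0.0282, no ~ 0.0069
print(max(scores, key=scores.get))             # -> "yes": RID 15 is predicted to buy a computer
```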

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 48: Dbm630 lecture07

Exercise: use the naïve Bayesian classifier to predict the unknown data samples (the last two rows).

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  N
Sunny     Hot          High      True   N
Overcast  Hot          High      False  Y
Rainy     Mild         High      False  Y
Rainy     Cool         Normal    False  Y
Rainy     Cool         Normal    True   N
Overcast  Cool         Normal    True   Y
Sunny     Mild         High      False  N
Sunny     Cool         Normal    False  Y
Rainy     Mild         Normal    False  Y
Sunny     Mild         Normal    True   Y
Overcast  Hot          Normal    False  Y
Overcast  Mild         High      True   Y
Rainy     Mild         High      True   N
Sunny     Cool         Normal    False  ?   <- unknown data sample
Rainy     Mild         High      False  ?   <- unknown data sample