DBM630: Data Mining and
Data Warehousing
MS.IT. Rangsit University
1
Semester 2/2011
Lecture 7
Classification and Prediction: Naïve Bayes, Regression and SVM
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
2
Statistical Modeling: Naïve Bayes Classification (sparseness problem, missing values, numeric attributes)
Regression: Linear Regression, Regression Tree
Support Vector Machine
Data Warehousing and Data Mining by Kritsada Sriphaew
3
Statistical Modeling
“Opposite” of 1R: use all the attributes
Two assumptions: attributes are (1) equally important and (2) statistically independent (given the class value)
This means that knowledge about the value of a particular attribute doesn’t tell us anything about the value of another attribute (if the class is known)
Although based on assumptions that are almost never correct, this scheme works well in practice!
Classification – Naïve Bayes
4
An Example: Evaluating the Weather Attributes (Revised)
Outlook Temp. Humidity Windy Play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
1R chooses the attribute that produces rules with the smallest number of errors, i.e., rule set 1 (Outlook) or 3 (Humidity), each with 4/14 total errors.
Attribute   Rule              Errors   Total errors
Outlook     sunny → no        2/5      4/14
            overcast → yes    0/4
            rainy → yes       2/5
Temp.       hot → no*         2/4      5/14
            mild → yes        2/6
            cool → yes        1/4
Humidity    high → no         3/7      4/14
            normal → yes      1/7
Windy       false → yes       2/8      5/14
            true → no*        3/6
(* a random choice between two equally likely outcomes)
Classification – Naïve Bayes
5
Probabilities for the Weather Data
Probabilistic model
Classification – Naïve Bayes
6
Bayes’s Rule
Probability of event H given evidence E:
A priori probability of H: p(H)
Probability of event before evidence has been seen
A posteriori probability of H: p(H|E)
Probability of the event after the evidence has been seen
$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$
Classification – Naïve Bayes
7
Naïve Bayes for Classification
Classification learning: what’s the probability of the class given an instance?
Evidence E = instance; Event H = class value for the instance
Naïve Bayes assumption: “independent feature model”, i.e., the presence (or absence) of a particular attribute (or feature) of a class is unrelated to the presence (or absence) of any other attribute, therefore:
$$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1 \mid H)\,P(E_2 \mid H)\cdots P(E_n \mid H)\,P(H)}{P(E)}$$
(compare with Bayes’s rule: $P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$)
Classification – Naïve Bayes
8
Naïve Bayes for Classification
$$P(\text{play}=\text{yes} \mid \text{outlook}=\text{sunny},\ \text{temp}=\text{cool},\ \text{humidity}=\text{high},\ \text{windy}=\text{true})$$
$$= \frac{P(\text{sunny}\mid\text{yes})\,P(\text{cool}\mid\text{yes})\,P(\text{high}\mid\text{yes})\,P(\text{true}\mid\text{yes})\,P(\text{yes})}{P(\text{sunny},\text{cool},\text{high},\text{true})} = \frac{\frac{2}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{9}{14}}{P(\text{sunny},\text{cool},\text{high},\text{true})}$$
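To make this computation concrete, here is a minimal sketch (not from the lecture; the data is the 14-day table above) that reproduces these numbers and normalizes them into posterior probabilities:

```python
from collections import Counter

data = [  # (outlook, temp, humidity, windy, play)
    ("sunny","hot","high",False,"no"),    ("sunny","hot","high",True,"no"),
    ("overcast","hot","high",False,"yes"),("rainy","mild","high",False,"yes"),
    ("rainy","cool","normal",False,"yes"),("rainy","cool","normal",True,"no"),
    ("overcast","cool","normal",True,"yes"),("sunny","mild","high",False,"no"),
    ("sunny","cool","normal",False,"yes"),("rainy","mild","normal",False,"yes"),
    ("sunny","mild","normal",True,"yes"), ("overcast","mild","high",True,"yes"),
    ("overcast","hot","normal",False,"yes"),("rainy","mild","high",True,"no"),
]

class_counts = Counter(row[-1] for row in data)
n = len(data)

def score(instance, cls):
    """Unnormalized P(cls) * prod_j P(attr_j = value_j | cls)."""
    s = class_counts[cls] / n
    for j, value in enumerate(instance):
        match = sum(1 for row in data if row[-1] == cls and row[j] == value)
        s *= match / class_counts[cls]
    return s

x = ("sunny", "cool", "high", True)
scores = {c: score(x, c) for c in class_counts}  # yes: 2/9*3/9*3/9*3/9*9/14
total = sum(scores.values())                     # estimate of P(E)
for c, s in scores.items():
    print(c, s / total)                          # posterior probabilities
```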
Classification – Naïve Bayes
9
The Sparseness Problem (The “zero-frequency problem”)
What if an attribute value doesn’t occur with every class value (e.g., “Outlook = overcast” for class “no”)?
The conditional probability will be zero! P(outlook=overcast | play=no) = 0
The a posteriori probability will then also be zero, no matter how likely the other values are!
P(play=no | outlook=overcast, temp=cool, humidity=high, windy=true) = 0
Remedy: add 1 to the count for every attribute value–class combination (Laplace estimator)
Result: probabilities will never be zero! (This also stabilizes probability estimates.)
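A one-line sketch of the Laplace estimator (illustrative code; n_values is the number of possible values of the attribute):

```python
def smoothed_prob(value_count, class_count, n_values):
    """P(value | class) with the Laplace (add-one) estimator."""
    return (value_count + 1) / (class_count + n_values)

# e.g. P(outlook=overcast | play=no): the raw estimate 0/5 becomes
print(smoothed_prob(0, 5, 3))  # (0+1)/(5+3) = 0.125, never zero
```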
Classification – Naïve Bayes
10
Modified Probability Estimates
In some cases, adding a constant different from 1 might be more appropriate.
Example: attribute outlook for class yes (counts 2 sunny, 4 overcast, 3 rainy out of 9 “yes” days). With a smoothing constant m, we can apply equal weights, or weights that don’t need to be equal (as long as they sum to 1, that is, p1 + p2 + p3 = 1).

Equal weights:
$$P(\text{sunny}\mid\text{yes}) = \frac{2 + m/3}{9 + m},\qquad P(\text{overcast}\mid\text{yes}) = \frac{4 + m/3}{9 + m},\qquad P(\text{rainy}\mid\text{yes}) = \frac{3 + m/3}{9 + m}$$

Normalized weights (p1 + p2 + p3 = 1):
$$P(\text{sunny}\mid\text{yes}) = \frac{2 + m\,p_1}{9 + m},\qquad P(\text{overcast}\mid\text{yes}) = \frac{4 + m\,p_2}{9 + m},\qquad P(\text{rainy}\mid\text{yes}) = \frac{3 + m\,p_3}{9 + m}$$
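A small sketch of this m-estimate (assumed code, with m and the priors p as tunable inputs), applied to the sunny/overcast/rainy counts above:

```python
def m_estimate(count, class_count, m, p):
    """(count + m*p) / (class_count + m): count smoothed toward prior p."""
    return (count + m * p) / (class_count + m)

counts = {"sunny": 2, "overcast": 4, "rainy": 3}  # out of 9 "yes" days
for value, c in counts.items():
    print(value, m_estimate(c, class_count=9, m=3.0, p=1/3))  # equal weights
```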
Classification – Naïve Bayes
11
Missing Value Problem
Training: an instance with a missing value is simply not included in the frequency counts for that attribute value–class combination
Classification: the attribute with the missing value is omitted from the calculation
Classification – Naïve Bayes
12
Dealing with Numeric Attributes
Common assumption: attributes have a normal or Gaussian probability distribution (given the class)
The probability density function for the normal distribution is defined by two parameters:
The sample mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
The standard deviation: $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu)^2}$
The density function: $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Classification – Naïve Bayes
13
An Example: Evaluating the Weather Attributes (Numeric)
Outlook Temp. Humidity Windy Play
sunny 85 85 false no
sunny 80 90 true no
overcast 83 86 false yes
rainy 70 96 false yes
rainy 68 80 false yes
rainy 65 70 true no
overcast 64 65 true yes
sunny 72 95 false no
sunny 69 70 false yes
rainy 75 80 false yes
sunny 75 70 true yes
overcast 72 90 true yes
overcast 81 75 false yes
rainy 71 91 true no
Classification – Naïve Bayes
14
Statistics for the Weather Data
Example density values (using temperature for class yes: μ = 73, σ = 6.2; humidity for class no: μ = 86.2, σ = 9.7):
$$f(\text{temperature}=66 \mid \text{yes}) = \frac{1}{\sqrt{2\pi}\cdot 6.2}\, e^{-\frac{(66-73)^2}{2\cdot 6.2^2}} = 0.0340$$
$$f(\text{humidity}=90 \mid \text{no}) = \frac{1}{\sqrt{2\pi}\cdot 9.7}\, e^{-\frac{(90-86.2)^2}{2\cdot 9.7^2}} = 0.0380$$
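The density values above can be checked with a few lines of Python (a sketch, not the lecture’s code):

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal probability density f(x) with mean mu and std dev sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian_density(66, 73.0, 6.2))   # f(temperature=66 | yes) ~= 0.0340
print(gaussian_density(90, 86.2, 9.7))   # f(humidity=90 | no)     ~= 0.0380
```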
Classification – Naïve Bayes
15
Classify a New Case
Classify a new case (if any values are missing, in either the training data or the case being classified, omit them)
The case we would like to predict: e.g., outlook = sunny, temperature = 66, humidity = 90, windy = true
16
Probability Densities
Relationship between probability and density:
$$P\!\left(c - \tfrac{\varepsilon}{2} \le x \le c + \tfrac{\varepsilon}{2}\right) \approx \varepsilon \cdot f(c)$$
But this doesn’t change the calculation of a posteriori probabilities, because the factor ε cancels out.
Exact relationship:
$$P(a \le x \le b) = \int_a^b f(t)\,dt$$
Classification – Naïve Bayes
17
Discussion of Naïve Bayes
Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
Why? Because classification doesn’t require accurate probability estimates as long as the maximum probability is assigned to the correct class
However: adding too many redundant attributes will cause problems (e.g., identical attributes)
Note also: many numeric attributes are not normally distributed (→ kernel density estimators)
Classification – Naïve Bayes
18
General Bayesian Classification
Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
Classification – Naïve Bayes
19
Bayesian Theorem
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows from Bayes’ theorem
MAP (maximum a posteriori) hypothesis
Difficulty: need initial knowledge of many probabilities, significant computational cost
If assume P(hi) = P(hj) then method can further simplify, and choose the Maximum Likelihood (ML) hypothesis
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

$$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$$
Classification – Naïve Bayes
20
Naïve Bayes Classifiers
Assumption: attributes are conditionally independent (given the class):
Greatly reduces the computation cost: only count the class distribution. However, the assumption is seldom satisfied in practice, as attributes (variables) are often correlated.
Attempts to overcome this limitation:
Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
Decision trees, which reason on one attribute at a time, considering the most important attributes first
Association rules, which predict a class from several attributes
$$c_{MAP} = \arg\max_{c_i \in C} P(c_i \mid \{v_1, v_2, \ldots, v_J\}) = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{J} P(v_j \mid c_i)$$
Classification – Naïve Bayes
21
Bayesian Belief Network (An Example)
(Figure: a directed acyclic graph over the variables Storm, Lightning, Thunder, BusTourGroup, Campfire, and ForestFire; Storm and BusTourGroup are the parents of Campfire.)
The CPT for Campfire (S = Storm, B = BusTourGroup):

            (S,B)   (S,~B)   (~S,B)   (~S,~B)
Campfire     0.4     0.1      0.8      0.2
~Campfire    0.6     0.9      0.2      0.8
Also called Bayes Nets
The conditional probability table (CPT) for the variable Campfire
• The network represents a set of conditional independence assertions.
• Directed acyclic graph
Classification – Naïve Bayes
Attributes (variables) are often correlated. Each variable is conditionally independent of its nondescendants, given its immediate predecessors.
22
Bayesian Belief Network (Dependence and Independence)
Represents joint probability distribution over all variables, e.g., P(Storm, BusTourGroup,…,ForestFire)
In general,
$$P(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} P\big(y_i \mid Parents(Y_i)\big)$$
where Parents(Yi) denotes the immediate predecessors of Yi in the graph
So the joint distribution is fully defined by the graph plus the tables P(yi | Parents(Yi))
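As an illustration of this factorization, here is a hedged sketch (not from the lecture): it uses the Campfire CPT above, plus made-up priors for Storm and BusTourGroup, which the slide does not give:

```python
p_storm = {True: 0.3, False: 0.7}   # assumed prior, not from the slide
p_bus   = {True: 0.2, False: 0.8}   # assumed prior, not from the slide
p_campfire = {                       # CPT from the slide: P(C | S, B)
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """P(Storm, BusTourGroup, Campfire) as a product of parent factors."""
    p_c = p_campfire[(storm, bus)]
    if not campfire:
        p_c = 1.0 - p_c
    return p_storm[storm] * p_bus[bus] * p_c

print(joint(True, False, True))  # 0.3 * 0.8 * 0.1 = 0.024
```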
Classification – Naïve Bayes
23
Bayesian Belief Network (Inference in Bayes Nets)
Infer the values of one or more network variables, given observed values of others. The Bayes net contains all information needed for this inference.
If only one variable has an unknown value, it is easy to infer it.
In the general case, the problem is NP-hard.
There are three types of inference:
Top-down inference: p(Campfire|Storm)
Bottom-up inference: p(Storm|Campfire)
Hybrid inference: p(BusTourGroup|Storm,Campfire)
Classification – Naïve Bayes
24
Bayesian Belief Network (Training Bayesian Belief Networks)
Several variants of this learning task: the network structure might be known or unknown; training examples might provide values of all network variables, or just some.
If the structure is known and all variables are observed, then it is as easy as training a Naïve Bayes classifier.
If the structure is known but only some variables are observed (e.g., we observe ForestFire, Storm, BusTourGroup, Thunder but not Lightning, Campfire), use gradient ascent: converge to the network h that maximizes P(D|h).
Classification – Naïve Bayes
25
Numerical Modeling: Regression
A numeric model is used for prediction
Counterparts exist for all schemes that we previously discussed: decision trees, statistical models, etc.
All classification schemes can be applied to regression problems using discretization. Prediction: the weighted average of the intervals’ midpoints (weighted according to class probabilities)
Regression is more difficult than classification (i.e., percent correct vs. mean squared error)
Prediction – Regression
26
Linear Regression
Works most naturally with numeric attributes
Standard technique for numeric prediction
Outcome is linear combination of attributes
Weights are calculated from the training data
$$Y = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_k x_k = \sum_{j=0}^{k} w_j x_j$$
Predicted value for the first instance $x^{(1)}$:
$$\sum_{j=0}^{k} w_j x_j^{(1)} = w_0 x_0^{(1)} + w_1 x_1^{(1)} + \cdots + w_k x_k^{(1)}$$
Prediction – Regression
27
Minimize the Squared Error (I)
The k+1 coefficients are chosen so that the squared error on the training data is minimized
Squared error:
$$\sum_{i=1}^{n}\left(y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)}\right)^2$$
Coefficients can be derived using standard matrix operations
This can be done if there are more instances than attributes (roughly speaking); if there are fewer instances, there are many solutions
Minimization of the absolute error is more difficult!
Prediction – Regression
28
Minimize the Squared Error (II)
$$\min_{w} \sum_{i=1}^{n}\left(y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)}\right)^2 = \min_{w}\,\big\|\,Y - Xw\,\big\|^2$$

where
$$Y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{pmatrix},\qquad X = \begin{pmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_k^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_k^{(2)} \\ \vdots & \vdots & & \vdots \\ x_0^{(n)} & x_1^{(n)} & \cdots & x_k^{(n)} \end{pmatrix},\qquad w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_k \end{pmatrix}$$

with X of size n×(k+1), w of size (k+1)×1, and Y of size n×1.
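In practice, the matrix solution can be sketched with numpy’s least-squares solver (illustrative synthetic data, not the lecture’s code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])  # x0 = 1 column
w_true = np.array([1.0, 2.0, -3.0])
Y = X @ w_true + 0.1 * rng.normal(size=n)                  # noisy targets

w, residual, rank, _ = np.linalg.lstsq(X, Y, rcond=None)   # min ||Y - Xw||^2
print(w)  # close to [1, 2, -3]
```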
Prediction – Regression
29
Example: Find the linear regression of the salary data

Years of experience (x1)   Salary in $1000s (Y)
3    30
8    57
9    64
13   72
3    36
6    43
11   59
21   90
1    20
16   83

For simplicity, x0 = 1 and X = {x1}, therefore Y = w0 + w1 x1.
With the method of least squared error, using x̄ = 9.1, ȳ = 55.4, and s = number of training instances = 10:
$$w_1 = \frac{\sum_{i=1}^{s}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s}(x_i - \bar{x})^2} = 3.5$$
$$w_0 = \bar{y} - w_1\bar{x} = 23.55$$
The predicted line is estimated by Y = 23.55 + 3.5 x1.
Prediction for X = 10: Y = 23.55 + 3.5(10) = 58.55
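The calculation can be reproduced with a short script (a sketch; the exact least-squares slope is about 3.54, which the slide rounds to 3.5 before computing w0 = 23.55):

```python
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

x_bar = sum(xs) / len(xs)                      # 9.1
y_bar = sum(ys) / len(ys)                      # 55.4
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)       # ~3.54, rounded to 3.5
w0 = y_bar - w1 * x_bar                        # ~23.2 (23.55 with w1 = 3.5)
print(w0 + w1 * 10)                            # prediction for 10 years
```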
Prediction – Regression
30
Classification using Linear Regression (One against the Others)
Any regression technique can be used for classification
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that do not
Prediction: predict the class corresponding to the model with the largest output value (membership value)
For linear regression, this is known as multi-response linear regression
Prediction – Regression
For example, the data has three classes {A, B, C}.
Linear Regression Model 1: predict 1 for class A and 0 for not A
Linear Regression Model 2: predict 1 for class B and 0 for not B
Linear Regression Model 3: predict 1 for class C and 0 for not C
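A minimal sketch of multi-response linear regression for the three-class example above (assumed code; the toy data is made up for illustration):

```python
import numpy as np

def fit_multiresponse(X, labels, classes):
    X1 = np.hstack([np.ones((len(X), 1)), X])   # add intercept column
    W = {}
    for c in classes:
        y = (labels == c).astype(float)         # 1 for class c, else 0
        W[c], *_ = np.linalg.lstsq(X1, y, rcond=None)
    return W

def predict(W, x):
    x1 = np.concatenate([[1.0], x])
    return max(W, key=lambda c: W[c] @ x1)      # largest membership value

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]], dtype=float)
labels = np.array(["A", "A", "B", "B", "C", "C"])
W = fit_multiresponse(X, labels, ["A", "B", "C"])
print(predict(W, np.array([2.0, 2.5])))         # expect "C"
```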
31
Classification using Linear Regression (Pairwise Regression)
Another way of using regression for classification:
A regression function for every pair of classes, using only instances from these two classes
An output of +1 is assigned to one member of the pair, an output of –1 to the other
Prediction is done by voting: the class that receives the most votes is predicted
Alternative: output “don’t know” if there is no agreement
More likely to be accurate but more expensive
Prediction – Regression
For example, the data has three classes {A, B, C}.
Linear Regression Model 1: predict +1 for class A and -1 for class B
Linear Regression Model 2: predict +1 for class A and -1 for class C
Linear Regression Model 3: predict +1 for class B and -1 for class C
Regression Tree and Model Tree
PRP = -55.9 + 0.0489 MYCT + 0.153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
The CPU performance data (excerpt):

      cycle    main memory       cache   channels     perfor-
      time     min      max      (KB)    min   max    mance
      MYCT     MMIN     MMAX     CACH    CHMIN CHMAX  PRP
1     125      256      6000     256     16    128    198
2     29       8000     32000    32      8     32     269
3     29       8000     32000    32      8     32     220
4     29       8000     32000    32      8     32     172
5     29       8000     16000    32      8     16     132
...   ...      ...      ...      ...     ...   ...    ...
207   125      2000     8000     0       2     14     52
208   480      512      8000     32      0     0      67
209   480      1000     4000     0       0     0      45
(Figure: a regression tree for this data. The root splits on CHMIN ≤ 7.5 vs. > 7.5; deeper splits test CACH, MMAX, MYCT, CHMAX, and MMIN; each leaf predicts an averaged PRP value, e.g. 19.3 (28 instances/8.7% error), 29.8 (37/8.18%), 37.3 (19/11.3%), 59.3 (24/16.9%), 18.3 (7/3.83%), 75.7 (10/24.6%), 133 (16/28.8%), 157 (21/73.7%), 281 (11/56%), 492 (7/53.9%), 783 (5/35.9%).)

(Figure: a model tree for the same data. Splits on CHMIN ≤ 7.5 vs. > 7.5, CACH ≤ 0.5 vs. (0.5, 8.5] vs. > 8.5, and MMAX ≤ 4250 vs. > 4250 and ≤ 28000 vs. > 28000; the six leaves hold the linear models LM1 (65/7.32%), LM2 (26/6.37%), LM3 (24/14.5%), LM4 (50/22.1%), LM5 (21/45.5%), LM6 (23/63.5%).)

LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
LM3: PRP = 38.1 + 0.12 MMIN
LM4: PRP = 19.5 + 0.02 MMAX + 0.698 CACH + 0.969 CHMAX
LM5: PRP = 285 + 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX
Prediction – Regression
A regression tree is a decision tree with averaged numeric values at the leaves.
A model tree is a tree whose leaves contain linear regressions.
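As an illustration (assumed code, with a toy stand-in for the CPU data), a regression tree can be grown with scikit-learn; each leaf predicts the average target of the training instances that reach it:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy rows with columns MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX
X = np.array([[125, 256, 6000, 256, 16, 128],
              [29, 8000, 32000, 32, 8, 32],
              [29, 8000, 32000, 32, 8, 32],
              [29, 8000, 16000, 32, 8, 16],
              [480, 512, 8000, 32, 0, 0],
              [480, 1000, 4000, 0, 0, 0]], dtype=float)
y = np.array([198, 269, 220, 132, 67, 45], dtype=float)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[125, 2000, 8000, 0, 2, 14]]))  # averaged leaf value
```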
Support Vector Machine (SVM)
33
SVM is related to statistical learning theory
SVM was first introduced in 1992 [1] by Vladimir Vapnik, a researcher from the Soviet Union
SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4
SVM is now regarded as an important example of “kernel methods”, one of the key areas in machine learning
SVM is popularly used in classification tasks
Support Vector Machines
34
What is a good Decision Boundary?
Consider a two-class, linearly separable classification problem
There are many possible decision boundaries!
The Perceptron algorithm can be used to find such a boundary, and different algorithms have been proposed
Are all decision boundaries equally good?
(Figure: points of Class 1 and Class 2 with several candidate separating lines)
Support Vector Machines
35
Examples of Bad Decision Boundaries
(Figure: two panels of Class 1 vs. Class 2 points. Boundaries that pass very close to the training points of one class are bad; the boundary marked BEST leaves a clear gap to both classes.)
Support Vector Machines
Support Vector Machines 36
Large-margin Decision Boundary
The decision boundary should be as far away from the data of both classes as possible
We should maximize the margin, m
The distance between the origin and the line wᵀx = k is k/||w||, so the margin between the hyperplanes wᵀx + b = 1 and wᵀx + b = −1 is
$$m = \frac{2}{\|w\|}$$
(Figure: Class 1 and Class 2 separated by the boundary wᵀx + b = 0, flanked by wᵀx + b = 1 and wᵀx + b = −1 at distance m apart; w is normal to the boundary)
Support Vector Machines 37
Example
(Figure: a worked example on a 7×7 grid. The support vectors, the points lying on the two hyperplanes wᵀx + b = 1 and wᵀx + b = −1, yield a system of linear equations in w₁, w₂, and b; solving it gives the boundary, and the distance between the two hyperplanes is m = 2/||w||.)
Support Vector Machines 38
Example
(Figure: the same grid, showing the margin m = 2/||w|| between the hyperplanes wᵀx + b = ±1.)
Best boundary: maximize m, or equivalently minimize ||w||
As we also want to prevent data points from falling into the margin, we add the following constraints for each point i:
wᵀxᵢ + b ≥ 1 for xᵢ of the first class, and
wᵀxᵢ + b ≤ −1 for xᵢ of the second class
For n points, this can be rewritten as:
yᵢ(wᵀxᵢ + b) ≥ 1 for all 1 ≤ i ≤ n
Support Vector Machines 39
Primal form
Previously the problem was difficult to solve because it depends on ||w||, the norm of w, which involves a square root
We alter the objective by substituting ||w|| with ½||w||² (the factor of ½ being used for mathematical convenience)
This is called a quadratic programming (QP) optimization problem:
Minimize in (w, b):  ½||w||²
subject to (for any i = 1, …, n):  yᵢ(wᵀxᵢ + b) ≥ 1
How to solve this optimization, and more information on SVM (e.g., the dual form and kernels), can be found in ref [1]
[1] John Shawe-Taylor & Nello Cristianini, Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000. http://www.support-vector.net
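A hedged sketch of this QP in practice (scikit-learn’s SVC is assumed here, not the lecture’s software; a large C approximates the hard-margin problem):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0],                 # class -1 (toy points)
              [4, 4], [4, 6], [6, 4]], dtype=float)   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)           # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin =", 2 / np.linalg.norm(w))              # m = 2 / ||w||
print("support vectors:", clf.support_vectors_)
```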
40
Extension to Non-linear Decision Boundary
So far, we have only considered large-margin classifiers with a linear decision boundary
How can we generalize them to become nonlinear?
Key idea: transform xᵢ to a higher-dimensional space to “make life easier”
Input space: the space where the points xᵢ are located
Feature space: the space of f(xᵢ) after transformation
Why transform?
A linear operation in the feature space is equivalent to a non-linear operation in the input space
Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x₁x₂ makes the problem linearly separable (see the sketch below)
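A tiny sketch (assumed, not from the slides) of the XOR example just mentioned: the labels are not linearly separable in (x1, x2), but the added product feature makes them separable by a single plane:

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])            # XOR-style labels

X_feat = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])  # add x1*x2
# In the 3-D feature space the plane x1*x2 = 0 separates the classes:
print(np.sign(-X_feat[:, 2]) == y)      # all True: w = (0, 0, -1), b = 0
```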
Support Vector Machines
Support Vector Machines 41
Transforming the Data
Computation in the feature space can be costly because it is high dimensional
The feature space is typically infinite-dimensional!
The kernel trick can help (more info. in ref [1])
(Figure: points of the input space are mapped by f(·) into the feature space. Note: in practice the feature space is of higher dimension than the input space.)
[1] John Shawe-Taylor & Nello Cristianini, Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000. http://www.support-vector.net
42
Why Does SVM Work?
The feature space is often very high dimensional. Why don’t we have the curse of dimensionality?
A classifier in a high-dimensional space has many parameters and is hard to estimate
Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of a classifier
Typically, a classifier with many parameters is very flexible, but there are also exceptions
Let xᵢ = 10ⁱ where i ranges from 1 to n. The one-parameter classifier sign(sin(αx)) can classify all xᵢ correctly for every possible combination of class labels on the xᵢ
This 1-parameter classifier is very flexible
Support Vector Machines
43
Why Does SVM Work? (cont.)
Vapnik argues that the flexibility of a classifier should not be characterized by its number of parameters, but by its capacity
This is formalized by the “VC-dimension” of a classifier
Consider a linear classifier in two-dimensional space
If we have three training data points, no matter how those points are labeled, we can classify them perfectly
Support Vector Machines
44
VC-dimension
However, if we have four points, we can find a labeling such that the linear classifier fails to be perfect
We can see that 3 is the critical number
The VC-dimension of a linear classifier in a 2D space is 3 because, if we have 3 points in the training set, perfect classification is always possible irrespective of the labeling, whereas for 4 points, perfect classification can be impossible
Support Vector Machines
45
Other Aspects of SVM
How to use SVM for multi-class classification?
The original SVM is for binary classification
One can change the QP formulation to become multi-class
More often, multiple binary classifiers are combined: one can train multiple one-versus-the-rest classifiers, or combine multiple pairwise classifiers “intelligently”
How to interpret the SVM discriminant function value as a probability?
By performing logistic regression on the SVM output over a set of data (a validation set) that is not used for training
Some SVM software (like libsvm) has these features built in
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification
SVMLight is among the earliest implementations of SVM
Several Matlab toolboxes for SVM are also available
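Both ideas can be sketched with scikit-learn’s libsvm wrapper (an illustration, not the lecture’s software): multi-class prediction is handled by combining pairwise classifiers internally, and probability=True fits a logistic (Platt-style) calibration on the SVM outputs:

```python
import numpy as np
from sklearn.svm import SVC

# toy three-class data, five points per class (illustrative values only)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
              [3, 3], [3, 4], [4, 3], [4, 4], [3.5, 3.5],
              [7, 0], [7, 1], [8, 0], [8, 1], [7.5, 0.5]], dtype=float)
y = np.array([0] * 5 + [1] * 5 + [2] * 5)

clf = SVC(kernel="rbf", probability=True).fit(X, y)
print(clf.predict([[3.0, 3.5]]))        # multi-class prediction (expect 1)
print(clf.predict_proba([[3.0, 3.5]]))  # calibrated class probabilities
```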
Support Vector Machines
46
Strengths and Weaknesses of SVM
Strengths
Training is relatively easy: no local optima, unlike in neural networks
It scales relatively well to high-dimensional data
The tradeoff between classifier complexity and error can be controlled explicitly
Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
Weaknesses
Need to choose a “good” kernel function
Support Vector Machines
47
Example: Predicting a class label using naïve Bayesian classification
RID  Age    Income  Student  Credit_rating  Class: buys_computer
1    <=30   High    No       Fair           No
2    <=30   High    No       Excellent      No
3    31…40  High    No       Fair           Yes
4    >40    Medium  No       Fair           Yes
5    >40    Low     Yes      Fair           Yes
6    >40    Low     Yes      Excellent      No
7    31…40  Low     Yes      Excellent      Yes
8    <=30   Medium  No       Fair           No
9    <=30   Low     Yes      Fair           Yes
10   >40    Medium  Yes      Fair           Yes
11   <=30   Medium  Yes      Excellent      Yes
12   31…40  Medium  No       Excellent      Yes
13   31…40  High    Yes      Fair           Yes
14   >40    Medium  No       Excellent      No
15   <=30   Medium  Yes      Fair           ?  (the unknown sample)
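As a worked check (counts taken from the table above, with X = (age = <=30, income = medium, student = yes, credit_rating = fair)):

$$P(X \mid \text{yes})\,P(\text{yes}) = \tfrac{2}{9}\cdot\tfrac{4}{9}\cdot\tfrac{6}{9}\cdot\tfrac{6}{9}\cdot\tfrac{9}{14} \approx 0.028$$

$$P(X \mid \text{no})\,P(\text{no}) = \tfrac{3}{5}\cdot\tfrac{2}{5}\cdot\tfrac{1}{5}\cdot\tfrac{2}{5}\cdot\tfrac{5}{14} \approx 0.007$$

so naïve Bayes predicts buys_computer = Yes for the unknown sample.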
Data Warehousing and Data Mining by Kritsada Sriphaew
48
Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  N
Sunny     Hot          High      True   N
Overcast  Hot          High      False  Y
Rainy     Mild         High      False  Y
Rainy     Cool         Normal    False  Y
Rainy     Cool         Normal    True   N
Overcast  Cool         Normal    True   Y
Sunny     Mild         High      False  N
Sunny     Cool         Normal    False  Y
Rainy     Mild         Normal    False  Y
Sunny     Mild         Normal    True   Y
Overcast  Hot          Normal    False  Y
Overcast  Mild         High      True   Y
Rainy     Mild         High      True   N
Sunny     Cool         Normal    False  ?
Rainy     Mild         High      False  ?
Exercise:
Use the naïve Bayesian classifier to predict the Play value of the unknown data samples (the last two rows).