DBM630: Data Mining and
Data Warehousing
MS.IT. Rangsit University
1
Semester 2/2011
Lecture 7
Classification and Prediction: Naïve Bayes, Regression and SVM
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
2
Statistical Modeling: Naïve Bayes Classification (sparseness problem, missing values, numeric attributes)
Regression: Linear Regression, Regression Tree
Support Vector Machine
Data Warehousing and Data Mining by Kritsada Sriphaew
3
Statistical Modeling
“Opposite” of 1R: use all the attributes
Two assumptions: attributes are (1) equally important and (2) statistically independent (given the class value)
This means that knowledge about the value of a particular attribute doesn’t tell us anything about the value of another attribute (if the class is known)
Although based on assumptions that are almost never correct, this scheme works well in practice!
Classification – Naïve Bayes
4
An Example: Evaluating the Weather Attributes (Revised)
Outlook Temp. Humidity Windy Play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
1R chooses the attribute that produces rules with the smallest number of errors, i.e., rule set 1 (Outlook) or 3 (Humidity), each with 4/14 total errors.
Attribute   Rule              Errors   Total errors
Outlook     sunny → no        2/5      4/14
            overcast → yes    0/4
            rainy → yes       2/5
Temp.       hot → no*         2/4      5/14
            mild → yes        2/6
            cool → yes        1/4
Humidity    high → no         3/7      4/14
            normal → yes      1/7
Windy       false → yes       2/8      5/14
            true → no*        3/6
(* a random choice between two equally likely outcomes)
Classification – Naïve Bayes
5
Probabilities for the Weather Data
Probabilistic model
Classification – Naïve Bayes
6
Bayes’s Rule
Probability of event H given evidence E:
A priori probability of H: p(H)
Probability of event before evidence has been seen
A posteriori probability of H: p(H|E)
Probability of the event after the evidence has been seen
$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$
Classification – Naïve Bayes
7
Naïve Bayes for Classification
Classification learning: what’s the probability of the class given an instance?
Evidence E = instance; Event H = class value for the instance
Naïve Bayes assumption: “independent feature model”, i.e., the presence (or absence) of a particular attribute (or feature) of a class is unrelated to the presence (or absence) of any other attribute, therefore:
$$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1 \mid H)\,P(E_2 \mid H)\cdots P(E_n \mid H)\,P(H)}{P(E)}$$
(compare with Bayes’s rule: $P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$)
Classification – Naïve Bayes
8
Naïve Bayes for Classification
$$P(\text{play}=\text{yes} \mid \text{outlook}=\text{sunny},\ \text{temp}=\text{cool},\ \text{humidity}=\text{high},\ \text{windy}=\text{true})$$
$$= \frac{P(\text{sunny}\mid\text{yes})\,P(\text{cool}\mid\text{yes})\,P(\text{high}\mid\text{yes})\,P(\text{true}\mid\text{yes})\,P(\text{yes})}{P(\text{sunny},\text{cool},\text{high},\text{true})} = \frac{\frac{2}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{3}{9}\cdot\frac{9}{14}}{P(\text{sunny},\text{cool},\text{high},\text{true})}$$
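To make this computation concrete, here is a minimal sketch (not from the lecture; the data is the 14-day table above) that reproduces these numbers and normalizes them into posterior probabilities:

```python
from collections import Counter

data = [  # (outlook, temp, humidity, windy, play)
    ("sunny","hot","high",False,"no"),    ("sunny","hot","high",True,"no"),
    ("overcast","hot","high",False,"yes"),("rainy","mild","high",False,"yes"),
    ("rainy","cool","normal",False,"yes"),("rainy","cool","normal",True,"no"),
    ("overcast","cool","normal",True,"yes"),("sunny","mild","high",False,"no"),
    ("sunny","cool","normal",False,"yes"),("rainy","mild","normal",False,"yes"),
    ("sunny","mild","normal",True,"yes"), ("overcast","mild","high",True,"yes"),
    ("overcast","hot","normal",False,"yes"),("rainy","mild","high",True,"no"),
]

class_counts = Counter(row[-1] for row in data)
n = len(data)

def score(instance, cls):
    """Unnormalized P(cls) * prod_j P(attr_j = value_j | cls)."""
    s = class_counts[cls] / n
    for j, value in enumerate(instance):
        match = sum(1 for row in data if row[-1] == cls and row[j] == value)
        s *= match / class_counts[cls]
    return s

x = ("sunny", "cool", "high", True)
scores = {c: score(x, c) for c in class_counts}  # yes: 2/9*3/9*3/9*3/9*9/14
total = sum(scores.values())                     # estimate of P(E)
for c, s in scores.items():
    print(c, s / total)                          # posterior probabilities
```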
Classification – Naïve Bayes
9
The Sparseness Problem (The “zero-frequency problem”)
What if an attribute value doesn’t occur with every class value (e.g., “Outlook = overcast” for class “no”)?
The conditional probability will be zero! P(outlook=overcast | play=no) = 0
The a posteriori probability will then also be zero, no matter how likely the other values are!
P(play=no | outlook=overcast, temp=cool, humidity=high, windy=true) = 0
Remedy: add 1 to the count for every attribute value–class combination (Laplace estimator)
Result: probabilities will never be zero! (This also stabilizes probability estimates.)
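A one-line sketch of the Laplace estimator (illustrative code; n_values is the number of possible values of the attribute):

```python
def smoothed_prob(value_count, class_count, n_values):
    """P(value | class) with the Laplace (add-one) estimator."""
    return (value_count + 1) / (class_count + n_values)

# e.g. P(outlook=overcast | play=no): the raw estimate 0/5 becomes
print(smoothed_prob(0, 5, 3))  # (0+1)/(5+3) = 0.125, never zero
```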
Classification – Naïve Bayes
10
Modified Probability Estimates
In some cases, adding a constant different from 1 might be more appropriate.
Example: attribute outlook for class yes (counts 2 sunny, 4 overcast, 3 rainy out of 9 “yes” days). With a smoothing constant m, we can apply equal weights, or weights that don’t need to be equal (as long as they sum to 1, that is, p1 + p2 + p3 = 1).

Equal weights:
$$P(\text{sunny}\mid\text{yes}) = \frac{2 + m/3}{9 + m},\qquad P(\text{overcast}\mid\text{yes}) = \frac{4 + m/3}{9 + m},\qquad P(\text{rainy}\mid\text{yes}) = \frac{3 + m/3}{9 + m}$$

Normalized weights (p1 + p2 + p3 = 1):
$$P(\text{sunny}\mid\text{yes}) = \frac{2 + m\,p_1}{9 + m},\qquad P(\text{overcast}\mid\text{yes}) = \frac{4 + m\,p_2}{9 + m},\qquad P(\text{rainy}\mid\text{yes}) = \frac{3 + m\,p_3}{9 + m}$$
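A small sketch of this m-estimate (assumed code, with m and the priors p as tunable inputs), applied to the sunny/overcast/rainy counts above:

```python
def m_estimate(count, class_count, m, p):
    """(count + m*p) / (class_count + m): count smoothed toward prior p."""
    return (count + m * p) / (class_count + m)

counts = {"sunny": 2, "overcast": 4, "rainy": 3}  # out of 9 "yes" days
for value, c in counts.items():
    print(value, m_estimate(c, class_count=9, m=3.0, p=1/3))  # equal weights
```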
Classification – Naïve Bayes
11
Missing Value Problem
Training: an instance with a missing value is simply not included in the frequency counts for that attribute value–class combination
Classification: the attribute with the missing value is omitted from the calculation
Classification – Naïve Bayes
12
Dealing with Numeric Attributes
Common assumption: attributes have a normal or Gaussian probability distribution (given the class)
The probability density function for the normal distribution is defined by two parameters:
The sample mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
The standard deviation: $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu)^2}$
The density function: $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Classification – Naïve Bayes
13
An Example: Evaluating the Weather Attributes (Numeric)
Outlook Temp. Humidity Windy Play
sunny 85 85 false no
sunny 80 90 true no
overcast 83 86 false yes
rainy 70 96 false yes
rainy 68 80 false yes
rainy 65 70 true no
overcast 64 65 true yes
sunny 72 95 false no
sunny 69 70 false yes
rainy 75 80 false yes
sunny 75 70 true yes
overcast 72 90 true yes
overcast 81 75 false yes
rainy 71 91 true no
Classification – Naïve Bayes
14
Statistics for the Weather Data
Example density values (using temperature for class yes: μ = 73, σ = 6.2; humidity for class no: μ = 86.2, σ = 9.7):
$$f(\text{temperature}=66 \mid \text{yes}) = \frac{1}{\sqrt{2\pi}\cdot 6.2}\, e^{-\frac{(66-73)^2}{2\cdot 6.2^2}} = 0.0340$$
$$f(\text{humidity}=90 \mid \text{no}) = \frac{1}{\sqrt{2\pi}\cdot 9.7}\, e^{-\frac{(90-86.2)^2}{2\cdot 9.7^2}} = 0.0380$$
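The density values above can be checked with a few lines of Python (a sketch, not the lecture’s code):

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal probability density f(x) with mean mu and std dev sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian_density(66, 73.0, 6.2))   # f(temperature=66 | yes) ~= 0.0340
print(gaussian_density(90, 86.2, 9.7))   # f(humidity=90 | no)     ~= 0.0380
```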
Classification – Naïve Bayes
15
Classify a New Case
Classify a new case (if any values are missing, in either the training data or the case being classified, omit them)
The case we would like to predict: e.g., outlook = sunny, temperature = 66, humidity = 90, windy = true
16
Probability Densities
Relationship between probability and density:
$$P\!\left(c - \tfrac{\varepsilon}{2} \le x \le c + \tfrac{\varepsilon}{2}\right) \approx \varepsilon \cdot f(c)$$
But this doesn’t change the calculation of a posteriori probabilities, because the factor ε cancels out.
Exact relationship:
$$P(a \le x \le b) = \int_a^b f(t)\,dt$$
Classification – Naïve Bayes
17
Discussion of Naïve Bayes
Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
Why? Because classification doesn’t require accurate probability estimates as long as the maximum probability is assigned to the correct class
However: adding too many redundant attributes will cause problems (e.g., identical attributes)
Note also: many numeric attributes are not normally distributed (→ kernel density estimators)
Classification – Naïve Bayes
18
General Bayesian Classification
Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
Classification – Naïve Bayes
19
Bayesian Theorem
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows from Bayes’ theorem
MAP (maximum a posteriori) hypothesis
Difficulty: need initial knowledge of many probabilities, significant computational cost
If assume P(hi) = P(hj) then method can further simplify, and choose the Maximum Likelihood (ML) hypothesis
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

$$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$$
Classification – Naïve Bayes
20
Naïve Bayes Classifiers
Assumption: attributes are conditionally independent (given the class):
Greatly reduces the computation cost: only count the class distribution. However, the assumption is seldom satisfied in practice, as attributes (variables) are often correlated.
Attempts to overcome this limitation:
Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
Decision trees, which reason on one attribute at a time, considering the most important attributes first
Association rules, which predict a class from several attributes
$$c_{MAP} = \arg\max_{c_i \in C} P(c_i \mid \{v_1, v_2, \ldots, v_J\}) = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{J} P(v_j \mid c_i)$$
Classification – Naïve Bayes
21
Bayesian Belief Network (An Example)
(Figure: a directed acyclic graph over the variables Storm, Lightning, Thunder, BusTourGroup, Campfire, and ForestFire; Storm and BusTourGroup are the parents of Campfire.)
The CPT for Campfire (S = Storm, B = BusTourGroup):

            (S,B)   (S,~B)   (~S,B)   (~S,~B)
Campfire     0.4     0.1      0.8      0.2
~Campfire    0.6     0.9      0.2      0.8
Also called Bayes Nets
The conditional probability table (CPT) for the variable Campfire
• The network represents a set of conditional independence assertions.
• Directed acyclic graph
Classification – Naïve Bayes
Attributes (variables) are often correlated. Each variable is conditionally independent of its nondescendants, given its immediate predecessors.
22
Bayesian Belief Network (Dependence and Independence)
Represents joint probability distribution over all variables, e.g., P(Storm, BusTourGroup,…,ForestFire)
In general,
$$P(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} P\big(y_i \mid Parents(Y_i)\big)$$
where Parents(Yi) denotes the immediate predecessors of Yi in the graph
So the joint distribution is fully defined by the graph plus the tables P(yi | Parents(Yi))
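As an illustration of this factorization, here is a hedged sketch (not from the lecture): it uses the Campfire CPT above, plus made-up priors for Storm and BusTourGroup, which the slide does not give:

```python
p_storm = {True: 0.3, False: 0.7}   # assumed prior, not from the slide
p_bus   = {True: 0.2, False: 0.8}   # assumed prior, not from the slide
p_campfire = {                       # CPT from the slide: P(C | S, B)
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """P(Storm, BusTourGroup, Campfire) as a product of parent factors."""
    p_c = p_campfire[(storm, bus)]
    if not campfire:
        p_c = 1.0 - p_c
    return p_storm[storm] * p_bus[bus] * p_c

print(joint(True, False, True))  # 0.3 * 0.8 * 0.1 = 0.024
```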
Classification – Naïve Bayes
23
Bayesian Belief Network (Inference in Bayes Nets)
Infer the values of one or more network variables, given observed values of others. The Bayes net contains all information needed for this inference.
If only one variable has an unknown value, it is easy to infer it.
In the general case, the problem is NP-hard.
There are three types of inference:
Top-down inference: p(Campfire|Storm)
Bottom-up inference: p(Storm|Campfire)
Hybrid inference: p(BusTourGroup|Storm,Campfire)
Classification – Naïve Bayes
24
Bayesian Belief Network (Training Bayesian Belief Networks)
Several variants of this learning task: the network structure might be known or unknown; training examples might provide values of all network variables, or just some.
If the structure is known and all variables are observed, then it is as easy as training a Naïve Bayes classifier.
If the structure is known but only some variables are observed (e.g., we observe ForestFire, Storm, BusTourGroup, Thunder but not Lightning, Campfire), use gradient ascent: converge to the network h that maximizes P(D|h).
Classification – Naïve Bayes
25
Numerical Modeling: Regression
A numeric model is used for prediction
Counterparts exist for all schemes that we previously discussed: decision trees, statistical models, etc.
All classification schemes can be applied to regression problems using discretization. Prediction: the weighted average of the intervals’ midpoints (weighted according to class probabilities)
Regression is more difficult than classification (i.e., percent correct vs. mean squared error)
Prediction – Regression
26
Linear Regression
Works most naturally with numeric attributes
Standard technique for numeric prediction
Outcome is linear combination of attributes
Weights are calculated from the training data
$$Y = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_k x_k = \sum_{j=0}^{k} w_j x_j$$
Predicted value for the first instance $x^{(1)}$:
$$\sum_{j=0}^{k} w_j x_j^{(1)} = w_0 x_0^{(1)} + w_1 x_1^{(1)} + \cdots + w_k x_k^{(1)}$$
Prediction – Regression
27
Minimize the Squared Error (I)
The k+1 coefficients are chosen so that the squared error on the training data is minimized
Squared error:
$$\sum_{i=1}^{n}\left(y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)}\right)^2$$
Coefficients can be derived using standard matrix operations
This can be done if there are more instances than attributes (roughly speaking); if there are fewer instances, there are many solutions
Minimization of the absolute error is more difficult!
Prediction – Regression
28
Minimize the Squared Error (II)
$$\min_{w} \sum_{i=1}^{n}\left(y^{(i)} - \sum_{j=0}^{k} w_j x_j^{(i)}\right)^2 = \min_{w}\,\big\|\,Y - Xw\,\big\|^2$$

where
$$Y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{pmatrix},\qquad X = \begin{pmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_k^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_k^{(2)} \\ \vdots & \vdots & & \vdots \\ x_0^{(n)} & x_1^{(n)} & \cdots & x_k^{(n)} \end{pmatrix},\qquad w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_k \end{pmatrix}$$

with X of size n×(k+1), w of size (k+1)×1, and Y of size n×1.
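In practice, the matrix solution can be sketched with numpy’s least-squares solver (illustrative synthetic data, not the lecture’s code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])  # x0 = 1 column
w_true = np.array([1.0, 2.0, -3.0])
Y = X @ w_true + 0.1 * rng.normal(size=n)                  # noisy targets

w, residual, rank, _ = np.linalg.lstsq(X, Y, rcond=None)   # min ||Y - Xw||^2
print(w)  # close to [1, 2, -3]
```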
Prediction – Regression
29
Example: Find the linear regression of the salary data

Years of experience (x1)   Salary in $1000s (Y)
3    30
8    57
9    64
13   72
3    36
6    43
11   59
21   90
1    20
16   83

For simplicity, x0 = 1 and X = {x1}, therefore Y = w0 + w1 x1.
With the method of least squared error, using x̄ = 9.1, ȳ = 55.4, and s = number of training instances = 10:
$$w_1 = \frac{\sum_{i=1}^{s}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s}(x_i - \bar{x})^2} = 3.5$$
$$w_0 = \bar{y} - w_1\bar{x} = 23.55$$
The predicted line is estimated by Y = 23.55 + 3.5 x1.
Prediction for X = 10: Y = 23.55 + 3.5(10) = 58.55
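The calculation can be reproduced with a short script (a sketch; the exact least-squares slope is about 3.54, which the slide rounds to 3.5 before computing w0 = 23.55):

```python
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

x_bar = sum(xs) / len(xs)                      # 9.1
y_bar = sum(ys) / len(ys)                      # 55.4
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)       # ~3.54, rounded to 3.5
w0 = y_bar - w1 * x_bar                        # ~23.2 (23.55 with w1 = 3.5)
print(w0 + w1 * 10)                            # prediction for 10 years
```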
Prediction – Regression
30
Classification using Linear Regression (One against the Others)
Any regression technique can be used for classification
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that do not
Prediction: predict the class corresponding to the model with the largest output value (membership value)
For linear regression, this is known as multi-response linear regression
Prediction – Regression
For example, the data has three classes {A, B, C}.
Linear Regression Model 1: predict 1 for class A and 0 for not A
Linear Regression Model 2: predict 1 for class B and 0 for not B
Linear Regression Model 3: predict 1 for class C and 0 for not C
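A minimal sketch of multi-response linear regression for the three-class example above (assumed code; the toy data is made up for illustration):

```python
import numpy as np

def fit_multiresponse(X, labels, classes):
    X1 = np.hstack([np.ones((len(X), 1)), X])   # add intercept column
    W = {}
    for c in classes:
        y = (labels == c).astype(float)         # 1 for class c, else 0
        W[c], *_ = np.linalg.lstsq(X1, y, rcond=None)
    return W

def predict(W, x):
    x1 = np.concatenate([[1.0], x])
    return max(W, key=lambda c: W[c] @ x1)      # largest membership value

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]], dtype=float)
labels = np.array(["A", "A", "B", "B", "C", "C"])
W = fit_multiresponse(X, labels, ["A", "B", "C"])
print(predict(W, np.array([2.0, 2.5])))         # expect "C"
```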
31
Classification using Linear Regression (Pairwise Regression)
Another way of using regression for classification:
A regression function for every pair of classes, using only instances from these two classes
An output of +1 is assigned to one member of the pair, an output of –1 to the other
Prediction is done by voting: the class that receives the most votes is predicted
Alternative: output “don’t know” if there is no agreement
More likely to be accurate but more expensive
Prediction – Regression
For example, the data has three classes {A, B, C}.
Linear Regression Model 1: predict +1 for class A and -1 for class B
Linear Regression Model 2: predict +1 for class A and -1 for class C
Linear Regression Model 3: predict +1 for class B and -1 for class C
Regression Tree and Model Tree
PRP = -55.9 + 0.0489 MYCT + 0.153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
The CPU performance data (excerpt):

      cycle    main memory       cache   channels     perfor-
      time     min      max      (KB)    min   max    mance
      MYCT     MMIN     MMAX     CACH    CHMIN CHMAX  PRP
1     125      256      6000     256     16    128    198
2     29       8000     32000    32      8     32     269
3     29       8000     32000    32      8     32     220
4     29       8000     32000    32      8     32     172
5     29       8000     16000    32      8     16     132
...   ...      ...      ...      ...     ...   ...    ...
207   125      2000     8000     0       2     14     52
208   480      512      8000     32      0     0      67
209   480      1000     4000     0       0     0      45
(Figure: a regression tree for this data. The root splits on CHMIN ≤ 7.5 vs. > 7.5; deeper splits test CACH, MMAX, MYCT, CHMAX, and MMIN; each leaf predicts an averaged PRP value, e.g. 19.3 (28 instances/8.7% error), 29.8 (37/8.18%), 37.3 (19/11.3%), 59.3 (24/16.9%), 18.3 (7/3.83%), 75.7 (10/24.6%), 133 (16/28.8%), 157 (21/73.7%), 281 (11/56%), 492 (7/53.9%), 783 (5/35.9%).)

(Figure: a model tree for the same data. Splits on CHMIN ≤ 7.5 vs. > 7.5, CACH ≤ 0.5 vs. (0.5, 8.5] vs. > 8.5, and MMAX ≤ 4250 vs. > 4250 and ≤ 28000 vs. > 28000; the six leaves hold the linear models LM1 (65/7.32%), LM2 (26/6.37%), LM3 (24/14.5%), LM4 (50/22.1%), LM5 (21/45.5%), LM6 (23/63.5%).)

LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
LM3: PRP = 38.1 + 0.12 MMIN
LM4: PRP = 19.5 + 0.02 MMAX + 0.698 CACH + 0.969 CHMAX
LM5: PRP = 285 + 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX
Prediction – Regression
A regression tree is a decision tree with averaged numeric values at the leaves.
A model tree is a tree whose leaves contain linear regressions.
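As an illustration (assumed code, with a toy stand-in for the CPU data), a regression tree can be grown with scikit-learn; each leaf predicts the average target of the training instances that reach it:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy rows with columns MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX
X = np.array([[125, 256, 6000, 256, 16, 128],
              [29, 8000, 32000, 32, 8, 32],
              [29, 8000, 32000, 32, 8, 32],
              [29, 8000, 16000, 32, 8, 16],
              [480, 512, 8000, 32, 0, 0],
              [480, 1000, 4000, 0, 0, 0]], dtype=float)
y = np.array([198, 269, 220, 132, 67, 45], dtype=float)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[125, 2000, 8000, 0, 2, 14]]))  # averaged leaf value
```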
Support Vector Machine (SVM)
33
SVM is related to statistical learning theory
SVM was first introduced in 1992 [1] by Vladimir Vapnik, a researcher from the Soviet Union
SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4
SVM is now regarded as an important example of “kernel methods”, one of the key areas in machine learning
SVM is popularly used in classification tasks
Support Vector Machines
34
What is a good Decision Boundary?
Consider a two-class, linearly separable classification problem
There are many possible decision boundaries!
The Perceptron algorithm can be used to find such a boundary, and different algorithms have been proposed
Are all decision boundaries equally good?
(Figure: points of Class 1 and Class 2 with several candidate separating lines)
Support Vector Machines
35
Examples of Bad Decision Boundaries
(Figure: two panels of Class 1 vs. Class 2 points. Boundaries that pass very close to the training points of one class are bad; the boundary marked BEST leaves a clear gap to both classes.)
Support Vector Machines
Support Vector Machines 36
Large-margin Decision Boundary
The decision boundary should be as far away from the data of both classes as possible
We should maximize the margin, m
The distance between the origin and the line wᵀx = k is k/||w||, so the margin between the hyperplanes wᵀx + b = 1 and wᵀx + b = −1 is
$$m = \frac{2}{\|w\|}$$
(Figure: Class 1 and Class 2 separated by the boundary wᵀx + b = 0, flanked by wᵀx + b = 1 and wᵀx + b = −1 at distance m apart; w is normal to the boundary)
Support Vector Machines 37
Example
(Figure: a worked example on a 7×7 grid. The support vectors, the points lying on the two hyperplanes wᵀx + b = 1 and wᵀx + b = −1, yield a system of linear equations in w₁, w₂, and b; solving it gives the boundary, and the distance between the two hyperplanes is m = 2/||w||.)
Support Vector Machines 38
Example
(Figure: the same grid, showing the margin m = 2/||w|| between the hyperplanes wᵀx + b = ±1.)
Best boundary: maximize m, or equivalently minimize ||w||
As we also want to prevent data points from falling into the margin, we add the following constraints for each point i:
wᵀxᵢ + b ≥ 1 for xᵢ of the first class, and
wᵀxᵢ + b ≤ −1 for xᵢ of the second class
For n points, this can be rewritten as:
yᵢ(wᵀxᵢ + b) ≥ 1 for all 1 ≤ i ≤ n
Support Vector Machines 39
Primal form
Previously the problem was difficult to solve because it depends on ||w||, the norm of w, which involves a square root
We alter the objective by substituting ||w|| with ½||w||² (the factor of ½ being used for mathematical convenience)
This is called a quadratic programming (QP) optimization problem:
Minimize in (w, b):  ½||w||²
subject to (for any i = 1, …, n):  yᵢ(wᵀxᵢ + b) ≥ 1
How to solve this optimization, and more information on SVM (e.g., the dual form and kernels), can be found in ref [1]
[1] John Shawe-Taylor & Nello Cristianini, Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000. http://www.support-vector.net
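A hedged sketch of this QP in practice (scikit-learn’s SVC is assumed here, not the lecture’s software; a large C approximates the hard-margin problem):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0],                 # class -1 (toy points)
              [4, 4], [4, 6], [6, 4]], dtype=float)   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)           # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin =", 2 / np.linalg.norm(w))              # m = 2 / ||w||
print("support vectors:", clf.support_vectors_)
```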
40
Extension to Non-linear Decision Boundary
So far, we have only considered large-margin classifiers with a linear decision boundary
How can we generalize them to become nonlinear?
Key idea: transform xᵢ to a higher-dimensional space to “make life easier”
Input space: the space where the points xᵢ are located
Feature space: the space of f(xᵢ) after transformation
Why transform?
A linear operation in the feature space is equivalent to a non-linear operation in the input space
Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x₁x₂ makes the problem linearly separable (see the sketch below)
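A tiny sketch (assumed, not from the slides) of the XOR example just mentioned: the labels are not linearly separable in (x1, x2), but the added product feature makes them separable by a single plane:

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])            # XOR-style labels

X_feat = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])  # add x1*x2
# In the 3-D feature space the plane x1*x2 = 0 separates the classes:
print(np.sign(-X_feat[:, 2]) == y)      # all True: w = (0, 0, -1), b = 0
```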
Support Vector Machines
Support Vector Machines 41
Transforming the Data
Computation in the feature space can be costly because it is high dimensional
The feature space is typically infinite-dimensional!
The kernel trick can help (more info. in ref [1])
(Figure: points of the input space are mapped by f(·) into the feature space. Note: in practice the feature space is of higher dimension than the input space.)
[1] John Shawe-Taylor & Nello Cristianini, Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000. http://www.support-vector.net
42
Why Does SVM Work?
The feature space is often very high dimensional. Why don’t we have the curse of dimensionality?
A classifier in a high-dimensional space has many parameters and is hard to estimate
Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of a classifier
Typically, a classifier with many parameters is very flexible, but there are also exceptions
Let xᵢ = 10ⁱ where i ranges from 1 to n. The one-parameter classifier sign(sin(αx)) can classify all xᵢ correctly for every possible combination of class labels on the xᵢ
This 1-parameter classifier is very flexible
Support Vector Machines
43
Why Does SVM Work? (cont.)
Vapnik argues that the flexibility of a classifier should not be characterized by its number of parameters, but by its capacity
This is formalized by the “VC-dimension” of a classifier
Consider a linear classifier in two-dimensional space
If we have three training data points, no matter how those points are labeled, we can classify them perfectly
Support Vector Machines
44
VC-dimension
However, if we have four points, we can find a labeling such that the linear classifier fails to be perfect
We can see that 3 is the critical number
The VC-dimension of a linear classifier in a 2D space is 3 because, if we have 3 points in the training set, perfect classification is always possible irrespective of the labeling, whereas for 4 points, perfect classification can be impossible
Support Vector Machines
45
Other Aspects of SVM
How to use SVM for multi-class classification?
The original SVM is for binary classification
One can change the QP formulation to become multi-class
More often, multiple binary classifiers are combined: one can train multiple one-versus-the-rest classifiers, or combine multiple pairwise classifiers “intelligently”
How to interpret the SVM discriminant function value as a probability?
By performing logistic regression on the SVM output over a set of data (a validation set) that is not used for training
Some SVM software (like libsvm) has these features built in
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification
SVMLight is among the earliest implementations of SVM
Several Matlab toolboxes for SVM are also available
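Both ideas can be sketched with scikit-learn’s libsvm wrapper (an illustration, not the lecture’s software): multi-class prediction is handled by combining pairwise classifiers internally, and probability=True fits a logistic (Platt-style) calibration on the SVM outputs:

```python
import numpy as np
from sklearn.svm import SVC

# toy three-class data, five points per class (illustrative values only)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
              [3, 3], [3, 4], [4, 3], [4, 4], [3.5, 3.5],
              [7, 0], [7, 1], [8, 0], [8, 1], [7.5, 0.5]], dtype=float)
y = np.array([0] * 5 + [1] * 5 + [2] * 5)

clf = SVC(kernel="rbf", probability=True).fit(X, y)
print(clf.predict([[3.0, 3.5]]))        # multi-class prediction (expect 1)
print(clf.predict_proba([[3.0, 3.5]]))  # calibrated class probabilities
```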
Support Vector Machines
46
Strengths and Weaknesses of SVM
Strengths
Training is relatively easy: no local optima, unlike in neural networks
It scales relatively well to high-dimensional data
The tradeoff between classifier complexity and error can be controlled explicitly
Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
Weaknesses
Need to choose a “good” kernel function
Support Vector Machines
47
Example: Predicting a class label using naïve Bayesian classification
RID  Age    Income  Student  Credit_rating  Class: buys_computer
1    <=30   High    No       Fair           No
2    <=30   High    No       Excellent      No
3    31…40  High    No       Fair           Yes
4    >40    Medium  No       Fair           Yes
5    >40    Low     Yes      Fair           Yes
6    >40    Low     Yes      Excellent      No
7    31…40  Low     Yes      Excellent      Yes
8    <=30   Medium  No       Fair           No
9    <=30   Low     Yes      Fair           Yes
10   >40    Medium  Yes      Fair           Yes
11   <=30   Medium  Yes      Excellent      Yes
12   31…40  Medium  No       Excellent      Yes
13   31…40  High    Yes      Fair           Yes
14   >40    Medium  No       Excellent      No
15   <=30   Medium  Yes      Fair           ?  (the unknown sample)
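As a worked check (counts taken from the table above, with X = (age = <=30, income = medium, student = yes, credit_rating = fair)):

$$P(X \mid \text{yes})\,P(\text{yes}) = \tfrac{2}{9}\cdot\tfrac{4}{9}\cdot\tfrac{6}{9}\cdot\tfrac{6}{9}\cdot\tfrac{9}{14} \approx 0.028$$

$$P(X \mid \text{no})\,P(\text{no}) = \tfrac{3}{5}\cdot\tfrac{2}{5}\cdot\tfrac{1}{5}\cdot\tfrac{2}{5}\cdot\tfrac{5}{14} \approx 0.007$$

so naïve Bayes predicts buys_computer = Yes for the unknown sample.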
Data Warehousing and Data Mining by Kritsada Sriphaew
48
Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  N
Sunny     Hot          High      True   N
Overcast  Hot          High      False  Y
Rainy     Mild         High      False  Y
Rainy     Cool         Normal    False  Y
Rainy     Cool         Normal    True   N
Overcast  Cool         Normal    True   Y
Sunny     Mild         High      False  N
Sunny     Cool         Normal    False  Y
Rainy     Mild         Normal    False  Y
Sunny     Mild         Normal    True   Y
Overcast  Hot          Normal    False  Y
Overcast  Mild         High      True   Y
Rainy     Mild         High      True   N
Sunny     Cool         Normal    False  ?
Rainy     Mild         High      False  ?
Exercise:
Use the naïve Bayesian classifier to predict the Play value of the unknown data samples (the last two rows).