CES 514 – Data Mining Lecture 8: Classification (contd.)
Date posted: 20-Dec-2015
Transcript of CES 514 – Data Mining Lecture 8 classification (contd…)
Page 1 – CES 514 – Data Mining Lecture 8: Classification (contd.)
Page 2 – Example: PEBLS
PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
– Works with both continuous and nominal features; for nominal features, the distance between two values is computed using the modified value difference metric (MVDM)
– Each record is assigned a weight factor
– Number of nearest neighbors: k = 1
Page 3 – Example: PEBLS
Class counts by Marital Status:

| Class | Single | Married | Divorced |
|-------|--------|---------|----------|
| Yes   | 2      | 0       | 1        |
| No    | 2      | 4       | 1        |

For two values V1 and V2 of a nominal attribute, the modified value difference metric is

d(V1, V2) = Σ_i | n1i / n1 − n2i / n2 |

where n1i is the number of records of class i that take value V1, and n1 is the total number of records with value V1 (similarly for n2i and n2).
Distance between nominal attribute values:
d(Single, Married) = |2/4 − 0/4| + |2/4 − 4/4| = 1
d(Single, Divorced) = |2/4 − 1/2| + |2/4 − 1/2| = 0
d(Married, Divorced) = |0/4 − 1/2| + |4/4 − 1/2| = 1
d(Refund=Yes, Refund=No) = |0/3 − 3/7| + |3/3 − 4/7| = 6/7
| Tid | Refund | Marital Status | Taxable Income | Cheat |
|-----|--------|----------------|----------------|-------|
| 1   | Yes    | Single         | 125K           | No    |
| 2   | No     | Married        | 100K           | No    |
| 3   | No     | Single         | 70K            | No    |
| 4   | Yes    | Married        | 120K           | No    |
| 5   | No     | Divorced       | 95K            | Yes   |
| 6   | No     | Married        | 60K            | No    |
| 7   | Yes    | Divorced       | 220K           | No    |
| 8   | No     | Single         | 85K            | Yes   |
| 9   | No     | Married        | 75K            | No    |
| 10  | No     | Single         | 90K            | Yes   |
Class counts by Refund:

| Class | Refund=Yes | Refund=No |
|-------|------------|-----------|
| Yes   | 0          | 3         |
| No    | 3          | 4         |
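The MVDM distances above can be reproduced with a short Python sketch; the count table and the function name `mvdm` are my own framing of the computation described on this page:

```python
# Class counts per attribute value: value -> [count(class=Yes), count(class=No)]
marital_counts = {
    "Single":   [2, 2],
    "Married":  [0, 4],
    "Divorced": [1, 1],
}

def mvdm(counts, v1, v2):
    """Modified value difference metric between two nominal values."""
    n1, n2 = sum(counts[v1]), sum(counts[v2])
    return sum(abs(c1 / n1 - c2 / n2)
               for c1, c2 in zip(counts[v1], counts[v2]))

print(mvdm(marital_counts, "Single", "Married"))    # 1.0
print(mvdm(marital_counts, "Single", "Divorced"))   # 0.0
print(mvdm(marital_counts, "Married", "Divorced"))  # 1.0
```

The outputs match the worked distances above: Single and Divorced have identical class distributions (distance 0), while Married differs from both.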
Page 4 – Example: PEBLS
Distance between record X and record Y:

δ(X, Y) = w_X · w_Y · Σ_{i=1}^{d} d(X_i, Y_i)²

| Tid | Refund | Marital Status | Taxable Income | Cheat |
|-----|--------|----------------|----------------|-------|
| X   | Yes    | Single         | 125K           | No    |
| Y   | No     | Married        | 100K           | No    |

where:

w_X = (Number of times X is used for prediction) / (Number of times X correctly predicts)

– w_X ≈ 1 if X makes accurate predictions most of the time
– w_X > 1 if X is not reliable for making predictions
Page 5 – Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
Page 6 – Support Vector Machines
One possible solution: hyperplane B1.
Page 7 – Support Vector Machines
Another possible solution: hyperplane B2.
Page 8 – Support Vector Machines
Other possible solutions, e.g., B2.
Page 9 – Support Vector Machines
Which one is better, B1 or B2? How do you define "better"?
Page 10 – Support Vector Machines
Find the hyperplane that maximizes the margin; by this criterion, B1 is better than B2.

(Figure: hyperplanes B1 and B2 with their margin boundaries b11, b12 and b21, b22.)
Page 11 – Support Vector Machines
For hyperplane B1 with margin boundaries b11 and b12:

w · x + b = 0 (decision boundary)
w · x + b = +1 and w · x + b = −1 (margin boundaries)

f(x) = 1 if w · x + b ≥ 1, and f(x) = −1 if w · x + b ≤ −1

Margin = 2 / ||w||²
Page 12 – Support Vector Machines
We want to maximize: Margin = 2 / ||w||²
– This is equivalent to minimizing: L(w) = ||w||² / 2
– But subject to the following constraints:

f(x_i) = 1 if w · x_i + b ≥ 1, and f(x_i) = −1 if w · x_i + b ≤ −1

This is a constrained optimization problem.
– Numerical approaches exist to solve it (e.g., quadratic programming).
Page 13 – Overview of optimization
The simplest optimization problem: maximize f(x) (one variable).
If the function has nice properties (such as being differentiable), we can use calculus: solve the equation f′(x) = 0. If a is a root and f″(a) < 0, then a is a maximum.
Tricky issues:
• How do we solve the equation f′(x) = 0?
• What if there are many solutions? Each is only a "local" optimum.
Page 14 – How to solve g(x) = 0
Even polynomial equations are very hard to solve.
The quadratic has a closed form. What about higher degrees?
Numerical techniques (iteration):
• bisection
• secant
• Newton-Raphson, etc.
Challenges:
• choosing the initial guess
• rate of convergence
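A minimal Python sketch of the Newton-Raphson iteration mentioned above (the function names, tolerance, and example equation are illustrative, not from the slides):

```python
def newton(g, dg, x0, tol=1e-10, max_iter=100):
    """Newton-Raphson iteration for g(x) = 0.

    Convergence is quadratic near a simple root, but a poor
    initial guess x0 can make the iteration diverge or cycle,
    which is exactly the 'initial guess' challenge above.
    """
    x = x0
    for _ in range(max_iter):
        step = g(x) / dg(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("did not converge")

# Example: the positive root of x^2 - 2 = 0, i.e. sqrt(2)
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ~1.41421356...
```

Bisection and the secant method follow the same iterate-until-tolerance pattern, trading convergence speed for robustness.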
Page 15 – Functions of several variables
Consider a function of two variables, F(x, y).
To find the maximum of F(x, y), we solve the pair of equations

∂F/∂x = 0 and ∂F/∂y = 0

If we can solve this system of equations, we have found a local maximum or minimum of F. The system can be solved using numerical techniques similar to those for the one-dimensional case.
Page 16 – When is the solution a maximum or a minimum?
• Hessian:

H(x, y) = | ∂²f/∂x²   ∂²f/∂x∂y |
          | ∂²f/∂y∂x  ∂²f/∂y²  |

• If the Hessian is positive definite in a neighborhood of a, then a is a minimum.
• If the Hessian is negative definite in a neighborhood of a, then a is a maximum.
• If it is neither, then a is a saddle point.
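For two variables, the definiteness test reduces to checking the sign of ∂²f/∂x² and of the determinant of H. A small sketch (function name and examples are mine):

```python
def classify_critical_point(fxx, fxy, fyy):
    """Classify a critical point of f(x, y) from its 2x2 Hessian.

    Second-derivative test for H = [[fxx, fxy], [fxy, fyy]]:
    det H > 0 with fxx > 0 -> minimum (H positive definite);
    det H > 0 with fxx < 0 -> maximum (H negative definite);
    det H < 0 -> saddle point; det H == 0 is inconclusive.
    """
    det = fxx * fyy - fxy * fxy
    if det > 0:
        return "minimum" if fxx > 0 else "maximum"
    if det < 0:
        return "saddle"
    return "inconclusive"

print(classify_critical_point(2, 0, 2))    # f = x^2 + y^2 at (0,0): minimum
print(classify_critical_point(-2, 0, -2))  # f = -(x^2 + y^2): maximum
print(classify_critical_point(0, 1, 0))    # f = x*y at (0,0): saddle
```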
Page 17 – Application: linear regression
Problem: given (x1, y1), …, (xn, yn), find the best linear relation between x and y.
Assume y = Ax + B. To find A and B, we minimize

E(A, B) = Σ_{j=1}^{n} (y_j − (A x_j + B))²

Since E is a function of two variables, we can solve by setting

∂E/∂A = 0 and ∂E/∂B = 0
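Setting the two partial derivatives to zero yields a closed-form solution for A and B; a minimal sketch (the function name and test points are illustrative):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = A*x + B.

    Solving dE/dA = 0 and dE/dB = 0 gives the normal equations:
      A = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2)
      B = (sum(y) - A*sum(x)) / n
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    A = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    B = (sy - A * sx) / n
    return A, B

# Points that lie exactly on y = 2x + 1
A, B = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(A, B)  # 2.0 1.0
```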
Page 18 – Constrained optimization
Maximize f(x, y) subject to g(x, y) = c.
Using a Lagrange multiplier λ, the problem is formulated as maximizing:

h(x, y, λ) = f(x, y) + λ (g(x, y) − c)

Now solve the equations:

∂h/∂x = 0, ∂h/∂y = 0, ∂h/∂λ = 0
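As a small worked example (not from the slides): maximize f(x, y) = x·y subject to x + y = 10. The Lagrange conditions give y + λ = 0, x + λ = 0, and x + y = 10, so x = y = 5 and f = 25. A brute-force check along the constraint line:

```python
# On the constraint x + y = 10, the objective is x * (10 - x).
# The Lagrange conditions predict the maximum at x = y = 5 with f = 25.
best = max((x * (10 - x), x) for x in [i / 100 for i in range(1001)])
print(best)  # (25.0, 5.0)
```

The grid search agrees with the analytic solution, which is the point of the multiplier method: it converts a constrained problem into an unconstrained system of equations.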
Page 19 – Support Vector Machines (contd.)
What if the problem is not linearly separable?
Page 20 – Support Vector Machines
What if the problem is not linearly separable?
– Introduce slack variables ξ_i ≥ 0

Need to minimize:

L(w) = ||w||² / 2 + C Σ_{i=1}^{N} ξ_i^k

Subject to:

f(x_i) = 1 if w · x_i + b ≥ 1 − ξ_i, and f(x_i) = −1 if w · x_i + b ≤ −1 + ξ_i
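The soft-margin objective above is normally solved by quadratic programming, but for k = 1 it can also be minimized approximately by sub-gradient descent on the equivalent hinge-loss form. A rough sketch for 2-D data; the function name, learning rate, and toy data set are my own assumptions:

```python
def train_linear_svm(data, C=1.0, lr=0.01, epochs=500):
    """Sub-gradient descent on the k = 1 soft-margin objective
    L(w, b) = ||w||^2 / 2 + C * sum_i max(0, 1 - y_i * (w.x_i + b)),
    where the max(...) term plays the role of the slack xi_i.
    A rough sketch for 2-D inputs, not an optimized QP solver.
    """
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:  # inside the margin: hinge term is active
                w = [wi - lr * (wi - C * y * xi) for wi, xi in zip(w, x)]
                b += lr * C * y
            else:           # outside the margin: only the regularizer acts
                w = [wi - lr * wi for wi in w]
    return w, b

# Toy 2-D data: class +1 in the upper right, class -1 in the lower left
data = [([2, 2], 1), ([3, 3], 1), ([-2, -2], -1), ([-3, -1], -1)]
w, b = train_linear_svm(data)
print(all((w[0] * x[0] + w[1] * x[1] + b > 0) == (y > 0) for x, y in data))
```

On separable data like this the learned hyperplane classifies all points correctly; with overlapping classes, C controls how heavily margin violations are penalized.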
Page 21 – Nonlinear Support Vector Machines
What if the decision boundary is not linear?
Page 22 – Nonlinear Support Vector Machines
Transform the data into a higher-dimensional space.
Page 23 – Artificial Neural Networks (ANN)
| X1 | X2 | X3 | Y |
|----|----|----|---|
| 1  | 0  | 0  | 0 |
| 1  | 0  | 1  | 1 |
| 1  | 1  | 0  | 1 |
| 1  | 1  | 1  | 1 |
| 0  | 0  | 1  | 0 |
| 0  | 1  | 0  | 0 |
| 0  | 1  | 1  | 1 |
| 0  | 0  | 0  | 0 |

(Figure: a black box with inputs X1, X2, X3 and output Y.)
Output Y is 1 if at least two of the three inputs are equal to 1.
Page 24 – Artificial Neural Networks (ANN)
(The truth table is the same as on the previous page. Figure: the black box is now shown as a single unit with weight 0.3 on each input link and threshold t = 0.4.)

Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0)

where I(z) = 1 if z is true, and 0 otherwise.
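The threshold unit can be checked directly against the truth table; a small sketch (the function name is mine):

```python
from itertools import product

def unit(x1, x2, x3):
    """The page-24 unit: Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)."""
    return 1 if 0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4 > 0 else 0

# The unit fires exactly when at least two of the three inputs are 1
for x in product([0, 1], repeat=3):
    assert unit(*x) == (1 if sum(x) >= 2 else 0)
print("matches the truth table")
```

Each active input contributes 0.3, so the sum exceeds the 0.4 threshold only when two or more inputs are 1, reproducing the table exactly.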
Page 25 – Artificial Neural Networks (ANN)
The model is an assembly of interconnected nodes and weighted links.

The output node sums its input values, weighted by the corresponding links, and compares the result against a threshold t.

(Figure: a black box with inputs X1, X2, X3, link weights w1, w2, w3, threshold t, and output node Y.)

Perceptron model:

Y = I(Σ_i w_i X_i − t > 0)   or   Y = sign(Σ_i w_i X_i − t)
Page 26 – General Structure of ANN
(Figure: neuron i receives inputs I1, I2, I3 over links with weights wi1, wi2, wi3; it forms the weighted sum S_i and emits the output O_i = g(S_i), where g is the activation function and t the threshold. A full network stacks an input layer (x1 … x5), a hidden layer, and an output layer producing y.)

Training an ANN means learning the weights of the neurons.
Page 27 – Algorithm for learning ANN
Initialize the weights (w0, w1, …, wk).

Adjust the weights so that the output of the ANN is consistent with the class labels of the training examples.
– Objective function:

E = Σ_i [ Y_i − f(Σ_j w_j X_ij) ]²

– Find the weights w_j that minimize this objective function, e.g., using the backpropagation algorithm.
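As a minimal illustration of weight learning, here is a gradient-descent sketch for a single linear unit (the delta rule); backpropagation generalizes this update to multilayer networks with nonlinear activations. Function names and the toy data are illustrative:

```python
def train_unit(samples, lr=0.05, epochs=500):
    """Delta-rule sketch: minimize E = sum_i (y_i - w.x_i)^2 for one
    linear unit by gradient descent. For a one-layer network with
    identity activation, backpropagation reduces to this update.
    """
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, y in samples:
            out = sum(wj * xj for wj, xj in zip(w, x))
            err = y - out
            # dE/dw_j = -2 * err * x_j; the factor 2 is folded into lr
            w = [wj + lr * err * xj for wj, xj in zip(w, x)]
    return w

# Learn y = 1*x1 + 2*x2 from four noiseless samples (bias omitted)
samples = [([1, 0], 1), ([0, 1], 2), ([1, 1], 3), ([2, 1], 4)]
w = train_unit(samples)
print([round(wj, 3) for wj in w])  # approximately [1.0, 2.0]
```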
Page 28 – WEKA
Page 29 – WEKA implementations
WEKA has implementations of all the major data mining algorithms, including:
• decision trees (CART, C4.5, etc.)
• the naïve Bayes algorithm and its variants
• nearest-neighbor classifiers
• linear classifiers
• support vector machines
• clustering algorithms
• boosting algorithms, etc.