Classification DMM
-
Upload
mostafa-heidary -
Category
Documents
-
view
229 -
download
0
Transcript of Classification DMM
-
7/30/2019 Classification DMM
1/50
Data Mining 1
1
Jalali.mshdiau.ac.ir
-
7/30/2019 Classification DMM
2/50
Data Mining 2
Association Rules
2
-
7/30/2019 Classification DMM
3/50
3
What Is Classification?
The goal of data classification is to organize and
categorize data in distinct classes
A model is first created based on the data distribution
The model is then used to classify new data
Given the model, a class can be predicted for new data
Classification = prediction for discrete and nominal
values
With classification, I can predict in which bucket to put the ball,but I cant predict the weight of the ball
-
7/30/2019 Classification DMM
4/50
4
Prediction, Clustering, Classification
What is Prediction? The goal of prediction is to forecast or deduce the value of an attribute
based on values of other attributes
models continuous-valued functions, i.e., predicts unknown or missing values
A model is first created based on the data distribution
The model is then used to predict future or unknown values
Supervised vs. Unsupervised Classification Supervised Classification = Classification
We know the class labels and the number ofclasses
Unsupervised Classification = Clustering
We do not know the class labels and may not knowthe number of classes
-
7/30/2019 Classification DMM
5/50
5
Classification and Prediction
Typical applications
Credit approval
Target marketing
Medical diagnosis
Fraud detection
-
7/30/2019 Classification DMM
6/50
6
Classification: 3 Step Process
1. Model construction (Learning): Each record (instance) is assumed to belong to a predefined class, as
determined by one of the attributes, called the class label
The set of all records used for construction of the model is called training set
The model is usually represented in the form of classification rules, (IF-THENstatements) or decision trees
2. Model Evaluation (Accuracy): Estimate accuracy rate of the model based on a test set
The known label of test sample is compared with the classified result from model
Accuracy rate: percentage of test set samples correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
3. Model Use (Classification): The model is used to classify unseen instances (assigning class labels)
Predict the value of an actual attribute
-
7/30/2019 Classification DMM
7/50
7
Process (1): Model Construction
Training
Data
N A M E R A N K Y E A R S T E N U R E D
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yesBill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Pro f 3 no
Classification
Algorithms
IF rank = professor
OR years > 6
THEN tenured = yes
Classifier
(Model)
-
7/30/2019 Classification DMM
8/50
8
Process (2): Using the Model in Prediction
Classifier
Testing
Data
N A M E R A N K Y E A R S T E N U R E D
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
-
7/30/2019 Classification DMM
9/50
9
Accuracy of Supervised Learning
Holdout Approach
A certain amount of data (Usually, one-third) for testing and
remainder (two-third) for training
K-Fold cross validation
Split the data into k subsets of equal size, then each subset is turn is
used for testing and remainder for training
-
7/30/2019 Classification DMM
10/50
10
Issues: Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes Data transformation
Normalize data
-
7/30/2019 Classification DMM
11/50
11
Issues: Evaluating Classification Methods
Accuracy
classifier accuracy: predicting class label
predictor accuracy: guessing value of predicted attributes
Speed
time to construct the model (training time) time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision treesize or compactness of classification rules
-
7/30/2019 Classification DMM
12/50
12
Other Issues
Function complexity and amount of training data
If the true function is simple then an learning algorithm will be able to
learn it from a small amount of data, otherwise the function will only
be learnable from a very large amount of training data.
Dimensionality of the input space Noise in the output values
Heterogeneity of the data.
If the feature vectors include features of many different kinds
(discrete, discrete ordered, counts, continuous values), somealgorithms are easier to apply than others.
Dependency of features
-
7/30/2019 Classification DMM
13/50
13
Decision Tree - Review of Basics
What exactly is a Decision Tree?
A tree where each branching node represents a choice between two ormore alternatives, with every branching node being part of a path to aleaf node (bottom of the tree). The leaf node represents a decision,
derived from the tree for the given input.
How can Decision Trees be used to classify instances of data?
Instead of representing decisions, leaf nodes represent a particularclassification of a data instance, based on the given set of attributes (andtheir discrete values) that define the instance of data, which is kind of likea relational tuple for illustration purposes.
-
7/30/2019 Classification DMM
14/50
14
Review of Basics (contd)
It is important that data instances have boolean or discrete
data values for their attributes to help with the basic
understanding of ID3, although there are extensions of ID3
that deal with continuous data.
-
7/30/2019 Classification DMM
15/50
15
How does ID3 relate to Decision Trees, then?
ID3, or Iterative Dichotomiser 3 Algorithm, is a Decision Tree learning
algorithm. The name is correct in that it creates Decision Trees for
dichotomizing data instances, or classifying them discretely through
branching nodes until a classification bucket is reached (leaf node).
By using ID3 and other machine-learning algorithms from Artificial
Intelligence, expert systems can engage in tasks usually done by human
experts, such as doctors diagnosing diseases by examining various
symptoms (the attributes) of patients (the data instances) in a complex
Decision Tree.
Of course, accurate Decision Trees are fundamental to Data Mining and
Databases.
-
7/30/2019 Classification DMM
16/50
16
Other Decision Trees
ID3
J48
C4.5
SLIQ SPRINT
PUBLIC
RainForest
BOAT
Data Cube-based Decision Tree
-
7/30/2019 Classification DMM
17/50
17
Tree Presentation
-
7/30/2019 Classification DMM
18/50
18
Decision Tree Induction: Training Dataset
age income student credit_rating buys_computer
40 low yes fair yes>40 low yes excellent no
3140 low yes excellent yes
-
7/30/2019 Classification DMM
19/50
19
age?
overcast
student? credit rating?
40
no yes yes
yes
31..40
fairexcellentyesno
Output: A Decision Tree for buys_computer
-
7/30/2019 Classification DMM
20/50
20
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in
advance) Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure
(e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class There are no remaining attributes for further partitioningmajority voting is
employed for classifying the leaf
There are no samples left
-
7/30/2019 Classification DMM
21/50
21
Entropy
8
2
P
8
6P )
8
6log
8
6()
8
2log
8
2( 22 E
S is the training example
is the proportion of positive examples in S
is the proportion of negative examples in S
Entropy measures the impurity of S
For Non-Boolean classification :
-
7/30/2019 Classification DMM
22/50
22
Gain (S,A) = Expected reduction in entropy due to sorting on A
Attribute Selection Measure: Information Gain
(ID3/C4.5)
-
7/30/2019 Classification DMM
23/50
23
Class P: buys_computer = yes
Class N: buys_computer = no
Attribute Selection: Information Gain
)()()()( 4040
40..31
40..31
30
30
)(
SEntropyS
SSEntropy
S
SSEntropy
S
SSEntropy
S
sv
Avaluev
v
age income student credit_rating buys_computer
40 low yes fair yes
>40 low yes excellent no
3140 low yes excellent yes
-
7/30/2019 Classification DMM
24/50
24
Attribute Selection: Information Gain
048.0)_,(
151.0),(
029.0),(
ratingcreditSGain
studentSGain
incomeSGain
246.0694.0940.0),( ageSGain
age income student credit_rating buys_computer
40 low yes fair yes
>40 low yes excellent no
3140 low yes excellent yes
-
7/30/2019 Classification DMM
25/50
25
ID3 Algorithm
-
7/30/2019 Classification DMM
26/50
26
Computing Information-Gain for Continuous-
Value Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split pointfor A
Sort the value A in increasing order
Typically, the midpoint between each pair of adjacent values is considered as
a possible split point
(ai+ai+1)/2 is the midpoint between the values of ai and ai+1
The point with the minimum expected information requirementfor A is
selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A split-point, and D2 is the set of tuples
in D satisfying A > split-point
-
7/30/2019 Classification DMM
27/50
27
Gain Ratio for Attribute Selection and Cost of Attributes
What happens if you choose Day as root ??
Attribute with Cost
-
7/30/2019 Classification DMM
28/50
28
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction earlydo not split a node if this would result in the
goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a fully grown treeget a sequence of progressively
pruned trees
Use a set of data different from the training data to decide which is the best pruned tree
-
7/30/2019 Classification DMM
29/50
29
Enhancements to Basic Decision Tree Induction
Allow for continuous-valued attributes Dynamically define new discrete-valued attributes that partition the continuous
attribute value into a discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that are sparsely represented
This reduces fragmentation, repetition, and replication
-
7/30/2019 Classification DMM
30/50
30
Instance Based Learning
Lazy learning (e.g., instance-based learning): Simply stores
training data (or only minor processing) and waits until it is
given a test tuple
Eager learning (eg. Decision trees, SVM, NN): Given a set of
training set, constructs a classification model before receiving
new (e.g., test) data to classify
-
7/30/2019 Classification DMM
31/50
31
Instance-based Learning
Its very similar to aDesktop!!
-
7/30/2019 Classification DMM
32/50
32
Example
Image Scene Classification
-
7/30/2019 Classification DMM
33/50
33
Instance-based Learning
When To Consider IBL Instances map to points
Less than 20 attributes per instance
Lots of training data
Advantages:
Learn complex target functions(Class)
Don't lose information
Disadvantages:
Slow at query time
Easily fooled by irrelevant attributes
-
7/30/2019 Classification DMM
34/50
34
K-Nearest Neighbor Learning (k-NN)
k-NN is a type of instance-based learning, or lazy learning
where the function is only approximated locally and all
computation is deferred until classification.
The k-nearest neighbor algorithm is amongst the simplest of
all machine learning algorithms: an object is classified by amajority vote of its neighbors, with the object being assigned
to the class most common amongst its k nearest neighbors (k
is a positive integer, typically small).
If k = 1, then the object is simply assigned to the class of its nearestneighbor.
-
7/30/2019 Classification DMM
35/50
35
K-Nearest Neighbor Learning (k-NN)- Cont.
Usually Euclidean distance is used as the distance metric;
however this is only applicable to continuous variables. In
cases such as text classification, another metric such as the
overlap metric (or Hamming distance: it measures the minimum
number ofsubstitutions required to change one string into the other, or thenumber oferrors that transformed one string into the other. ) can be used.
-
7/30/2019 Classification DMM
36/50
36
Distance or Similarity Measures
Common Distance Measures:
Manhattan distance:
Euclidean distance:
Cosine similarity:
( , ) 1 ( , )dist X Y sim X Y 2 2
( )
( , )i i
i
i ii i
x y
sim X Yx y
-
7/30/2019 Classification DMM
37/50
37
KNN - Algorithm
Key Idea: Just store all training examples
Nearest Neighbour :
Given query instance Xq , First locate nearest training example Xn ,
Then estimate
K Nearest Neighbour :
Given xq , take vote among its K nearest nbrs ( If discrete-value for
class)
Take mean offValues of k nearest nbrs (if real-value)
-
7/30/2019 Classification DMM
38/50
38
Figure K Nearest Neighbors Example
X
Stored training set patterns
X input pattern for classification
--- Euclidean distance measure to
the nearest three patterns
-
7/30/2019 Classification DMM
39/50
39
twoone
four
three
five six
seven Eight ?
Which one belongs to Mondrian ?
-
7/30/2019 Classification DMM
40/50
40
Training data
Number Lines Line types Rectangles Colours Mondrian?
1 6 1 10 4No
2 4 2 8 5 No
3 5 2 7 4 Yes
4 5 1 8 4 Yes
5 5 1 10 5 No
6 6 1 8 6 Yes
7 7 1 14 5 No
Number Lines Line types Rectangles Colours Mondrian?
8 7 2 9 4
Test instance
-
7/30/2019 Classification DMM
41/50
41
Normalization
Min-max normalization: to [new_minA, new_maxA]
Z-score normalization (: mean, : standard deviation):
Normalization by decimal scaling
AAA
AA
A
minnewminnewmaxnewminmax
minvv _)__('
A
Avv
'
j
vv
10' Where j is the smallest integer such that Max(||) < 1
-
7/30/2019 Classification DMM
42/50
42
Normalised training data
Number Lines Linetypes
Rectangles Colours Mondrian?
1 0.632 -0.632 0.327 -1.021 No
2 -1.581 1.581 -0.588 0.408 No
3 -0.474 1.581 -1.046 -1.021 Yes
4 -0.474 -0.632 -0.588 -1.021 Yes
5 -0.474 -0.632 0.327 0.408 No
6 0.632 -0.632 -0.588 1.837 Yes
7 1.739 -0.632 2.157 0.408 No
Number Lines Linetypes
Rectangles Colours Mondrian?
8 1.739 1.581 -0.131 -1.021
Test instance
-
7/30/2019 Classification DMM
43/50
43
Distances of test instance from training data
Example Distanceof testfromexample
Mondrian?
1 2.517 No
23.644
No
3 2.395 Yes
4 3.164 Yes
5 3.472 No
6 3.808 Yes
7 3.490 No
Classification
1-NN Yes
3-NN Yes
5-NN No
7-NN No
-
7/30/2019 Classification DMM
44/50
44
Example
Classify Theatre ?
-
7/30/2019 Classification DMM
45/50
45
Nearest neighbors algorithms: illustration
++
++
-
-
-
-
-
-
e1
1-nearest neighbor:
the concept represented by e1
5-nearest neighbors:
q1 is classified as negative
q1
-
7/30/2019 Classification DMM
46/50
46
Voronoi diagram
querypoint qf
nearest neighbor qi
-
7/30/2019 Classification DMM
47/50
47
Variant of kNN: Distance-Weighted kNN
We might want to weight nearer neighbors more heavily
Then it makes sense to use all training examples instead of just k
2
1
1
),(
1where
)(:)(
iq
ik
i i
k
i ii
qd
ww
fwf
xx
x
x
-
7/30/2019 Classification DMM
48/50
48
Difficulties with k-nearest neighbour
algorithms
Have to calculate the distance of the test case from all
training cases
There may be irrelevant attributes amongst the attributes
curse of dimensionality
For instance , we have 20 attribute, however 2 of them are
appropriate attributes , non-relevant attributes effect to
results
Solution : Assign appropriate Weight to the attributes by
using cross-validation method.
-
7/30/2019 Classification DMM
49/50
49
How to choose k
Large k:
less sensitive to noise (particularly class noise)
better probability estimates for discrete classes
larger training sets allow larger values of k
Small k: captures fine structure of problem space better
may be necessary with small training sets
Balance must be struck between large and small k
As training set approaches infinity, and k grows large, kNNbecomes Bayes optimal
-
7/30/2019 Classification DMM
50/50
1. Use Weka to classify and evaluate Letter Recognition dataset through IBL2. Write down 3 new challenges in KNN