Classification DMM


Transcript of Classification DMM

  • Slide 1

    Data Mining

    [email protected]

    Jalali.mshdiau.ac.ir

  • Slide 2

    Data Mining

    Association Rules

  • Slide 3

    What Is Classification?

    The goal of data classification is to organize and

    categorize data in distinct classes

    A model is first created based on the data distribution

    The model is then used to classify new data

    Given the model, a class can be predicted for new data

    Classification = prediction for discrete and nominal

    values

    With classification, I can predict in which bucket to put the ball, but I cannot predict the weight of the ball

  • Slide 4

    Prediction, Clustering, Classification

    What is Prediction? The goal of prediction is to forecast or deduce the value of an attribute

    based on values of other attributes

    models continuous-valued functions, i.e., predicts unknown or missing values

    A model is first created based on the data distribution

    The model is then used to predict future or unknown values

    Supervised vs. Unsupervised Classification

    Supervised Classification = Classification

    We know the class labels and the number of classes

    Unsupervised Classification = Clustering

    We do not know the class labels and may not know the number of classes

  • Slide 5

    Classification and Prediction

    Typical applications

    Credit approval

    Target marketing

    Medical diagnosis

    Fraud detection

  • Slide 6

    Classification: 3 Step Process

    1. Model construction (Learning): Each record (instance) is assumed to belong to a predefined class, as

    determined by one of the attributes, called the class label

    The set of all records used for construction of the model is called the training set

    The model is usually represented in the form of classification rules (IF-THEN statements) or decision trees

    2. Model Evaluation (Accuracy): Estimate accuracy rate of the model based on a test set

    The known label of test sample is compared with the classified result from model

    Accuracy rate: percentage of test set samples correctly classified by the model

    Test set is independent of training set; otherwise over-fitting will occur

    3. Model Use (Classification): The model is used to classify unseen instances (assigning class labels)

    Predict the value of an actual attribute
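
    The three steps can be illustrated with a small scikit-learn sketch (an illustration, not part of the slides; the tiny dataset and the choice of a decision tree are assumptions):

```python
# Sketch of the 3-step process with scikit-learn on a small made-up dataset.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[25, 0], [42, 1], [33, 1], [51, 0], [28, 0], [45, 1], [39, 0], [60, 1]]
y = ["no", "yes", "yes", "no", "no", "yes", "no", "yes"]  # class labels

# 1. Model construction (learning) on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# 2. Model evaluation: accuracy rate on an independent test set.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 3. Model use: classify an unseen instance.
print("predicted class:", model.predict([[37, 1]])[0])
```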

  • Slide 7

    Process (1): Model Construction

    Training

    Data

    NAME   RANK             YEARS   TENURED
    Mike   Assistant Prof   3       no
    Mary   Assistant Prof   7       yes
    Bill   Professor        2       yes
    Jim    Associate Prof   7       yes
    Dave   Assistant Prof   6       no
    Anne   Associate Prof   3       no

    Classification

    Algorithms

    IF rank = professor

    OR years > 6

    THEN tenured = yes

    Classifier

    (Model)

  • Slide 8

    Process (2): Using the Model in Prediction

    Classifier

    Testing

    Data

    NAME      RANK             YEARS   TENURED
    Tom       Assistant Prof   2       no
    Merlisa   Associate Prof   7       no
    George    Professor        5       yes
    Joseph    Assistant Prof   7       yes

    Unseen Data

    (Jeff, Professor, 4)

    Tenured?
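
    The rule learned on the previous slide can be applied to this testing data and to the unseen instance (Jeff, Professor, 4); a minimal plain-Python sketch (not from the slides):

```python
# Apply the learned rule "IF rank = professor OR years > 6 THEN tenured = yes"
# to the testing data and to the unseen instance (Jeff, Professor, 4).
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [  # (name, rank, years, actual tenured label)
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# The known label of each test sample is compared with the model's output.
correct = sum(classify(rank, years) == label for _, rank, years, label in testing_data)
print("accuracy:", correct / len(testing_data))

# Unseen data: (Jeff, Professor, 4) -> Tenured?
print("Jeff:", classify("Professor", 4))
```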

  • Slide 9

    Accuracy of Supervised Learning

    Holdout Approach

    A certain amount of data (usually one-third) is used for testing and the remainder (two-thirds) for training

    K-fold cross-validation

    Split the data into k subsets of equal size; each subset in turn is used for testing and the remainder for training
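
    A hedged scikit-learn sketch of both estimates (the iris data is only a stand-in dataset, an assumption for illustration):

```python
# Holdout and k-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: one-third for testing, two-thirds for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# K-fold cross-validation: each of the k subsets is used once for testing.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("5-fold accuracies:", scores, "mean:", scores.mean())
```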

  • Slide 10

    Issues: Data Preparation

    Data cleaning

    Preprocess data in order to reduce noise and handle missing values

    Relevance analysis (feature selection)

    Remove the irrelevant or redundant attributes

    Data transformation

    Normalize data

  • Slide 11

    Issues: Evaluating Classification Methods

    Accuracy

    classifier accuracy: predicting class label

    predictor accuracy: guessing value of predicted attributes

    Speed

    time to construct the model (training time)

    time to use the model (classification/prediction time)

    Robustness: handling noise and missing values

    Scalability: efficiency in disk-resident databases

    Interpretability: understanding and insight provided by the model

    Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

  • Slide 12

    Other Issues

    Function complexity and amount of training data

    If the true function is simple, then a learning algorithm will be able to learn it from a small amount of data; otherwise the function will only be learnable from a very large amount of training data.

    Dimensionality of the input space

    Noise in the output values

    Heterogeneity of the data.

    If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others.

    Dependency of features

  • Slide 13

    Decision Tree - Review of Basics

    What exactly is a Decision Tree?

    A tree where each branching node represents a choice between two or more alternatives, with every branching node being part of a path to a leaf node (bottom of the tree). The leaf node represents a decision derived from the tree for the given input.

    How can Decision Trees be used to classify instances of data?

    Instead of representing decisions, leaf nodes represent a particular classification of a data instance, based on the given set of attributes (and their discrete values) that define the instance of data, which is rather like a relational tuple for illustration purposes.

  • Slide 14

    Review of Basics (contd)

    It is important that data instances have boolean or discrete

    data values for their attributes to help with the basic

    understanding of ID3, although there are extensions of ID3

    that deal with continuous data.

  • Slide 15

    How does ID3 relate to Decision Trees, then?

    ID3, or Iterative Dichotomiser 3 Algorithm, is a Decision Tree learning

    algorithm. The name is correct in that it creates Decision Trees for

    dichotomizing data instances, or classifying them discretely through

    branching nodes until a classification bucket is reached (leaf node).

    By using ID3 and other machine-learning algorithms from Artificial

    Intelligence, expert systems can engage in tasks usually done by human

    experts, such as doctors diagnosing diseases by examining various

    symptoms (the attributes) of patients (the data instances) in a complex

    Decision Tree.

    Of course, accurate Decision Trees are fundamental to Data Mining and

    Databases.

  • Slide 16

    Other Decision Trees

    ID3

    J48

    C4.5

    SLIQ

    SPRINT

    PUBLIC

    RainForest

    BOAT

    Data Cube-based Decision Tree

  • Slide 17

    Tree Presentation

  • Slide 18

    Decision Tree Induction: Training Dataset

    age      income   student   credit_rating   buys_computer
    >40      low      yes       fair            yes
    >40      low      yes       excellent       no
    31..40   low      yes       excellent       yes

  • Slide 19

    Output: A Decision Tree for buys_computer

    Figure: the root node tests age?; the <=30 branch leads to a student? test (no -> no, yes -> yes), the 31..40 branch predicts yes, and the >40 branch leads to a credit rating? test (excellent -> no, fair -> yes)

  • Slide 20

    Algorithm for Decision Tree Induction

    Basic algorithm (a greedy algorithm)

    Tree is constructed in a top-down recursive divide-and-conquer manner

    At start, all the training examples are at the root

    Attributes are categorical (if continuous-valued, they are discretized in advance)

    Examples are partitioned recursively based on selected attributes

    Test attributes are selected on the basis of a heuristic or statistical measure

    (e.g., information gain)

    Conditions for stopping partitioning

    All samples for a given node belong to the same class

    There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)

    There are no samples left
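
    A self-contained sketch of this greedy top-down induction in plain Python (an ID3-style illustration with a tiny made-up categorical dataset, not the slides' exact algorithm):

```python
# Greedy top-down decision-tree induction using information gain (ID3-style sketch).
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    total = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    # Stop: all samples for this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> majority voting.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice: attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        branch_rows = [row for row in rows if row[best] == value]
        branch_labels = [lab for row, lab in zip(rows, labels) if row[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(branch_rows, branch_labels, remaining)
    return tree

# Tiny made-up categorical dataset (hypothetical values).
rows = [{"age": "<=30", "student": "no"}, {"age": "<=30", "student": "yes"},
        {"age": "31..40", "student": "no"}, {"age": ">40", "student": "yes"}]
labels = ["no", "yes", "yes", "yes"]
print(build_tree(rows, labels, ["age", "student"]))
```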

  • Slide 21

    Entropy

    Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)

    Example: for a sample S with 2 positive and 6 negative examples, E(S) = -(2/8) log2(2/8) - (6/8) log2(6/8) = 0.811

    S is the training sample

    p(+) is the proportion of positive examples in S

    p(-) is the proportion of negative examples in S

    Entropy measures the impurity of S

    For non-Boolean classification: Entropy(S) = -sum over classes i of p_i log2 p_i
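
    A minimal plain-Python sketch of the entropy computation, reproducing the 2-positive / 6-negative example above:

```python
# Entropy of a labelled sample; works for Boolean and non-Boolean classification.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

# 2 positive and 6 negative examples -> about 0.811 bits.
print(entropy(["+"] * 2 + ["-"] * 6))
```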

  • Slide 22

    Attribute Selection Measure: Information Gain (ID3/C4.5)

    Gain(S, A) = expected reduction in entropy due to sorting on A

  • Slide 23

    Attribute Selection: Information Gain

    Class P: buys_computer = yes

    Class N: buys_computer = no

    Entropy_A(S) = sum over v in Values(A) of (|S_v| / |S|) Entropy(S_v)

    For age: Entropy_age(S) = (|S_<=30| / |S|) Entropy(S_<=30) + (|S_31..40| / |S|) Entropy(S_31..40) + (|S_>40| / |S|) Entropy(S_>40)

    age      income   student   credit_rating   buys_computer
    >40      low      yes       fair            yes
    >40      low      yes       excellent       no
    31..40   low      yes       excellent       yes
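
    A small plain-Python sketch of Entropy_A(S) and Gain(S, A); it is applied only to the three tuples visible above, so the result differs from the gain values on the next slide, which are computed from the full training set:

```python
# Sketch of Entropy_A(S) and Gain(S, A) over a list of (attribute value, class) pairs.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(pairs):
    """pairs: list of (value of attribute A, class label) for every tuple in S."""
    labels = [label for _, label in pairs]
    weighted = 0.0
    for value in set(v for v, _ in pairs):
        subset = [label for v, label in pairs if v == value]
        weighted += len(subset) / len(labels) * entropy(subset)  # Entropy_A(S)
    return entropy(labels) - weighted                            # Gain(S, A)

# Only the three tuples shown above (the full example uses the whole training set).
age_pairs = [(">40", "yes"), (">40", "no"), ("31..40", "yes")]
print("Gain(S, age) on the excerpt:", round(gain(age_pairs), 3))
```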

  • Slide 24

    Attribute Selection: Information Gain

    Gain(S, age) = 0.940 - 0.694 = 0.246

    Gain(S, income) = 0.029

    Gain(S, student) = 0.151

    Gain(S, credit_rating) = 0.048

    age      income   student   credit_rating   buys_computer
    >40      low      yes       fair            yes
    >40      low      yes       excellent       no
    31..40   low      yes       excellent       yes

  • Slide 25

    ID3 Algorithm

  • Slide 26

    Computing Information Gain for Continuous-Valued Attributes

    Let attribute A be a continuous-valued attribute

    Must determine the best split point for A

    Sort the values of A in increasing order

    Typically, the midpoint between each pair of adjacent values is considered as a possible split point

    (a_i + a_i+1) / 2 is the midpoint between the values of a_i and a_i+1

    The point with the minimum expected information requirement for A is selected as the split point for A

    Split:

    D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point
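
    A plain-Python sketch of this split-point search (the ages and class labels are hypothetical):

```python
# Choose the best split point for a continuous attribute A by testing the midpoints
# of adjacent sorted values and minimizing the expected information requirement
# (weighted entropy of D1: A <= split-point and D2: A > split-point).
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for (a_i, _), (a_next, _) in zip(pairs, pairs[1:]):
        if a_i == a_next:
            continue
        split = (a_i + a_next) / 2                    # midpoint (a_i + a_i+1) / 2
        d1 = [lab for v, lab in pairs if v <= split]  # D1: A <= split-point
        d2 = [lab for v, lab in pairs if v > split]   # D2: A > split-point
        info = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(pairs)
        best = min(best, (info, split))
    return best  # (expected information requirement, split point)

# Hypothetical ages and class labels, just for illustration.
print(best_split_point([25, 32, 40, 45, 52], ["no", "yes", "yes", "no", "no"]))
```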

  • Slide 27

    Gain Ratio for Attribute Selection and Cost of Attributes

    What happens if you choose Day as the root? Every Day value is unique, so its information gain is maximal, but the resulting tree does not generalize; the gain ratio penalizes such many-valued attributes

    Attribute with Cost
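
    The slide names gain ratio only in its title; as a reminder of the standard C4.5 definition (an addition, not from the slides), GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), where SplitInfo grows for attributes such as Day that split the data into many small partitions. A minimal sketch with illustrative numbers taken from the buys_computer example:

```python
# Sketch of C4.5-style split information and gain ratio.
from math import log2

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# A Day-like attribute splits 14 tuples into 14 singletons: its gain (0.940) is
# divided by log2(14), about 3.81. age splits them into groups of 5, 4 and 5:
# its gain (0.246) is divided by about 1.58.
print(round(gain_ratio(0.940, [1] * 14), 3))
print(round(gain_ratio(0.246, [5, 4, 5]), 3))
```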

  • Slide 28

    Overfitting and Tree Pruning

    Overfitting: An induced tree may overfit the training data

    Too many branches, some may reflect anomalies due to noise or outliers

    Poor accuracy for unseen samples

    Two approaches to avoid overfitting

    Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold

    Difficult to choose an appropriate threshold

    Postpruning: Remove branches from a fully grown tree to get a sequence of progressively pruned trees

    Use a set of data different from the training data to decide which is the best pruned tree
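
    A hedged scikit-learn sketch of the two approaches (prepruning via growth thresholds, postpruning via cost-complexity pruning selected on held-out data); the iris data and the specific parameters are only stand-ins:

```python
# Prepruning (halt growth early) vs postpruning (prune a fully grown tree).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Prepruning: do not split nodes beyond a threshold (here: depth and leaf size).
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Postpruning: grow the full tree, then pick the best of a sequence of
# progressively pruned trees (ccp_alpha) using data held out from training.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
pruned = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
          for a in path.ccp_alphas]
best = max(pruned, key=lambda t: t.score(X_val, y_val))
print("prepruned:", pre.score(X_val, y_val), "postpruned:", best.score(X_val, y_val))
```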

  • Slide 29

    Enhancements to Basic Decision Tree Induction

    Allow for continuous-valued attributes

    Dynamically define new discrete-valued attributes that partition the continuous

    attribute value into a discrete set of intervals

    Handle missing attribute values

    Assign the most common value of the attribute

    Assign probability to each of the possible values

    Attribute construction

    Create new attributes based on existing ones that are sparsely represented

    This reduces fragmentation, repetition, and replication

  • Slide 30

    Instance Based Learning

    Lazy learning (e.g., instance-based learning): simply stores the training data (with only minor processing) and waits until it is given a test tuple

    Eager learning (e.g., decision trees, SVM, neural networks): given a training set, constructs a classification model before receiving new (e.g., test) data to classify

  • Slide 31

    Instance-based Learning

    It is very similar to a desktop!

  • Slide 32

    Example

    Image Scene Classification

  • Slide 33

    Instance-based Learning

    When To Consider IBL

    Instances map to points

    Less than 20 attributes per instance

    Lots of training data

    Advantages:

    Learn complex target functions (classes)

    Don't lose information

    Disadvantages:

    Slow at query time

    Easily fooled by irrelevant attributes

  • Slide 34

    K-Nearest Neighbor Learning (k-NN)

    k-NN is a type of instance-based learning, or lazy learning

    where the function is only approximated locally and all

    computation is deferred until classification.

    The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small).

    If k = 1, then the object is simply assigned to the class of its nearest neighbor.

  • Slide 35

    K-Nearest Neighbor Learning (k-NN)- Cont.

    Usually Euclidean distance is used as the distance metric;

    however this is only applicable to continuous variables. In

    cases such as text classification, another metric such as the

    overlap metric (or Hamming distance: it measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other) can be used.

  • Slide 36

    Distance or Similarity Measures

    Common Distance Measures:

    Manhattan distance: dist(X, Y) = sum over i of |x_i - y_i|

    Euclidean distance: dist(X, Y) = sqrt( sum over i of (x_i - y_i)^2 )

    Cosine similarity: sim(X, Y) = ( sum over i of x_i y_i ) / ( sqrt(sum over i of x_i^2) * sqrt(sum over i of y_i^2) ), with dist(X, Y) = 1 - sim(X, Y)
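
    The three measures in plain Python (a sketch; the vectors are arbitrary examples):

```python
# Manhattan distance, Euclidean distance, and cosine similarity.
from math import sqrt

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

x, y = [1.0, 2.0, 3.0], [2.0, 0.0, 4.0]
print(manhattan(x, y), euclidean(x, y), 1 - cosine_similarity(x, y))  # dist = 1 - sim
```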

  • Slide 37

    KNN - Algorithm

    Key Idea: Just store all training examples

    Nearest Neighbour :

    Given query instance Xq, first locate the nearest training example Xn, then estimate f(Xq) as f(Xn)

    K Nearest Neighbour:

    Given Xq, take a vote among its K nearest neighbours (if the class is discrete-valued)

    Take the mean of the values of its K nearest neighbours (if real-valued)
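
    A plain-Python sketch of the algorithm: store the examples, then vote (discrete classes) or average (real values) over the k nearest ones. The 2-D points are made up for illustration:

```python
# k-NN: store the training examples, then classify a query by majority vote
# (discrete classes) or by the mean of the neighbours' values (real-valued).
from collections import Counter
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_nearest(train, x_q, k):
    """train: list of (feature vector, target). Returns the k nearest targets."""
    return [t for _, t in sorted(train, key=lambda ex: euclidean(ex[0], x_q))[:k]]

def knn_classify(train, x_q, k):
    return Counter(k_nearest(train, x_q, k)).most_common(1)[0][0]

def knn_regress(train, x_q, k):
    values = k_nearest(train, x_q, k)
    return sum(values) / len(values)

# Hypothetical 2-D examples for illustration.
train = [([1, 1], "A"), ([1, 2], "A"), ([5, 5], "B"), ([6, 5], "B"), ([5, 6], "B")]
print(knn_classify(train, [2, 2], 3))   # expected: "A"
```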

  • Slide 38

    Figure: K Nearest Neighbors example. X is the input pattern to classify; the stored training set patterns surround it, and dashed lines mark the Euclidean distance to the nearest three patterns

  • Slide 39

    Figure: eight paintings, numbered one to eight

    Which one belongs to Mondrian?

  • Slide 40

    Training data

    Number Lines Line types Rectangles Colours Mondrian?

    1 6 1 10 4 No

    2 4 2 8 5 No

    3 5 2 7 4 Yes

    4 5 1 8 4 Yes

    5 5 1 10 5 No

    6 6 1 8 6 Yes

    7 7 1 14 5 No

    Number Lines Line types Rectangles Colours Mondrian?

    8 7 2 9 4

    Test instance

  • Slide 41

    Normalization

    Min-max normalization to [new_min_A, new_max_A]:

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

    Z-score normalization (μ: mean, σ: standard deviation):

    v' = (v - μ_A) / σ_A

    Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
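
    A plain-Python sketch of the three normalizations, applied to the Lines column of the training data two slides back; the z-scores should match the normalised values on the next slide if the population standard deviation is used (an assumption):

```python
# Min-max, z-score, and decimal-scaling normalization for a list of values.
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5  # population std
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:  # smallest j with Max(|v'|) < 1
        j += 1
    return [v / 10 ** j for v in values]

lines = [6, 4, 5, 5, 5, 6, 7]  # the Lines column of the Mondrian training data
print(min_max(lines))
print([round(v, 3) for v in z_score(lines)])
print(decimal_scaling(lines))
```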

  • Slide 42

    Normalised training data

    Number   Lines   Line types   Rectangles   Colours   Mondrian?

    1 0.632 -0.632 0.327 -1.021 No

    2 -1.581 1.581 -0.588 0.408 No

    3 -0.474 1.581 -1.046 -1.021 Yes

    4 -0.474 -0.632 -0.588 -1.021 Yes

    5 -0.474 -0.632 0.327 0.408 No

    6 0.632 -0.632 -0.588 1.837 Yes

    7 1.739 -0.632 2.157 0.408 No

    Number   Lines   Line types   Rectangles   Colours   Mondrian?

    8 1.739 1.581 -0.131 -1.021

    Test instance

  • Slide 43

    Distances of test instance from training data

    Example   Distance of test from example   Mondrian?

    1 2.517 No

    2 3.644 No

    3 2.395 Yes

    4 3.164 Yes

    5 3.472 No

    6 3.808 Yes

    7 3.490 No

    Classification

    1-NN Yes

    3-NN Yes

    5-NN No

    7-NN No
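
    A plain-Python sketch that reproduces this table, assuming the z-scores on the previous slide were computed from the seven training rows with the population standard deviation and that Euclidean distance is used:

```python
# Normalize the Mondrian training data, compute distances from the normalized test
# instance, and take k-NN votes for k = 1, 3, 5, 7.
from collections import Counter
from math import sqrt

train = [  # (Lines, Line types, Rectangles, Colours), Mondrian?
    ((6, 1, 10, 4), "No"), ((4, 2, 8, 5), "No"), ((5, 2, 7, 4), "Yes"),
    ((5, 1, 8, 4), "Yes"), ((5, 1, 10, 5), "No"), ((6, 1, 8, 6), "Yes"),
    ((7, 1, 14, 5), "No"),
]
test = (7, 2, 9, 4)

# Per-attribute mean and population standard deviation over the training rows.
cols = list(zip(*[x for x, _ in train]))
means = [sum(c) / len(c) for c in cols]
stds = [sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]

def normalize(x):
    return [(v - m) / s for v, m, s in zip(x, means, stds)]

def euclidean(a, b):
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

dists = sorted((euclidean(normalize(x), normalize(test)), label) for x, label in train)
for d, label in dists:
    print(round(d, 3), label)              # reproduces the distance table above

for k in (1, 3, 5, 7):
    vote = Counter(label for _, label in dists[:k]).most_common(1)[0][0]
    print(f"{k}-NN:", vote)                # Yes, Yes, No, No
```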

  • Slide 44

    Example

    Classify Theatre ?

  • Slide 45

    Nearest neighbors algorithms: illustration

    Figure: a query point q1 surrounded by positive (+) and negative (-) training examples

    1-nearest neighbor: q1 is assigned the concept represented by its nearest example e1

    5-nearest neighbors: q1 is classified as negative

  • Slide 46

    Voronoi diagram

    Figure: Voronoi regions around the stored training points, with a query point qf and its nearest neighbor qi

  • Slide 47

    Variant of kNN: Distance-Weighted kNN

    We might want to weight nearer neighbors more heavily

    Then it makes sense to use all training examples instead of just k

    f(x_q) = ( sum over i = 1..k of w_i f(x_i) ) / ( sum over i = 1..k of w_i ),  where w_i = 1 / d(x_q, x_i)^2
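
    A plain-Python sketch of the distance-weighted vote for discrete classes (the 2-D examples are made up):

```python
# Distance-weighted k-NN: every neighbour votes with weight 1 / d^2, so all
# training examples can be used instead of just k.
from collections import defaultdict
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_knn(train, x_q):
    votes = defaultdict(float)
    for x_i, label in train:
        d = euclidean(x_i, x_q)
        if d == 0:
            return label                 # exact match: return its class directly
        votes[label] += 1.0 / d ** 2     # w_i = 1 / d(x_q, x_i)^2
    return max(votes, key=votes.get)

# Hypothetical 2-D examples for illustration.
train = [([1, 1], "A"), ([2, 1], "A"), ([6, 6], "B"), ([7, 6], "B"), ([6, 7], "B")]
print(weighted_knn(train, [3, 3]))
```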

  • Slide 48

    Difficulties with k-nearest neighbour algorithms

    Have to calculate the distance of the test case from all

    training cases

    There may be irrelevant attributes amongst the attributes

    Curse of dimensionality

    For instance, if we have 20 attributes but only 2 of them are relevant, the irrelevant attributes distort the distances and affect the results

    Solution: assign appropriate weights to the attributes, for example by using cross-validation.

  • Slide 49

    How to choose k

    Large k:

    less sensitive to noise (particularly class noise)

    better probability estimates for discrete classes

    larger training sets allow larger values of k

    Small k: captures fine structure of problem space better

    may be necessary with small training sets

    Balance must be struck between large and small k

    As the training set approaches infinity and k grows large, kNN becomes Bayes optimal

  • Slide 50

    1. Use Weka to classify and evaluate the Letter Recognition dataset through IBL

    2. Write down 3 new challenges in KNN