Classification and Prediction Classificationnovel.ict.ac.cn/files/Day 4.pdf · 2014-05-08 1...

2014-05-08

1

Classification

Jian Pei: Big Data Analytics -- Classification 2

Classification and Prediction

•  Classification: predict categorical class labels
    – Build a model for a set of classes/concepts
    – Classify loan applications (approve/decline)
•  Prediction: model continuous-valued functions
    – Predict the economic growth in 2015


Classification: A 2-step Process

•  Model construction: describe a set of predetermined classes
    – Training dataset: tuples for model construction
        •  Each tuple/sample belongs to a predefined class
    – Classification rules, decision trees, or math formulae
•  Model application: classify unseen objects
    – Estimate accuracy of the model using an independent test set
    – Acceptable accuracy → apply the model to classify tuples with unknown class labels


Model Construction

Training Data → Classification Algorithms → Classifier (Model)

Training Data:

Name   Rank        Years  Tenured
Mike   Ass. Prof   3      No
Mary   Ass. Prof   7      Yes
Bill   Prof        2      Yes
Jim    Asso. Prof  7      Yes
Dave   Ass. Prof   6      No
Anne   Asso. Prof  3      No

Classifier (Model):

IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’


Model Application

[Figure: the Classifier takes the Testing Data and the Unseen Data as input]

Testing Data:

Name     Rank        Years  Tenured
Tom      Ass. Prof   2      No
Merlisa  Asso. Prof  7      No
George   Prof        5      Yes
Joseph   Ass. Prof   7      Yes

Unseen Data:

(Jeff, Professor, 4) → Tenured?
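The two steps can be sketched in a few lines of Python. This is only an illustration: the function name `predict_tenured` and the treatment of "Prof" and "Professor" as the same rank are assumptions, not part of the slides.

```python
def predict_tenured(rank: str, years: int) -> str:
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    # Assumption: "Prof" and "Professor" denote the same rank.
    is_professor = rank.lower() in {"prof", "professor"}
    return "Yes" if is_professor or years > 6 else "No"

# Testing data from the slide: (name, rank, years, actual label)
test_set = [
    ("Tom", "Ass. Prof", 2, "No"),
    ("Merlisa", "Asso. Prof", 7, "No"),
    ("George", "Prof", 5, "Yes"),
    ("Joseph", "Ass. Prof", 7, "Yes"),
]

# Step 1: estimate accuracy on the independent test set.
correct = sum(
    predict_tenured(rank, years) == label for _, rank, years, label in test_set
)
accuracy = correct / len(test_set)

# Step 2: if the accuracy is acceptable, classify the unseen tuple.
jeff = predict_tenured("Professor", 4)
print(accuracy, jeff)
```

On this test set the rule misclassifies only Merlisa (7 years, not tenured), so three of the four test tuples are classified correctly.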


Supervised/Unsupervised Learning

•  Supervised learning (classification)
    – Supervision: objects in the training data set have labels
    – New data is classified based on the training set
•  Unsupervised learning (clustering)
    – The class labels of training data are unknown
    – Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data



Data Preparation

•  Data cleaning
    – Preprocess data in order to reduce noise and handle missing values
•  Relevance analysis (feature selection)
    – Remove the irrelevant or redundant attributes
•  Data transformation
    – Generalize and/or normalize data


Measurements of Quality

•  Prediction accuracy
•  Speed and scalability
    – Construction speed and application speed
•  Robustness: handle noise and missing values
•  Scalability: build model for large training data sets
•  Interpretability: understandability of models


Decision Tree Induction

•  Decision tree representation
•  Construction of a decision tree
•  Inductive bias and overfitting
•  Scalable enhancements for large databases


Decision Tree

•  A node in the tree: a test of some attribute
•  A branch: a possible value of the attribute
•  Classification
    – Start at the root
    – Test the attribute
    – Move down the tree branch

Outlook
    Sunny → Humidity
        High → No
        Normal → Yes
    Overcast → Yes
    Rain → Wind
        Strong → No
        Weak → Yes


Training Dataset

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No


Appropriate Problems

•  Instances are represented by attribute-value pairs
    – Extensions of decision trees can handle real-valued attributes
•  Disjunctive descriptions may be required
•  The training data may contain errors or missing values



Basic Algorithm ID3

•  Construct a tree in a top-down recursive divide-and-conquer manner
    – Which attribute is the best at the current node?
    – Create a node for each possible attribute value
    – Partition training data into descendant nodes
•  Conditions for stopping recursion
    – All samples at a given node belong to the same class
    – No attribute remains for further partitioning
        •  Majority voting is employed for classifying the leaf
    – There is no sample at the node


Which Attribute Is the Best?

•  The attribute most useful for classifying examples

•  Information gain and gini index
    – Statistical properties
    – Measure how well an attribute separates the training examples


Entropy

•  Measure homogeneity of examples

$$Entropy(S) \equiv -\sum_{i=1}^{c} p_i \log_2 p_i$$

    – S is the training data set, and p_i is the proportion of S belonging to class i
•  The smaller the entropy, the purer the data set


Information Gain

•  The expected reduction in entropy caused by partitioning the examples according to an attribute

$$Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$

Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v


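Example

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
…
Rain      Mild  High    Strong  No

$$Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$$

$$\begin{aligned}
Gain(S, Wind) &= Entropy(S) - \sum_{v \in \{Weak, Strong\}} \frac{|S_v|}{|S|} Entropy(S_v) \\
&= Entropy(S) - \frac{8}{14} Entropy(S_{Weak}) - \frac{6}{14} Entropy(S_{Strong}) \\
&= 0.94 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.00 = 0.048
\end{aligned}$$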
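The entropy and information gain calculations can be reproduced with a short Python sketch over the 14-tuple PlayTennis training set. The names `DATA`, `ATTRS`, `entropy`, and `gain` are illustrative choices, not part of the slides.

```python
from collections import Counter
from math import log2

# PlayTennis training data: (Outlook, Temp, Humid, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humid": 2, "Wind": 3}

def entropy(rows):
    """Entropy of the class label (last column) over the given rows."""
    n = len(rows)
    counts = Counter(r[-1] for r in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def gain(rows, attr):
    """Expected reduction in entropy from partitioning rows on attr."""
    idx = ATTRS[attr]
    n = len(rows)
    remainder = 0.0
    for value in {r[idx] for r in rows}:
        subset = [r for r in rows if r[idx] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(rows) - remainder

for name in ATTRS:
    print(name, round(gain(DATA, name), 3))
```

Running the loop over all four attributes shows which one ID3 would pick at the root: the attribute with the largest gain.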


Hypothesis Space Search in Decision Tree Building

•  Hypothesis space: the set of possible decision trees
•  ID3: simple-to-complex, hill-climbing search
    – Evaluation function: information gain



Capabilities and Limitations

•  The hypothesis space is complete
•  Maintains only a single current hypothesis
•  No backtracking
    – May converge to a locally optimal solution
•  Use all training examples at each step
    – Make statistics-based decisions
    – Not sensitive to errors in individual examples


Natural Bias

•  The information gain measure favors attributes with many values
•  An extreme example
    – Attribute “date” may have the highest information gain
    – A very broad decision tree of depth one
    – Inapplicable to any future data


Alternative Measures

•  Gain ratio: penalize attributes like date by incorporating split information

$$GainRatio(S, A) \equiv \frac{Gain(S, A)}{SplitInformation(S, A)}$$

•  Split information is sensitive to how broadly and uniformly the attribute splits the data

$$SplitInformation(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

•  Gain ratio can be undefined or very large
    – Only test attributes with above-average gain
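A small Python sketch shows how split information penalizes a date-like attribute. The function names and the date-like attribute (14 distinct values on the 14-tuple PlayTennis set, so every subset is pure and its gain equals the full entropy of 0.94) are hypothetical illustrations.

```python
from math import log2

def split_information(sizes):
    """SplitInformation for a partition with subset sizes |S_1|, ..., |S_c|."""
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    """Gain divided by split information; undefined when the split information is 0."""
    return gain / split_information(sizes)

# Wind splits the 14 tuples into Weak (8) and Strong (6), with Gain = 0.048.
wind_ratio = gain_ratio(0.048, [8, 6])

# Hypothetical date-like attribute: 14 singleton subsets, Gain = 0.94.
date_ratio = gain_ratio(0.94, [1] * 14)

print(round(wind_ratio, 3), round(date_ratio, 3))
```

The date-like attribute has split information log2(14), so its gain ratio is far smaller than its raw gain. An attribute that puts every tuple into a single subset has zero split information, which is the "undefined" case mentioned in the last bullet.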


Measuring Inequality

Lorenz Curve
•  X-axis: quintiles
•  Y-axis: accumulative share of income earned by the plotted quintile
•  Gap between the actual lines and the mythical line: the degree of inequality

Gini index
•  Gini = 0: even distribution
•  Gini = 1: perfectly unequal
•  The greater the distance, the more unequal the distribution


Gini Index (Adjusted)

•  A data set T contains examples from n classes

$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$

    – p_j is the relative frequency of class j in T
•  A data set T of size N is split into two subsets T_1 and T_2 with sizes N_1 and N_2, respectively

$$gini_{split}(T) = \frac{N_1}{N} gini(T_1) + \frac{N_2}{N} gini(T_2)$$

•  The attribute that provides the smallest gini_split(T) is chosen to split the node
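The two gini formulas translate directly into Python. The sketch below works on class counts rather than raw tuples; the function names are illustrative, and the Wind split of the PlayTennis data (Weak: 6 Yes / 2 No, Strong: 3 Yes / 3 No) is used as the example.

```python
def gini(class_counts):
    """Gini index of a data set given the number of examples in each class."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(counts1, counts2):
    """Size-weighted gini index of a binary split into two subsets."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

g_all = gini([9, 5])                  # whole data set: 9 Yes, 5 No
g_wind = gini_split([6, 2], [3, 3])   # split on Wind: Weak vs. Strong
print(round(g_all, 3), round(g_wind, 3))
```

To choose a splitting attribute, `gini_split` would be evaluated for each candidate binary split and the smallest value kept.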


Extracting Classification Rules

•  Classification rules can be extracted from a decision tree

•  Each path from the root to a leaf → an IF-THEN rule
    – All attribute-value pairs along a path form a conjunctive condition
    – The leaf node holds the class prediction
    – IF age = “<=30” AND student = “no” THEN buys_computer = “no”
•  Rules are easy to understand



Inductive Bias

•  Inductive bias: the set of assumptions that, together with the training data, deductively justifies the classification of future instances
    – Preferences of the classifier construction
•  Shorter trees are preferred over longer trees
•  Trees that place high information gain attributes close to the root are preferred


Why Prefer Short Trees?

•  Occam’s razor: prefer the simplest hypothesis that fits the data

•  Fewer short trees than long trees
•  A short tree is less likely to be a statistical coincidence

“One should not increase, beyond what is necessary, the number of entities required to explain anything”
    – Also known as the principle of parsimony


Overfitting

•  A decision tree T may overfit the training data
    – if ∃ an alternative tree T’ s.t. T has a higher accuracy than T’ over the training examples, but T’ has a higher accuracy than T over the entire distribution of data
•  Why overfitting?
    – Noisy data

[Figure: T and T’ compared on the training data and on all data]


Avoid Overfitting

•  Prepruning: stop growing the tree earlier – Difficult to choose an appropriate threshold

•  Postpruning: remove branches from a “fully grown” tree – Use an independent set of data to prune

•  Key: how to determine the correct final tree size


Determine the Final Tree Size

•  Separate training (2/3) and testing (1/3) sets
•  Use cross validation, e.g., 10-fold cross validation
•  Use all the data for training
    – Apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
•  Use minimum description length (MDL) principle
    – Halting growth of the tree when the encoding is minimized


Enhancements

•  Allow for attributes of continuous values
    – Dynamically discretize continuous attributes
•  Handle missing attribute values
•  Attribute construction
    – Create new attributes based on existing ones that are sparsely represented
    – Reduce fragmentation, repetition, and replication



The Evaluation Issues

•  The accuracy of a classifier can be evaluated using a test data set
    – The test set is a part of the available labeled data set
•  But how can we evaluate the accuracy of a classification method?
    – A classification method can generate many classifiers
•  What if the available labeled data set is too small?

Holdout Method

•  Partition the available labeled data set into two disjoint subsets: the training set and the test set – 50-50 – 2/3 for training and 1/3 for testing

•  Build a classifier using the training set •  Evaluate the accuracy using the test set


Limitations of Holdout Method

•  Fewer labeled examples for training
•  The classifier highly depends on the composition of the training and test sets
    – The smaller the training set, the larger the variance
•  If the test set is too small, the evaluation is not reliable
•  The training and test sets are not independent

Cross-Validation

•  Each record is used the same number of times for training and exactly once for testing

•  K-fold cross-validation
    – Partition the data into k equal-sized subsets
    – In each round, use one subset as the test set, and use the remaining subsets together as the training set
    – Repeat k times
    – The total error is the sum of the errors in k rounds
•  Leave-one-out: k = n
    – Utilize as much data as possible for training
    – Computationally expensive
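The partitioning scheme can be sketched as an index generator in Python. The function names and the caller-supplied `count_errors` callback are assumptions made for illustration; in practice the records are usually shuffled before partitioning, which this sketch leaves out.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation over n records."""
    indices = list(range(n))
    # Folds are as equal as possible: the first n % k folds get one extra record.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validation_error(n, k, count_errors):
    """Sum the test errors over the k rounds.

    count_errors(train, test) is a hypothetical callback that builds a
    classifier on `train` and returns its number of errors on `test`.
    """
    return sum(count_errors(train, test) for train, test in k_fold_splits(n, k))

splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print(test)
```

Setting k = n gives leave-one-out: each round tests on a single record and trains on the other n − 1.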


Bootstrap

•  Use a bootstrap sample as the training set, use the tuples not in the training set as the test set

•  .632 bootstrap: compute the overall accuracy by combining the accuracy ε_i of each bootstrap sample with the accuracy acc_all computed from a classifier using the whole data set as the training set

$$acc_{bootstrap} = \frac{1}{k} \sum_{i=1}^{k} (0.632 \times \epsilon_i + 0.368 \times acc_{all})$$


Confidence Interval for Accuracy

•  Suppose a classifier C is tested on a test set of n cases, and the accuracy is acc
•  How much confidence can we have in acc?
•  We need to estimate the confidence interval of a given model accuracy



Binomial Experiments

•  When a coin is flipped, it has a probability p of landing heads up
•  If the coin is flipped N times, what is the probability that we see heads X times?

$$P(X = v) = \binom{N}{v} p^v (1 - p)^{N - v}$$

    – Expectation (mean): Np
    – Variance: Np(1 − p)


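Confidence Level and Approximation

•  Approximating using normal distribution
•  Area = 1 − α between Z_{α/2} and Z_{1−α/2}

$$P\left(Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1 - p)/N}} < Z_{1-\alpha/2}\right) = 1 - \alpha$$

•  Confidence interval for p:

$$\frac{2 \cdot N \cdot acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4 \cdot N \cdot acc - 4 \cdot N \cdot acc^2}}{2(N + Z_{\alpha/2}^2)}$$

•  Z_α: the bound at confidence level (1 − α)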
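The interval formula can be evaluated with the Python standard library. In this sketch the function name and the example values (accuracy 0.8 measured on 100 test cases, 95% confidence) are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def accuracy_confidence_interval(acc, n, confidence=0.95):
    """Confidence interval for the true accuracy p, using the normal approximation."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95% confidence
    center = 2 * n * acc + z ** 2
    spread = z * sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom

low, high = accuracy_confidence_interval(0.8, 100)
print(round(low, 3), round(high, 3))
```

Calling the function with a larger `n` and the same `acc` gives a narrower interval, which is why small test sets lead to unreliable accuracy estimates.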


Accuracy Can Be Misleading …

•  Consider a data set with 99% of the examples in the negative class and 1% in the positive class
•  A classifier that predicts everything as negative has an accuracy of 99%, though it does not work for the positive class at all!
•  Imbalanced class distributions are common in many applications
    – Medical applications, fraud detection, …


Performance Evaluation Matrix

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  a (TP)      b (FN)
              Class=No   c (FP)      d (TN)

$$Accuracy = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$

Confusion matrix: used for imbalanced class distribution


Performance Evaluation Matrix

•  True positive rate (TPR, sensitivity) = TP / (TP + FN)
•  True negative rate (TNR, specificity) = TN / (TN + FP)
•  False positive rate (FPR) = FP / (TN + FP)
•  False negative rate (FNR) = FN / (TP + FN)


Recall and Precision

•  Target class is more important than the other classes

•  Precision p = TP / (TP + FP)
•  Recall r = TP / (TP + FN)
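The rates defined on the last few slides can all be computed from the four confusion-matrix counts. The Python sketch below uses a hypothetical data set of 1,000 examples with 1% positives and a classifier that predicts everything as negative; the function name and the choice to return NaN for a zero denominator are assumptions.

```python
def rates(tp, fn, fp, tn):
    """Accuracy, TPR (recall), TNR, FPR, FNR and precision from a confusion matrix."""
    def ratio(num, den):
        # Assumption: report NaN when a rate is undefined (zero denominator).
        return num / den if den else float("nan")
    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "tpr": ratio(tp, tp + fn),
        "tnr": ratio(tn, tn + fp),
        "fpr": ratio(fp, tn + fp),
        "fnr": ratio(fn, tp + fn),
        "precision": ratio(tp, tp + fp),
    }

# Hypothetical: 10 positives, 990 negatives, everything predicted as negative.
m = rates(tp=0, fn=10, fp=0, tn=990)
print(m["accuracy"], m["tpr"], m["fnr"])
```

The accuracy is 99% while the recall is 0, which is the misleading situation described earlier. Precision is undefined here because no example is predicted positive.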



Fallout

•  Type I errors: false positive, a negative object is classified as positive
    – Precision does not measure this error rate directly (precision p = TP / (TP + FP))
    – Fallout: the type I error rate, FP / (FP + TN)
•  Type II errors: false negative, a positive object is classified as negative
    – Captured by recall


Fβ Measure

•  How can we summarize precision and recall into one metric?
    – Using the harmonic mean between the two

$$\text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2TP}{2TP + FP + FN}$$

•  Fβ measure

$$F_\beta = \frac{(\beta^2 + 1) r p}{r + \beta^2 p} = \frac{(\beta^2 + 1) TP}{(\beta^2 + 1) TP + \beta^2 FN + FP}$$

    – β = 0, Fβ is the precision
    – β = ∞, Fβ is the recall
    – 0 < β < ∞, Fβ is a tradeoff between the precision and the recall
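A short Python sketch of the count-based form of Fβ makes the two limiting cases easy to check. The function name and the confusion-matrix counts are hypothetical.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F_beta computed directly from confusion-matrix counts."""
    b2 = beta ** 2
    return (b2 + 1) * tp / ((b2 + 1) * tp + b2 * fn + fp)

tp, fp, fn = 40, 10, 20           # hypothetical counts
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = f_beta(tp, fp, fn)           # beta = 1: harmonic mean of precision and recall
f0 = f_beta(tp, fp, fn, beta=0)   # beta = 0: equals the precision
f_large = f_beta(tp, fp, fn, beta=1e6)  # very large beta: approaches the recall
print(round(f1, 4), f0, round(f_large, 4))
```

Values of β above 1 weight recall more heavily, and values below 1 weight precision more heavily.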


Weighted Accuracy

•  A more general metric

$$\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$$

Measure     w1       w2   w3   w4
Recall      1        1    0    0
Precision   1        0    1    0
Fβ          β² + 1   β²   1    0
Accuracy    1        1    1    1


ROC Curve

•  Receiver Operating Characteristic (ROC)
•  Example: a 1-dimensional data set containing 2 classes; any point located at x > t is classified as positive


ROC Curve

(TP, FP):
•  (0,0): declare everything to be negative class
•  (1,1): declare everything to be positive class
•  (1,0): ideal
•  Diagonal line:
    – Random guessing
    – Below diagonal line: prediction is opposite of the true class

Figure from [Tan, Steinbach, Kumar]


Comparing Two Classifiers

Figure from [Tan, Steinbach, Kumar]



Cost-Sensitive Learning

•  In some applications, misclassifying some classes may be disastrous
    – Tumor detection, fraud detection
•  Using a cost matrix

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  -1          100
              Class=No   1           0


Sampling for Imbalance Classes

•  Consider a data set containing 100 positive examples and 1,000 negative examples

•  Undersampling: use a random sample of 100 negative examples and all positive examples
    – Some useful negative examples may be lost
    – Run undersampling multiple times, use the ensemble of multiple base classifiers
    – Focused undersampling: remove negative samples that are not useful for classification, e.g., those far away from the decision boundary


Oversampling

•  Replicate the positive examples until the training set has an equal number of positive and negative examples

•  For noisy data, may cause overfitting
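Both sampling strategies can be sketched in Python on the 100-positive / 1,000-negative example. The function names are illustrative, the examples are placeholder tuples, and drawing the replicated positives at random is an assumption (replicating each positive a fixed number of times would work as well).

```python
import random

def undersample(positives, negatives, rng):
    """Keep all positives and a random sample of negatives of the same size."""
    return positives + rng.sample(negatives, len(positives))

def oversample(positives, negatives, rng):
    """Replicate positives (drawn with replacement) until the classes are balanced."""
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    return positives + extra + negatives

rng = random.Random(0)                      # fixed seed for repeatability
pos = [("pos", i) for i in range(100)]      # placeholder positive examples
neg = [("neg", i) for i in range(1000)]     # placeholder negative examples

under = undersample(pos, neg, rng)
over = oversample(pos, neg, rng)
print(len(under), len(over))
```

Both functions assume the positive class is the smaller one. Calling `undersample` several times with different random draws yields the multiple training sets needed for the ensemble variant.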


Significance Tests

•  Are two algorithms different in effectiveness?
    – The null hypothesis: there is NO difference
    – The alternative hypothesis: there is a difference, i.e., B is better than A (the baseline method)
•  Matched pair experiments: the rankings that are compared are based on the same set of queries for both algorithms
•  Possible errors of significance tests
    – Type I: the null hypothesis is rejected when it is true
    – Type II: the null hypothesis is accepted when it is false
•  The power of a hypothesis test: the probability that the test will reject the null hypothesis correctly
    – Reducing the type II errors


Procedure of Comparison

•  Using a set of data sets
•  Procedure
    – Compute the effectiveness measure for every data set
    – Compute a test statistic based on a comparison of the effectiveness measures for each data set
        •  E.g., the t-test, the Wilcoxon signed-rank test, and the sign test
    – Compute a P-value: the probability that a test statistic value at least that extreme could be observed if the null hypothesis were true
    – The null hypothesis is rejected if the P-value ≤ α, where α is the significance level which is used to minimize the type I errors
•  One-sided (one-tailed) tests: whether B is better than A (the baseline method)
    – Two-sided tests: whether A and B are different; the P-value is doubled


Distribution of Test Statistics



T-test

•  Assuming data values are sampled from normal distributions
    – In a matched pair experiment, assuming the difference between the effectiveness values is a sample from a normal distribution
•  The null hypothesis: the mean of the distribution of differences is 0

$$t = \frac{\overline{B - A}}{\sigma_{B-A}} \cdot \sqrt{N}$$

    – $\overline{B - A}$ is the mean of the differences, $\sigma_{B-A}$ is the standard deviation of the differences

$$\sigma^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2$$


Example

$$\overline{B - A} = 21.4, \quad \sigma_{B-A} = 29.1, \quad t = 2.33$$

P-value = 0.02, significant at a level of α = 0.05: the null hypothesis can be rejected
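The t statistic in this example can be reproduced in Python. The slide does not list the individual differences, so the values below are an assumption: they are the nine signed non-zero differences from the Wilcoxon signed-rank example later in these slides plus one zero difference, which together give the mean of 21.4. The sketch uses the sample standard deviation (divisor N − 1), which reproduces σ = 29.1.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(differences):
    """t statistic for a matched pair experiment: mean / std * sqrt(N)."""
    n = len(differences)
    # statistics.stdev is the sample standard deviation (divisor N - 1).
    return mean(differences) / stdev(differences) * sqrt(n)

# Assumed differences B - A (see the note above): nine non-zero values and one zero.
diffs = [-2, 9, 10, -24, 25, 25, 41, 60, 70, 0]

t = paired_t(diffs)
print(mean(diffs), round(stdev(diffs), 1), round(t, 2))
```

Turning `t` into a P-value requires the t distribution with N − 1 degrees of freedom, which is not in the Python standard library; a statistics package or a printed table can be used for that step.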


Issues in T-test

•  Data is assumed to be sampled from normal distributions
    – Generally inappropriate for effectiveness measures
    – However, experiments showed that the t-test produces very similar results to the randomization test, which does not assume any distribution (the most powerful nonparametric test)
•  The t-test assumes that the evaluation data is measured on an interval scale
    – Effectiveness measures are ordinal: the magnitude of the differences is not significant
    – Use the Wilcoxon signed-rank test and the sign test, which make fewer assumptions about the effectiveness measure, but are less powerful


Wilcoxon Signed-Rank Test

•  Assumption: the differences between the effectiveness values can be ranked, but the magnitude is not important

$$w = \sum_{i=1}^{N} R_i$$

    – R_i is a signed rank, N is the number of non-zero differences
•  Procedure
    – The differences are sorted by their absolute values in increasing order
    – Differences are assigned rank values (ties are assigned the average rank)
    – The rank values are given the sign of the original difference
•  The null hypothesis: the sum of the positive ranks will be the same as the sum of the negative ranks


Example

•  The non-zero differences in rank order of absolute value: 2, 9, 10, 24, 25, 25, 41, 60, 70
•  The signed ranks: -1, +2, +3, -4, +5.5, +5.5, +7, +8, +9
•  w = 35
•  P-value = 0.025, significant at a level of α = 0.05: the null hypothesis can be rejected
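The signed-rank procedure can be sketched in Python. The signs of the input differences below are inferred from the signed ranks listed in the example (the differences of magnitude 2 and 24 are negative); the function name is an illustrative choice.

```python
def signed_ranks(differences):
    """Rank non-zero differences by absolute value, average tied ranks, restore signs."""
    nonzero = sorted((d for d in differences if d != 0), key=abs)
    ranks = []
    i = 0
    while i < len(nonzero):
        # Find the run of differences tied in absolute value: positions i .. j-1.
        j = i
        while j < len(nonzero) and abs(nonzero[j]) == abs(nonzero[i]):
            j += 1
        average_rank = (i + 1 + j) / 2   # mean of the ranks i+1 .. j
        ranks.extend(average_rank if d > 0 else -average_rank for d in nonzero[i:j])
        i = j
    return ranks

diffs = [-2, 9, 10, -24, 25, 25, 41, 60, 70]
ranks = signed_ranks(diffs)
w = sum(ranks)
print(ranks, w)
```

The two tied differences of 25 occupy ranks 5 and 6, so each receives the average rank 5.5.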


Sign Test

•  Completely ignore the magnitude of the differences
    – In practice, we may require that a 5-10% difference is needed to be considered as different
•  The null hypothesis: P(B > A) = P(A > B) = ½
•  Sum up the number of pairs B > A



Example

•  7 pairs out of 10 have B > A
•  P-value = 0.17: the probability that we observe at least 7 successes out of 10 trials where the probability of success is 0.5
•  Cannot reject the null hypothesis
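The P-value of the sign test is a binomial tail probability, which can be computed exactly in Python. The function name is an illustrative choice.

```python
from math import comb

def sign_test_p_value(successes, trials):
    """One-sided P-value: probability of at least `successes` wins in `trials`
    independent trials when each trial succeeds with probability 0.5."""
    favorable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favorable / 2 ** trials

p_value = sign_test_p_value(7, 10)
print(round(p_value, 2))
```

With 7 wins out of 10 the tail contains 120 + 45 + 10 + 1 = 176 of the 1,024 equally likely outcomes, which is well above a significance level of 0.05.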


Intuition – Bayesian Classification

•  More hockey fans in Canada than in the US
    – Which country is Tom, a hockey fan, from?
    – Predicting Canada has a better chance of being right
•  Prior probability P(Canadian) = 5%: reflects the background knowledge that 5% of the total population is Canadian
•  P(hockey fan | Canadian) = 30%: the probability that a Canadian is also a hockey fan
•  Posterior probability P(Canadian | hockey fan): the probability that a hockey fan is from Canada


Bayes Theorem

$$P(h \mid D) = \frac{P(D \mid h) P(h)}{P(D)}$$

•  Find the maximum a posteriori (MAP) hypothesis

$$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h) P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h) P(h)$$

    – Require background knowledge
    – Computational cost


Naïve Bayes Classifier

•  Assumption: attributes are conditionally independent given the class
•  Given a tuple (a1, a2, …, an), predict its class as

$$C = \arg\max_{C_i} P(a_1, a_2, \ldots, a_n \mid C_i) P(C_i) = \arg\max_{C_i} P(C_i) \prod_{j=1}^{n} P(a_j \mid C_i)$$

    – $\arg\max_x f(x)$: the value of x that maximizes f(x)
    – Example: $\arg\max_{x \in \{1, 2, -3\}} x^2 = -3$


Example: Training Dataset

Data sample X = (Outlook=sunny, Temp=mild, Humid=high, Wind=weak). Will she play tennis? No

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
…

P(Yes|X) ∝ P(X|Yes) P(Yes) = 0.014
P(No|X) ∝ P(X|No) P(No) = 0.027
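The class scores for X can be recomputed from the 14-tuple PlayTennis training set given earlier, using raw frequency estimates for every probability. The names `DATA` and `naive_bayes_scores` are illustrative, and no smoothing is applied in this sketch.

```python
from collections import Counter

# PlayTennis training data: (Outlook, Temp, Humid, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def naive_bayes_scores(rows, x):
    """Return P(C) * prod_j P(a_j | C) for each class C, from raw frequencies."""
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(rows)                      # prior P(C)
        for j, value in enumerate(x):
            matches = sum(1 for r in rows if r[-1] == c and r[j] == value)
            score *= matches / n_c                   # P(a_j | C)
        scores[c] = score
    return scores

x = ("Sunny", "Mild", "High", "Weak")
scores = naive_bayes_scores(DATA, x)
prediction = max(scores, key=scores.get)
print({c: round(s, 3) for c, s in scores.items()}, prediction)
```

The scores are unnormalized: dividing each by their sum would give the posterior probabilities, but the argmax is the same either way.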

Probability of Infrequent Values

•  (outlook = Sunny, temp = high, humid = low, wind = weak)?

•  P(humid = low) = 0


Outlook Temp Humid Wind PlayTennis Sunny Hot High Weak No Sunny Hot High Strong No






Smoothing

•  Suppose an attribute has n different values: a1, …, an
•  Assume a small enough value ε > 0
•  Let Pi be the frequency of ai, Pi = (# tuples having ai) / (total # of tuples)
•  Estimate

$$P(a_i) = \epsilon + (1 - n\epsilon) P_i$$

Handling Continuous Attributes

•  Discretization •  Probability density estimation


Density Estimation

•  Let μ_ij and σ²_ij be the mean and variance of attribute X_i over all samples of class C_j, respectively

$$P(X_i = x_i \mid C_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} e^{-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}}$$

Characteristics of Naïve Bayes

•  Robust to isolated noise points
    – Such points are averaged out in probability computation
•  Insensitive to missing values
•  Robust to irrelevant attributes
    – Distributions on such attributes are almost uniform
•  Correlated attributes degrade the performance


Bayes Error Rate

•  The error rate of the ideal naïve Bayes classifier


$$Err = \int_{0}^{\hat{x}} P(Crocodile \mid X)\,dX + \int_{\hat{x}}^{\infty} P(Alligator \mid X)\,dX$$


Pros and Cons

•  Pros
    – Easy to implement
    – Good results obtained in many cases
•  Cons
    – A (too) strong assumption: independent attributes
•  How to handle dependent/correlated attributes?
    – Bayesian belief networks



Bayesian Networks

•  A Bayesian belief network allows a subset of the variables to be conditionally independent
•  A graphical model of causal relationships
    – Represents dependency among the variables
    – Gives a specification of the joint probability distribution

[Figure: a directed graph over the nodes X, Y, Z, and P]

•  Nodes: random variables
•  Links: dependency
•  X and Y are the parents of Z, and Y is the parent of P
•  No dependency between Z and P
•  No loops or cycles


Bayesian Belief Network: Example

[Figure: a belief network over the nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea]

The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of the values of its parents:

       (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC     0.8       0.5        0.7        0.1
~LC    0.2       0.5        0.3        0.9

$$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i))$$


Training Bayesian Networks

•  Given both the network structure and all variables observable: learn only the CPTs

•  Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning

•  Network structure unknown, all variables observable: search through the model space to reconstruct graph topology

•  Unknown structure, all hidden variables: no good algorithms known for this purpose


Associative Classification

•  Mine possible association rules (PR) in the form of condset → c
    – Condset: a set of attribute-value pairs
    – c: class label
•  Build classifier
    – Organize rules according to decreasing precedence based on confidence and support
•  Classification
    – Use the first matching rule to classify an unknown case


Associative Classification Methods

•  CBA (Classification By Association: Liu, Hsu & Ma, KDD’98)
    – Mine possible association rules in the form of
        •  Cond-set (a set of attribute-value pairs) → class label
    – Build classifier: organize rules according to decreasing precedence based on confidence and then support
•  CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
    – Classification: statistical analysis on multiple rules


CMAR – Model Generation

•  Classification based on Multiple Association Rules
•  Efficiency: uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset
•  Rule pruning whenever a rule is inserted into the tree
    – Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
    – Prune rules where the rule antecedent and class are not positively correlated, based on a χ² test of statistical significance



CMAR – Classification

•  Classification based on generated/pruned rules
•  If only one rule satisfies tuple X, assign the class label of the rule
•  If a rule set S satisfies X, CMAR
    – Divides S into groups according to class labels
    – Uses a weighted χ² measure to find the strongest group of rules, based on the statistical correlation of rules within a group
    – Assigns X the class label of the strongest group

Classification by Aggregating Emerging Patterns

•  Emerging pattern (EP): a pattern frequent in one class of data but infrequent in others
    – Age<=30 is frequent in class “buys_computer=yes” and infrequent in class “buys_computer=no”
    – Rule: age<=30 → buys computer
•  G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD’99


How to Mine Emerging Patterns?

•  Border differential
    – Max-patterns in D1 w.r.t. min_sup=90%
    – Max-patterns in D2 w.r.t. min_sup=10%
    – X is a pattern covered by a max-pattern in D1 but not by a max-pattern in D2 → X is an emerging pattern
•  Method
    – Mine max-patterns in D1 and D2, respectively
    – Compare the two sets of borders, find the “maximal” patterns that are frequent in D1 and infrequent in D2


Instance-based Methods

•  Instance-based learning
    – Store training examples and delay the processing until a new instance must be classified (“lazy evaluation”)
•  Typical approaches
    – K-nearest neighbor approach
        •  Instances represented as points in a Euclidean space
    – Locally weighted regression
        •  Construct local approximation
    – Case-based reasoning
        •  Use symbolic representations and knowledge-based inference


The K-Nearest Neighbor Method

•  Instances are points in an n-D space
•  The k-nearest neighbors (KNN) in the Euclidean distance
    – Return the most common value among the k training examples nearest to the query point
•  Discrete-/real-valued target functions

[Figure: a query point xq surrounded by positive (+) and negative (−) training examples]


KNN Methods

•  For continuous-valued target functions, return the mean value of the k nearest neighbors
•  Distance-weighted nearest neighbor algorithm
    – Give greater weights to closer neighbors

$$w \equiv \frac{1}{d(x_q, x_i)^2}$$

•  Robust to noisy data by averaging k-nearest neighbors
•  Curse of dimensionality
    – Distance could be dominated by irrelevant attributes
    – Axes stretch or elimination of the least relevant attributes
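A minimal Python sketch of KNN classification, with and without distance weighting, is given below. The function name, the three training points, and the rule of returning the label of an exactly matching point are assumptions made for illustration.

```python
from collections import defaultdict
from math import dist

def knn_classify(train, query, k=3, weighted=False):
    """train: list of (point, label). Vote among the k nearest neighbors of query."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(point, query)
        if weighted:
            if d == 0:
                return label              # assumption: an exact match decides
            votes[label] += 1 / d ** 2    # weight w = 1 / d(xq, xi)^2
        else:
            votes[label] += 1             # plain majority vote
    return max(votes, key=votes.get)

# Hypothetical training points: one close positive, two distant negatives.
train = [((0.0, 0.0), "+"), ((3.0, 0.0), "-"), ((0.0, 3.0), "-")]
query = (0.5, 0.5)

plain = knn_classify(train, query, k=3)
weighted = knn_classify(train, query, k=3, weighted=True)
print(plain, weighted)
```

With k = 3 the plain vote follows the two distant negatives, while the distance-weighted vote follows the single close positive, which illustrates why closer neighbors are given greater weights.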



Case-based Reasoning

•  Lazy evaluation + analysis of similar instances

•  Methodology
    – Instances represented by rich symbolic descriptions (e.g., function graphs)
    – Combine multiple retrieved cases
    – Tight coupling between case retrieval, knowledge-based reasoning, and problem solving


Lazy vs. Eager Learning

•  Efficiency: lazy learning uses less training time but more predicting time

•  Accuracy
    – Lazy method effectively uses a richer hypothesis space
    – Eager: must commit to a single hypothesis that covers the entire instance space


Artificial Neural Networks

•  (To some extent) simulating biological neural networks

•  Basic mechanisms – Perceptrons – Multilayer networks

•  Essential algorithm: BACKPROPAGATION


Perceptrons

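•  Input: a vector of real values
•  Calculate a linear combination of inputs
    – Output 1 if the result is positive, -1 otherwise

[Figure: inputs x1, …, xn with weights w1, …, wn, plus x0 = 1 with weight w0, feed a summation unit Σ followed by a threshold]

$$o = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$$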


Why Perceptrons?

•  A perceptron is a hyperplane decision surface
•  Perceptrons represent all primitive Boolean functions
    – AND, OR, …
    – AND: w0 = -0.8, w1 = w2 = 0.5
    – OR: w0 = -0.3, w1 = w2 = 0.5

[Figure: a linearly separable set of + and − points, and a linearly inseparable set]
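The AND and OR weights can be checked with a few lines of Python. The sketch assumes Boolean inputs encoded as 0 and 1 and outputs of +1 (true) and −1 (false); the function and variable names are illustrative.

```python
def perceptron(weights, inputs):
    """Output 1 if w0 + sum_i w_i * x_i > 0, else -1 (w0 is the weight on x0 = 1)."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else -1

AND_W = (-0.8, 0.5, 0.5)
OR_W = (-0.3, 0.5, 0.5)

and_table = {(a, b): perceptron(AND_W, (a, b)) for a in (0, 1) for b in (0, 1)}
or_table = {(a, b): perceptron(OR_W, (a, b)) for a in (0, 1) for b in (0, 1)}
print(and_table)
print(or_table)
```

Only the threshold weight w0 differs between the two functions: it moves the same separating line so that either one active input (OR) or two active inputs (AND) are needed to cross it.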


Training Perceptrons

•  Begin with random weights
•  Iteratively apply the perceptron to each training example, modifying the weights whenever it misclassifies an example
    – wi ← wi + Δwi
    – Δwi = η(t − o)xi
    – t: the target output for the current example
    – o: the output generated by the perceptron
    – η: a positive constant called the learning rate



Why Does the Training Work?

wi ← wi + Δwi, where Δwi = η(t − o)xi

•  The training example is correctly classified
    – (t − o) = 0 → Δwi = 0
•  The target output is +1 and the perceptron outputs −1
    – (t − o) = 2
    – If xi > 0, increasing wi will increase the output
    – If xi < 0, decreasing wi will increase the output
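The training rule can be sketched in Python on the AND function. The function names, the learning rate of 0.1, the initial weight range of −0.5 to 0.5, the fixed random seed, and the cap of 1,000 passes over the data are all assumptions made for illustration.

```python
import random

def output(weights, x):
    """Perceptron output for an input vector x that already includes x0 = 1."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else -1

def train_perceptron(examples, eta=0.1, max_epochs=1000, seed=0):
    """Apply w_i <- w_i + eta * (t - o) * x_i until no example is misclassified."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n)]   # random initial weights
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = output(weights, x)
            if o != t:
                mistakes += 1
                weights = [w + eta * (t - o) * xi for w, xi in zip(weights, x)]
        if mistakes == 0:
            break
    return weights

# AND over inputs in {0, 1}, with x0 = 1 prepended; targets are +1 / -1.
examples = [((1, 0, 0), -1), ((1, 0, 1), -1), ((1, 1, 0), -1), ((1, 1, 1), 1)]
weights = train_perceptron(examples)
print([round(w, 2) for w in weights])
```

Because AND is linearly separable, the loop stops once a pass over the data produces no mistakes. For a linearly inseparable target such as XOR it would run until `max_epochs` without finding such weights.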


The Sigmoid Unit

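•  Similar to perceptron, except for the output function
•  σ: the sigmoid or logistic function

[Figure: inputs x1, …, xn with weights w1, …, wn, plus x0 = 1 with weight w0, feed a summation unit Σ followed by the sigmoid function]

$$net = \sum_{i=0}^{n} w_i x_i, \qquad o = \sigma(net) = \frac{1}{1 + e^{-net}}$$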


Multilayer Networks

•  Each layer has some perceptrons/sigmoid units

•  A unit connects to all units in neighbor layers


Backpropagation Algorithm

•  Training a multilayer network
•  Create a feed-forward network
•  Initialize all network weights to small random numbers (e.g., -0.5 to 0.5)
•  Until the termination condition is met, do
    – For each training example
        •  Propagate the input forward through the network
        •  Propagate the errors backward through the network
        •  Update each network weight


Termination Conditions

•  A fixed number of iterations
•  Once the error on the training examples is below some threshold
•  Once the error on the test set meets some criterion


When Are ANNs Good?

•  The training data are noisy and complex – Example: sensor data, image data

•  More symbolic representations are often used – Similar to the capability of decision trees



Appropriate Problems for ANNs

•  Many attributes
•  Target function may be discrete- or real-valued, or a vector of multiple attributes
•  Errors in training examples
•  Long training time is acceptable
•  Fast classification time is required
•  Understandability is unimportant


Support Vector Machines (SVM)

Support Vectors

Small Margin Large Margin


Linear SVM

•  Given a set of points $x_i \in \Re^n$ with label $y_i \in \{-1, 1\}$
•  The SVM finds a hyperplane separating the positive and negative samples


Separate Samples by Projection

•  For linearly inseparable data, project the data to a high dimensional space where it is linearly separable

[Figure: + and − points at -1, 0, and +1 on a line, projected to the points (1,0), (0,0), and (0,1) in two dimensions]


Non-linear SVM

•  Learn a hyperplane
•  Use quadratic programming techniques
•  Using kernels can learn very complex functions


Non-linear SVM: An Example



Errors in Classification

•  Bias: the difference between the real class boundary and the decision boundary of a classification model
•  Variance: variability in the training data set
•  Intrinsic noise in the target class: the target class can be non-deterministic, i.e., instances with the same attribute values can have different class labels


Bias



One or More?

•  What if a medical doctor is not sure about a case?
    – Joint-diagnosis: using a group of doctors carrying different expertise
    – Wisdom from the crowd is often more accurate
•  All eager learning methods make predictions using a single classifier induced from training data
    – A single classifier may have low confidence in some cases
•  Ensemble methods: construct a set of base classifiers and take a vote on predictions in classification


Ensemble Classifiers

•  Step 1: create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D
•  Step 2: build multiple classifiers C1, C2, …, Ct-1, Ct
•  Step 3: combine the classifiers into C*
    – C*(x) = Vote(C1(x), …, Ct(x))



Why May Ensemble Method Work?

•  Suppose there are two classes and each base classifier has an error rate of 35%

•  What if we use 25 base classifiers?
   –  If all base classifiers are identical, the ensemble error rate is still 35%
   –  If base classifiers are independent, the ensemble makes a wrong prediction only if more than half of the base classifiers are wrong

$$\sum_{i=13}^{25} \binom{25}{i}\, 0.35^{i}\, 0.65^{25-i} = 0.06$$
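The binomial sum for the ensemble error rate can be checked numerically. This is a small sketch under the slide's assumptions (two classes, independent base classifiers, majority vote); the function name `ensemble_error` is made up for illustration.

```python
from math import comb

def ensemble_error(n_classifiers, base_error):
    """Probability that more than half of n independent base classifiers,
    each wrong with probability base_error, are wrong at the same time."""
    majority = n_classifiers // 2 + 1  # 13 when there are 25 classifiers
    return sum(
        comb(n_classifiers, i)
        * base_error ** i
        * (1.0 - base_error) ** (n_classifiers - i)
        for i in range(majority, n_classifiers + 1)
    )

error_25 = ensemble_error(25, 0.35)   # the case discussed on the slide
error_coin = ensemble_error(25, 0.5)  # base classifiers no better than guessing
```

With base classifiers that only guess at random (error 0.5), the ensemble error stays at 0.5, so the vote helps only when each base classifier beats random guessing.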

Ensemble Error Rate




Ensemble Classifiers – When?

•  The base classifiers should be independent of each other

•  Each base classifier should do better than a classifier that performs random guessing


How to Construct Ensemble?

•  Manipulating the training set: derive multiple training sets and build a base classifier on each

•  Manipulating the input features: use only a subset of features in a base classifier

•  Manipulating the class labels: if there are many classes, randomly divide the classes into two subsets A and B and train a base classifier on the resulting two-class problem; for a test case, if a base classifier predicts subset A, all classes in A receive a vote

•  Manipulating the learning algorithm, e.g., using different network configuration in ANN


Bootstrap

•  Given an original training set T, derive a training set T’ by repeated uniform sampling with replacement

•  If T has n tuples, each tuple has a probability p = 1 - (1 - 1/n)^n of being selected in T’
   –  When n → ∞, p → 1 - 1/e ≈ 0.632

•  Use the tuples not in T’ as the test set
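The bootstrap procedure and the 0.632 figure can be sketched as follows. The helper names `inclusion_probability` and `bootstrap_split` are illustrative, and the fixed random seed is only there to make the sketch repeatable.

```python
import math
import random

def inclusion_probability(n):
    """p = 1 - (1 - 1/n)^n: chance that a given tuple appears in T'."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def bootstrap_split(data, rng):
    """Draw len(data) tuples uniformly with replacement as the training set;
    the tuples that were never drawn form the test set."""
    n = len(data)
    chosen = [rng.randrange(n) for _ in range(n)]
    drawn = set(chosen)
    train = [data[i] for i in chosen]
    test = [data[i] for i in range(n) if i not in drawn]
    return train, test

rng = random.Random(42)
data = list(range(1000))
train, test = bootstrap_split(data, rng)
distinct_fraction = len(set(train)) / len(data)  # close to 0.632 for large n
```

The training set has the same size as T but contains repeated tuples, so only about 63.2% of the distinct tuples are expected to appear in it.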


Bagging

•  Run bootstrap k times to obtain k base classifiers
•  A test instance is assigned to the class that receives the highest number of votes
•  Strength: reduce the variance of base classifiers – good for unstable base classifiers
   –  Unstable classifiers: sensitive to minor perturbations in the training set, e.g., decision trees, associative classifiers, and ANN
•  For stable classifiers (e.g., linear discriminant analysis and kNN classifiers), bagging may even degrade the performance since the training sets are smaller
•  Less overfitting on noisy data
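A minimal sketch of bagging, assuming a toy base learner: `train_stump` below is a hypothetical one-threshold classifier on a single numeric feature, standing in for whatever unstable base classifier is actually used. The data set and seed are invented for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples uniformly with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(data):
    """Hypothetical base learner: the single threshold (and orientation)
    with the fewest training errors on (x, label) pairs."""
    best = None
    for threshold in sorted({x for x, _ in data}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y
                         for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def bagging(data, k, rng):
    """Run bootstrap k times and train one base classifier per sample."""
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(k)]

def vote(classifiers, x):
    """Assign x to the class that receives the highest number of votes."""
    return Counter(c(x) for c in classifiers).most_common(1)[0][0]

rng = random.Random(0)
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
ensemble = bagging(data, k=25, rng=rng)
low, high = vote(ensemble, 2), vote(ensemble, 17)
```

An odd k avoids ties in the two-class case; with more classes a tie-breaking rule would have to be chosen.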


Boosting

•  Assign a weight to each training example
   –  Initially, each example is assigned a weight 1/n
•  Weights can be used in one of the following ways
   –  Weights as a sampling distribution to draw a set of bootstrap samples from the original training set
   –  Weights used by a base classifier to learn a model biased towards heavier examples
•  Adaptively change the weight at the end of each boosting round
   –  The weight of an example correctly classified decreases
   –  The weight of an example incorrectly classified increases
•  Each round generates a base classifier


Critical Design Choices in Boosting

•  How are the weights of the training examples updated at the end of each boosting round?

•  How are the predictions made by base classifiers combined?



AdaBoost

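•  Each base classifier carries an importance score related to its error rate
   –  Error rate: $$\varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, I\big(C_i(x_j) \neq y_j\big)$$
   –  $w_j$: weight, I(p) = 1 if p is true and 0 otherwise
   –  Importance score: $$\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_i}{\varepsilon_i}\right)$$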


How Does Importance Score Work?


Weight Adjustment in AdaBoost

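•  Weight update:

$$w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} \exp(-\alpha_i) & \text{if } C_i(x_j) = y_j \\ \exp(\alpha_i) & \text{if } C_i(x_j) \neq y_j \end{cases}$$

where $Z_i$ is the normalization factor, chosen so that $\sum_j w_j^{(i+1)} = 1$
   –  If any intermediate round generates an error rate of more than 50%, the weights are reverted back to 1/n
•  The ensemble error rate is bounded:

$$e_{ensemble} \le \prod_i \sqrt{\varepsilon_i (1 - \varepsilon_i)}$$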
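One boosting round can be sketched as below. This is an illustration under stated assumptions, not a full AdaBoost implementation: the weights are assumed to sum to 1, so the weighted error is computed as the total weight of the misclassified examples; the clamping of the error away from 0 and 1 is an added safeguard; and the function name `adaboost_round` is hypothetical.

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost round.

    weights: current example weights, assumed to sum to 1.
    correct: correct[j] is True if the base classifier got example j right.
    Returns (error rate, importance score, new weights).
    """
    n = len(weights)
    eps = sum(w for w, ok in zip(weights, correct) if not ok)
    if eps > 0.5:
        # Error rate above 50%: revert the weights back to 1/n.
        return eps, 0.0, [1.0 / n] * n
    # Assumed safeguard: keep the logarithm finite when eps is 0.
    eps_safe = min(max(eps, 1e-12), 1.0 - 1e-12)
    alpha = 0.5 * math.log((1.0 - eps_safe) / eps_safe)
    # Shrink the weights of correct examples, grow those of wrong ones.
    unnormalized = [w * math.exp(-alpha if ok else alpha)
                    for w, ok in zip(weights, correct)]
    z = sum(unnormalized)  # normalization factor
    return eps, alpha, [w / z for w in unnormalized]

eps, alpha, new_weights = adaboost_round(
    [0.25, 0.25, 0.25, 0.25], [True, True, True, False])
```

In this example the single misclassified example ends the round holding half of the total weight, which forces the next base classifier to focus on it.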


Random Forests



Random Forests

•  Using decision trees as base classifiers
•  Each tree uses only a subset of features in classification
•  Bagging can be used to generate training sets for decision trees


Forest-RI (Random Input)

•  Each tree is built using a random subset of features
•  A tree is grown to its entirety without pruning
•  The smaller the sets of features used by decision trees, the less correlated the trees
•  The larger the sets of features used by decision trees, the more accurate the trees
•  Tradeoff: m = log_2 d + 1
   –  m: number of features used by each tree
   –  d: number of features in the training set
•  If d is too small, it is hard to obtain independent feature sets
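The feature-subset rule can be sketched as follows. Rounding log2(d) down to an integer is an assumption (the slide gives only m = log2 d + 1), and the helper names are illustrative.

```python
import math
import random

def forest_ri_feature_count(d):
    """m = log2(d) + 1, with log2(d) rounded down (an assumed convention)."""
    return int(math.log2(d)) + 1

def random_feature_subset(d, rng):
    """Pick m distinct feature indices out of d for one tree."""
    return sorted(rng.sample(range(d), forest_ri_feature_count(d)))

rng = random.Random(7)
m_16 = forest_ri_feature_count(16)     # log2(16) + 1
m_100 = forest_ri_feature_count(100)   # floor(6.64) + 1
subset = random_feature_subset(16, rng)
```

For small d the subsets of different trees overlap heavily, which is the difficulty noted in the last bullet.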



Forest-RC

•  When the number of features in the training set is too small, use linear combinations of features to generate new features to build decision trees

•  Randomly select a set of features L
•  Linearly combine features in L using coefficients generated from a uniform distribution in the range of [-1, 1]

•  At each node, m such randomly combined new features are generated, and the best of them is used to split the node
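Generating the candidate features for one node might look like the sketch below. The function names and the sizes used in the example are hypothetical, and choosing the best candidate by a split criterion is left out.

```python
import random

def random_combination(num_features, subset_size, rng):
    """Select a random feature set L and one coefficient per feature,
    drawn uniformly from [-1, 1]."""
    features = rng.sample(range(num_features), subset_size)
    coefficients = [rng.uniform(-1.0, 1.0) for _ in features]
    return features, coefficients

def apply_combination(row, features, coefficients):
    """Value of the new combined feature for one data row."""
    return sum(c * row[f] for f, c in zip(features, coefficients))

def candidate_features(num_features, subset_size, m, rng):
    """Generate m randomly combined features for one tree node."""
    return [random_combination(num_features, subset_size, rng)
            for _ in range(m)]

rng = random.Random(3)
candidates = candidate_features(num_features=4, subset_size=2, m=3, rng=rng)
value = apply_combination([1.0, 2.0, 3.0], [0, 2], [0.5, -1.0])
```

Each candidate would then be scored with the tree's usual split criterion, and the best one is used to split the node.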

Prediction and Time

•  “Prediction is very difficult, especially about the future.”
   –  Niels Bohr (1885 - 1962)
•  “An economist is an expert who will know tomorrow why the things he predicted yesterday didn’t happen today.”
   –  Laurence J. Peter (1919 - 1988)
•  “If something anticipated arrives too late it finds us numb, wrung out from waiting, and we feel – nothing at all. The best things arrive on time.”
   –  Dorothy Gilman, A New Kind of Country, 1978


Early Diagnosis

•  A retrospective study of the clinical data of infants admitted to a neonatal intensive care unit found that infants who were diagnosed with sepsis had abnormal heartbeat time series patterns in the 24 hours preceding the diagnosis

•  Monitoring the heartbeat time series data and classifying the time series data as early as possible may lead to early diagnosis and effective therapy


Online Traffic Classification

•  By observing only the first five packets of a TCP connection, the application associated with the traffic flow can be classified accurately

•  The applications of online traffic can be identified without waiting for the TCP flow to end


Objectives in Classification

•  Objective function for optimization: quality, i.e., how well does the learned model approximate the latent mechanism?
   –  Accuracy
   –  Recall
   –  F-measure
   –  Many other measures

•  Time is not a first-class citizen in the problem setting!


Early Classification – Problem

•  Data: time series / temporal sequences with class labels

•  Task: construct a model that can produce class label predictions as early as possible
   –  How can we measure the earliness?
   –  Expected classification time (other measures are possible)
•  Constraint: classification quality should be at a satisfactory level
   –  How can we constrain the quality? An important issue



The Price on Time

•  In the classical classification problem, time is not considered as a factor

•  In early classification, time has a price — a model becomes more and more expensive as it consumes longer and longer prefixes


[Figure: the complexity of the model concerned plotted against time, in traditional classification and in early classification]

Example


[Figure: example time series of Class Star and Class Diamond, with Feature A and Feature B marked]

Challenges

•  How to incorporate time in classification?
•  How to avoid overfitting in time?
•  How to make a balanced tradeoff between time and quality?
•  Simplification
   –  All time series are normalized and aligned
   –  All time series have the same length
   –  For each time moment, we can get a snapshot of the distribution of the time series: each time series is a point


Example


Find the earliest classification time

[Figure: time axis marking t1, t2, and t3]

A Nearest Neighbor Idea

•  Exploit the extreme of local features: use 1NN to classify
•  A time series can be used reliably in 1NN classification if its local neighborhood belongs to the same class
   –  For a time series s, the reverse 1NNs of s use s in classification: s* is a reverse 1NN of s if s is the 1NN of s*
   –  The local neighborhood of s can be captured by the reverse 1NNs of s
   –  The local neighborhood of s is pure if all time series in the neighborhood belong to the same class as s does
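The reverse-1NN neighborhood and the purity test can be sketched as follows. Euclidean distance between equal-length series is an assumption (the slide does not name a distance), and the tiny data set and function names are invented for illustration.

```python
import math

def euclidean(a, b):
    """Distance between two equal-length time series (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(i, series):
    """Index of the 1NN of series[i] among all the other series."""
    return min((j for j in range(len(series)) if j != i),
               key=lambda j: euclidean(series[i], series[j]))

def reverse_1nn(series):
    """rnn[i] lists every j whose 1NN is i, i.e. the series that would
    use series[i] when classified with 1NN."""
    rnn = {i: [] for i in range(len(series))}
    for j in range(len(series)):
        rnn[nearest_neighbor(j, series)].append(j)
    return rnn

def is_pure(i, rnn, labels):
    """The neighborhood of i is pure if all its reverse 1NNs share its class."""
    return all(labels[j] == labels[i] for j in rnn[i])

series = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0], [4.0, 0.0]]
labels = ["A", "A", "B", "B", "B"]
rnn = reverse_1nn(series)
```

Here the series at index 1 is the 1NN of a class-B series (index 4), so its neighborhood is impure, while the series at index 2 has a pure neighborhood.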


[Figure: an impure and a pure local neighborhood of a time series s]

Minimum Prediction Length

•  The length of the shortest prefix that can be trusted to classify other time series as accurately as the full-length time series


[Figure: training time series a, b, c, d of full length L = 50, with MPL values 33, 33, 41, 41; NN(s) = a[1,1] (prefix length < 33); NN(s) = d[1,41] (prefix length >= 41), red class]


Extracting Interpretable Features

•  Using features can improve generality of classifiers and help to avoid overfitting

•  Features can characterize the latent data generating mechanism

•  In some applications, such as medical diagnosis, users want to know not only what (i.e., the class labels), but also why (i.e., the manifesting features explaining the decision process)


Shapelets as Features

•  A shapelet is a pair (s, δ): a time series segment s and a distance threshold δ

•  Optimization objective: best information gain
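How a shapelet (s, δ) could be scored by information gain is sketched below. The matching rule is an assumption based on a common definition: a series matches when the smallest Euclidean distance between s and any equally long window of the series is at most δ. The toy data and function names are invented for illustration.

```python
import math
from collections import Counter

def subsequence_distance(series, segment):
    """Smallest Euclidean distance between the segment and any window
    of the same length in the series (assumed matching rule)."""
    m = len(segment)
    return min(
        math.sqrt(sum((series[i + k] - segment[k]) ** 2 for k in range(m)))
        for i in range(len(series) - m + 1)
    )

def entropy(labels):
    """Shannon entropy in bits; 0 for an empty list."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(dataset, segment, delta):
    """Gain from splitting (series, label) pairs by whether they match."""
    hit = [y for x, y in dataset if subsequence_distance(x, segment) <= delta]
    miss = [y for x, y in dataset if subsequence_distance(x, segment) > delta]
    n = len(dataset)
    before = entropy([y for _, y in dataset])
    after = len(hit) / n * entropy(hit) + len(miss) / n * entropy(miss)
    return before - after

dataset = [
    ([0, 0, 1, 2, 1, 0], "star"),
    ([0, 1, 2, 1, 0, 0], "star"),
    ([0, 0, 0, 0, 0, 0], "diamond"),
    ([1, 1, 1, 1, 1, 1], "diamond"),
]
gain = information_gain(dataset, segment=[1, 2, 1], delta=0.5)
```

In this toy data the bump [1, 2, 1] occurs only in the "star" series, so the split is perfect and the gain equals the full one bit of class entropy.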


[Figure: Feature A and Feature B shown as shapelets on time series of Class Diamond and Class Star]

Local Distinctive Shapelets

•  Finding shapelets shared by some time series that belong to the same class but not by the other time series

•  Finding shapelets that appear early in time series

•  Rank features according to their precision, recall, and earliness
   –  Use a minimum threshold to avoid overfitting
   –  More sophisticated methods may be used


Early Classification Is Accurate


[Figure: accuracy (0 to 100) of EDSC-CHE, EDSC-KDE, ECTS, and full-length 1NN on the data sets ECG, GunPoint, CBF, Syn-Con, Wafer, Olive, and Two-Patterns]

Earliness


[Figure: percentage of average prediction length (0 to 100) of EDSC-CHE, EDSC-KDE, ECTS, and full-length 1NN on the data sets ECG, GunPoint, CBF, Syn-Con, Wafer, Olive, and Two-Patterns]

Features (Gun-Point Data Set)



Summary

•  Early classification is useful in a few important applications
   –  Medical and health informatics applications
   –  Intrusion detection
   –  Security and safety
   –  Some of our algorithms are being tested for astronaut tiredness prediction in a spaceship project
•  Early prediction may be connected to many frontiers of data mining research
   –  Subspace feature extraction
   –  Concise and non-redundant feature representation
•  The problem has not been thoroughly investigated


Open Problems

•  How to balance earliness and classification quality?
   –  Progressive and interactive early classification?
•  How to “rescue” mis-classified samples over time?
   –  Re-classification?
•  Theoretical foundation for early classification?
•  Earliness aware data mining?
•  A more general model for cost-sensitive learning?
