Classification and Prediction Classificationnovel.ict.ac.cn/files/Day 4.pdf · 2014-05-08 1...
2014-05-08
Classification
Jian Pei: Big Data Analytics -- Classification 2
Classification and Prediction
• Classification: predict categorical class labels
  – Build a model for a set of classes/concepts
  – Classify loan applications (approve/decline)
• Prediction: model continuous-valued functions
  – Predict the economic growth in 2015
Classification: A 2-step Process
• Model construction: describe a set of predetermined classes
  – Training dataset: tuples for model construction
    • Each tuple/sample belongs to a predefined class
  – Classification rules, decision trees, or math formulae
• Model application: classify unseen objects
  – Estimate accuracy of the model using an independent test set
  – Acceptable accuracy → apply the model to classify tuples with unknown class labels
Model Construction
Training Data → Classification Algorithms → Classifier (Model)

Training data:
Name  Rank        Years  Tenured
Mike  Ass. Prof   3      No
Mary  Ass. Prof   7      Yes
Bill  Prof        2      Yes
Jim   Asso. Prof  7      Yes
Dave  Ass. Prof   6      No
Anne  Asso. Prof  3      No

Classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Model Application
The classifier is applied to the testing data and then to unseen data.

Testing data:
Name     Rank        Years  Tenured
Tom      Ass. Prof   2      No
Merlisa  Asso. Prof  7      No
George   Prof        5      Yes
Joseph   Ass. Prof   7      Yes

Unseen data: (Jeff, Professor, 4). Tenured?
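The two steps can be sketched in a few lines of Python. This is only an illustration: the function name `predict_tenured` and the lowercase rank strings are assumptions, while the rule and the tuples come from the slides.

```python
def predict_tenured(rank: str, years: int) -> str:
    """The slide's rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "professor" or years > 6 else "no"

# (name, rank, years, actual label) taken from the testing data.
test_data = [
    ("Tom", "assistant professor", 2, "no"),
    ("Merlisa", "associate professor", 7, "no"),
    ("George", "professor", 5, "yes"),
    ("Joseph", "assistant professor", 7, "yes"),
]

# Step 1: estimate accuracy on the independent test set.
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_data)
accuracy = correct / len(test_data)

# Step 2: classify the unseen tuple (Jeff, Professor, 4).
jeff = predict_tenured("professor", 4)

print(accuracy, jeff)
```

The rule misclassifies Merlisa (7 years, not tenured), so the estimated accuracy is 3 out of 4, and Jeff is predicted to be tenured.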
Supervised/Unsupervised Learning
• Supervised learning (classification)
  – Supervision: objects in the training data set have labels
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Data Preparation
• Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
  – Remove the irrelevant or redundant attributes
• Data transformation
  – Generalize and/or normalize data
Measurements of Quality
• Prediction accuracy
• Speed and scalability
  – Construction speed and application speed
• Robustness: handle noise and missing values
• Scalability: build model for large training data sets
• Interpretability: understandability of models
Decision Tree Induction
• Decision tree representation
• Construction of a decision tree
• Inductive bias and overfitting
• Scalable enhancements for large databases
Decision Tree
• A node in the tree: a test of some attribute
• A branch: a possible value of the attribute
• Classification
  – Start at the root
  – Test the attribute
  – Move down the tree branch

Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
Training Dataset

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No
Appropriate Problems
• Instances are represented by attribute-value pairs
  – Extensions of decision trees can handle real-valued attributes
• Disjunctive descriptions may be required
• The training data may contain errors or missing values
Basic Algorithm ID3
• Construct a tree in a top-down recursive divide-and-conquer manner
  – Which attribute is the best at the current node?
  – Create a node for each possible attribute value
  – Partition training data into descendant nodes
• Conditions for stopping recursion
  – All samples at a given node belong to the same class
  – No attributes remain for further partitioning
    • Majority voting is employed for classifying the leaf
  – There is no sample at the node
Which Attribute Is the Best?
• The attribute most useful for classifying examples
• Information gain and Gini index
  – Statistical properties
  – Measure how well an attribute separates the training examples
Entropy
• Measure homogeneity of examples

$$\mathrm{Entropy}(S) \equiv -\sum_{i=1}^{c} p_i \log_2 p_i$$

  – S is the training data set, and $p_i$ is the proportion of S belonging to class i
• The smaller the entropy, the purer the data set
Information Gain
• The expected reduction in entropy caused by partitioning the examples according to an attribute

$$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

Values(A) is the set of all possible values for attribute A, and $S_v$ is the subset of S for which attribute A has value v
Example

Training data: the 14-row PlayTennis dataset shown on the Training Dataset slide (9 Yes, 5 No).

$$\mathrm{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$$

$$\begin{aligned}
\mathrm{Gain}(S, \mathrm{Wind}) &= \mathrm{Entropy}(S) - \sum_{v \in \{\mathrm{Weak},\,\mathrm{Strong}\}} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \\
&= \mathrm{Entropy}(S) - \frac{8}{14}\,\mathrm{Entropy}(S_{\mathrm{Weak}}) - \frac{6}{14}\,\mathrm{Entropy}(S_{\mathrm{Strong}}) \\
&= 0.94 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.00 = 0.048
\end{aligned}$$
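The entropy and information gain figures for the PlayTennis data can be checked with a short sketch. The names `ROWS`, `entropy`, and `information_gain` are illustrative choices, not part of the lecture; the data is the 14-row table from the slides.

```python
from collections import Counter
from math import log2

COLUMNS = ("Outlook", "Temp", "Humid", "Wind")
ROWS = [  # (Outlook, Temp, Humid, Wind, PlayTennis)
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2 p_i over the class proportions in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

total_entropy = entropy([r[-1] for r in ROWS])
gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS)}
best = max(gains, key=gains.get)
print(round(total_entropy, 3), {k: round(v, 3) for k, v in gains.items()}, best)
```

On this data the gain for Wind comes out at about 0.048, and Outlook has the largest gain, which is why it sits at the root of the example tree.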
Hypothesis Space Search in Decision Tree Building
• Hypothesis space: the set of possible decision trees
• ID3: simple-to-complex, hill-climbing search
  – Evaluation function: information gain
Capabilities and Limitations
• The hypothesis space is complete
• Maintains only a single current hypothesis
• No backtracking
  – May converge to a locally optimal solution
• Use all training examples at each step
  – Make statistics-based decisions
  – Not sensitive to errors in individual examples
Natural Bias
• The information gain measure favors attributes with many values
• An extreme example
  – Attribute “date” may have the highest information gain
  – A very broad decision tree of depth one
  – Inapplicable to any future data
Alternative Measures
• Gain ratio: penalize attributes like date by incorporating split information

$$\mathrm{GainRatio}(S, A) \equiv \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$

• Split information is sensitive to how broadly and uniformly the attribute splits the data

$$\mathrm{SplitInformation}(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|}\log_2\frac{|S_i|}{|S|}$$

  – $S_1, \ldots, S_c$ are the subsets obtained by partitioning S on the c values of attribute A
• Gain ratio can be undefined or very large
  – Only test attributes with above-average gain
Measuring Inequality
Lorenz curve
• X-axis: quintiles
• Y-axis: accumulative share of income earned by the plotted quintile
• Gap between the actual lines and the mythical line: the degree of inequality

Gini index
• Gini = 0: even distribution
• Gini = 1: perfectly unequal
• The greater the distance, the more unequal the distribution
Gini Index (Adjusted)
• A data set S contains examples from n classes

$$\mathrm{gini}(S) = 1 - \sum_{j=1}^{n} p_j^2$$

  – $p_j$ is the relative frequency of class j in S
• A data set S of size N is split into two subsets $S_1$ and $S_2$ with sizes $N_1$ and $N_2$ respectively

$$\mathrm{gini}_{split}(S) = \frac{N_1}{N}\,\mathrm{gini}(S_1) + \frac{N_2}{N}\,\mathrm{gini}(S_2)$$

• The attribute that provides the smallest $\mathrm{gini}_{split}(S)$ is chosen to split the node
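A minimal sketch of the Gini computation follows. The function names are illustrative, and the example split is an assumption chosen for convenience: the PlayTennis labels partitioned by Wind (Weak: 6 Yes and 2 No; Strong: 3 Yes and 3 No).

```python
from collections import Counter

def gini(labels):
    """gini = 1 - sum_j p_j^2, where p_j is the relative frequency of class j."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted average of the Gini index of the two subsets."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# PlayTennis labels partitioned by Wind.
weak = ["Yes"] * 6 + ["No"] * 2
strong = ["Yes"] * 3 + ["No"] * 3

before = gini(weak + strong)
after = gini_split(weak, strong)
print(round(before, 4), round(after, 4))
```

A pure subset has Gini index 0, and the split value can be compared across candidate attributes in the same way as information gain, except that the smallest value wins.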
Extracting Classification Rules
• Classification rules can be extracted from a decision tree
• Each path from the root to a leaf → an IF-THEN rule
  – All attribute-value pairs along a path form a conjunctive condition
  – The leaf node holds the class prediction
  – IF age = “<=30” AND student = “no” THEN buys_computer = “no”
• Rules are easy to understand
Inductive Bias
• Inductive bias: the set of assumptions that, together with the training data, deductively justifies the classification of future instances
  – Preferences of the classifier construction
• Shorter trees are preferred over longer trees
• Trees that place high information gain attributes close to the root are preferred
Why Prefer Short Trees?
• Occam’s razor: prefer the simplest hypothesis that fits the data
  – “One should not increase, beyond what is necessary, the number of entities required to explain anything”
  – Also known as the principle of parsimony
• Fewer short trees than long trees
• A short tree is less likely to be a statistical coincidence
Overfitting
• A decision tree T may overfit the training data
  – if ∃ an alternative tree T’ s.t. T has a higher accuracy than T’ over the training examples, but T’ has a higher accuracy than T over the entire distribution of data
• Why overfitting?
  – Noisy data

[Figure: trees T and T’ compared on the training data and on all data]
Avoid Overfitting
• Prepruning: stop growing the tree earlier
  – Difficult to choose an appropriate threshold
• Postpruning: remove branches from a “fully grown” tree
  – Use an independent set of data to prune
• Key: how to determine the correct final tree size
Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross validation, e.g., 10-fold cross validation
• Use all the data for training
  – Apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
• Use minimum description length (MDL) principle
  – Halting growth of the tree when the encoding is minimized
Enhancements
• Allow for attributes of continuous values
  – Dynamically discretize continuous attributes
• Handle missing attribute values
• Attribute construction
  – Create new attributes based on existing ones that are sparsely represented
  – Reduce fragmentation, repetition, and replication
The Evaluation Issues
• The accuracy of a classifier can be evaluated using a test data set
  – The test set is a part of the available labeled data set
• But how can we evaluate the accuracy of a classification method?
  – A classification method can generate many classifiers
• What if the available labeled data set is too small?
Holdout Method
• Partition the available labeled data set into two disjoint subsets: the training set and the test set
  – 50-50
  – 2/3 for training and 1/3 for testing
• Build a classifier using the training set
• Evaluate the accuracy using the test set
Limitations of Holdout Method
• Fewer labeled examples for training
• The classifier highly depends on the composition of the training and test sets
  – The smaller the training set, the larger the variance
• If the test set is too small, the evaluation is not reliable
• The training and test sets are not independent
Cross-Validation
• Each record is used the same number of times for training and exactly once for testing
• K-fold cross-validation
  – Partition the data into k equal-sized subsets
  – In each round, use one subset as the test set, and use the remaining subsets together as the training set
  – Repeat k times
  – The total error is the sum of the errors in k rounds
• Leave-one-out: k = n
  – Utilize as much data as possible for training
  – Computationally expensive
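The partitioning step of k-fold cross-validation can be sketched as below. The helper `k_fold_indices` and its `seed` parameter are assumptions made for illustration; when n is not divisible by k, the folds differ in size by at most one record.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k rounds over n records."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # k disjoint, near-equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(sorted(test), len(train))
```

Setting k equal to n gives leave-one-out: every round trains on n − 1 records and tests on the remaining one.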
Bootstrap
• Use a bootstrap sample as the training set, use the tuples not in the training set as the test set
• .632 bootstrap: compute the overall accuracy by combining the accuracies of each bootstrap sample with the accuracy computed from a classifier using the whole data set as the training set

$$acc_{boot} = \frac{1}{k}\sum_{i=1}^{k}\left(0.632 \times \varepsilon_i + 0.368 \times acc_{all}\right)$$

  – $\varepsilon_i$ is the accuracy of the i-th bootstrap sample, and $acc_{all}$ is the accuracy of the classifier trained on the whole data set
Confidence Interval for Accuracy
• Suppose a classifier C is tested on a test set of n cases, and the accuracy is acc
• How much confidence can we have in acc?
• We need to estimate the confidence interval of a given model accuracy
Binomial Experiments
• When a coin is flipped, it has a probability p of landing heads up
• If the coin is flipped N times, what is the probability that the number of heads X equals v?

$$P(X = v) = \binom{N}{v}\,p^{v}\,(1 - p)^{N - v}$$

  – Expectation (mean): Np
  – Variance: Np(1 − p)
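Confidence Level and Approximation

Approximating using normal distribution

[Figure: standard normal density; the area between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ is 1 − α]

$$P\left(Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1 - p)/N}} < Z_{1-\alpha/2}\right) = 1 - \alpha$$

Confidence interval for p:

$$\frac{2 \cdot N \cdot acc + Z_{\alpha/2}^{2} \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^{2} + 4 \cdot N \cdot acc - 4 \cdot N \cdot acc^{2}}}{2\,(N + Z_{\alpha/2}^{2})}$$

$Z_{\alpha}$: the bound at confidence level (1 − α)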
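The interval formula is easy to evaluate numerically. In the sketch below the function name is illustrative, the default z = 1.96 is the usual normal quantile for a 95% confidence level, and the inputs acc = 0.8 and N = 100 are assumed values, not figures from the lecture.

```python
from math import sqrt

def accuracy_confidence_interval(acc, n, z=1.96):
    """Interval for the true accuracy p, given observed accuracy acc on n test cases."""
    center = 2 * n * acc + z ** 2
    spread = z * sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom

low, high = accuracy_confidence_interval(0.8, 100)
print(round(low, 3), round(high, 3))

# More test cases give a narrower interval for the same observed accuracy.
low_big, high_big = accuracy_confidence_interval(0.8, 1000)
print(round(low_big, 3), round(high_big, 3))
```

With 100 test cases an observed accuracy of 80% is compatible with a true accuracy anywhere from roughly 71% to 87%, which shows how little a small test set pins down.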
Accuracy Can Be Misleading …
• Consider a data set with 99% negative class and 1% positive class
• A classifier that predicts everything as negative has an accuracy of 99%, though it does not work for the positive class at all!
• Imbalanced class distribution is common in many applications
  – Medical applications, fraud detection, …
Performance Evaluation Matrix
Confusion matrix: used for imbalanced class distribution

                PREDICTED CLASS
ACTUAL CLASS    Class=Yes  Class=No
Class=Yes       a (TP)     b (FN)
Class=No        c (FP)     d (TN)

$$\mathrm{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$
Performance Evaluation Matrix
                PREDICTED CLASS
ACTUAL CLASS    Class=Yes  Class=No
Class=Yes       a (TP)     b (FN)
Class=No        c (FP)     d (TN)

• True positive rate (TPR, sensitivity) = TP / (TP + FN)
• True negative rate (TNR, specificity) = TN / (TN + FP)
• False positive rate (FPR) = FP / (TN + FP)
• False negative rate (FNR) = FN / (TP + FN)
Recall and Precision
• The target class is more important than the other classes

                PREDICTED CLASS
ACTUAL CLASS    Class=Yes  Class=No
Class=Yes       a (TP)     b (FN)
Class=No        c (FP)     d (TN)

• Precision p = TP / (TP + FP)
• Recall r = TP / (TP + FN)
Fallout
• Type I errors – false positive: a negative object is classified as positive
  – Precision is not related directly to this error! (Precision p = TP / (TP + FP))
  – Fallout: the type I error rate, FP / (FP + TN)
• Type II errors – false negative: a positive object is classified as negative
  – Captured by recall
Fβ Measure
• How can we summarize precision and recall into one metric?
  – Using the harmonic mean between the two

$$F\text{-measure}\ (F) = \frac{2rp}{r + p} = \frac{2\,TP}{2\,TP + FP + FN}$$

• $F_\beta$ measure

$$F_\beta = \frac{(\beta^2 + 1)\,rp}{r + \beta^2 p} = \frac{(\beta^2 + 1)\,TP}{(\beta^2 + 1)\,TP + \beta^2\,FN + FP}$$

  – β = 0: $F_\beta$ is the precision
  – β = ∞: $F_\beta$ is the recall
  – 0 < β < ∞: $F_\beta$ is a tradeoff between the precision and the recall
Weighted Accuracy
• A more general metric

$$\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$$

Measure    w1      w2  w3  w4
Recall     1       1   0   0
Precision  1       0   1   0
Fβ         β² + 1  β²  1   0
Accuracy   1       1   1   1
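The weight table can be verified against the direct definitions with a small sketch. The confusion-matrix counts and β = 2 below are made-up values for illustration, and `weighted_accuracy` is a hypothetical helper name.

```python
def weighted_accuracy(a, b, c, d, w1, w2, w3, w4):
    """(w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d), with a=TP, b=FN, c=FP, d=TN."""
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

# Hypothetical confusion-matrix counts.
TP, FN, FP, TN = 40, 10, 20, 30
beta = 2.0

# Direct definitions.
precision = TP / (TP + FP)
recall = TP / (TP + FN)
accuracy = (TP + TN) / (TP + TN + FP + FN)
f_beta = (beta ** 2 + 1) * recall * precision / (recall + beta ** 2 * precision)

# The same measures expressed through the weight table.
wa_recall = weighted_accuracy(TP, FN, FP, TN, 1, 1, 0, 0)
wa_precision = weighted_accuracy(TP, FN, FP, TN, 1, 0, 1, 0)
wa_f_beta = weighted_accuracy(TP, FN, FP, TN, beta ** 2 + 1, beta ** 2, 1, 0)
wa_accuracy = weighted_accuracy(TP, FN, FP, TN, 1, 1, 1, 1)

print(recall, precision, round(f_beta, 4), accuracy)
```

Each row of the table reproduces the corresponding direct formula, so the weighted form is a convenient single implementation for all four measures.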
ROC Curve
• Receiver Operating Characteristic (ROC)
• Example: a 1-dimensional data set containing 2 classes; any point located at x > t is classified as positive

ROC Curve
Points (TP rate, FP rate):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal
• Diagonal line:
  – Random guessing
  – Below diagonal line: prediction is opposite of the true class
Figure from [Tan, Steinbach, Kumar]
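The threshold sweep behind an ROC curve can be sketched for the 1-dimensional case. The data values and the function name `roc_points` are assumptions for illustration; each point is a (TP rate, FP rate) pair in the same order the slide uses.

```python
def roc_points(xs, labels):
    """Return one (TPR, FPR) pair per threshold t, predicting positive when x > t."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    # -inf classifies everything as positive; the largest x classifies nothing as positive.
    thresholds = [float("-inf")] + sorted(set(xs))
    points = []
    for t in thresholds:
        tp = sum(1 for x, y in zip(xs, labels) if x > t and y == 1)
        fp = sum(1 for x, y in zip(xs, labels) if x > t and y == 0)
        points.append((tp / positives, fp / negatives))
    return points

# Hypothetical 1-D data: label 1 is the positive class.
xs = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]

points = roc_points(xs, labels)
for tpr, fpr in points:
    print(round(tpr, 2), round(fpr, 2))
```

The first point is (1,1) and the last is (0,0), matching the two extreme classifiers listed on the slide; the thresholds in between trace the curve.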
Comparing Two Classifiers
Figure from [Tan, Steinbach, Kumar]
Cost-Sensitive Learning
• In some applications, misclassifying some classes may be disastrous
  – Tumor detection, fraud detection
• Using a cost matrix

                PREDICTED CLASS
ACTUAL CLASS    Class=Yes  Class=No
Class=Yes       -1         100
Class=No        1          0
Sampling for Imbalance Classes
• Consider a data set containing 100 positive examples and 1,000 negative examples
• Undersampling: use a random sample of 100 negative examples and all positive examples
  – Some useful negative examples may be lost
  – Run undersampling multiple times, use the ensemble of multiple base classifiers
  – Focused undersampling: remove negative samples that are not useful for classification, e.g., those far away from the decision boundary
Oversampling
• Replicate the positive examples until the training set has an equal number of positive and negative examples
• For noisy data, may cause overfitting
Significance Tests
• Are two algorithms different in effectiveness?
  – The null hypothesis: there is NO difference
  – The alternative hypothesis: there is a difference; B is better than A (the baseline method)
• Matched pair experiments: the rankings that are compared are based on the same set of queries for both algorithms
• Possible errors of significance tests
  – Type I: the null hypothesis is rejected when it is true
  – Type II: the null hypothesis is accepted when it is false
• The power of a hypothesis test: the probability that the test will reject the null hypothesis correctly
  – Reducing the type II errors
Procedure of Comparison
• Using a set of data sets
• Procedure
  – Compute the effectiveness measure for every data set
  – Compute a test statistic based on a comparison of the effectiveness measures for each data set
    • E.g., the t-test, the Wilcoxon signed-rank test, and the sign test
  – Compute a P-value: the probability that a test statistic value at least that extreme could be observed if the null hypothesis were true
  – The null hypothesis is rejected if the P-value ≤ α, where α is the significance level, which bounds the probability of a type I error
• One-sided (one-tailed) tests: whether B is better than A (the baseline method)
  – Two-sided tests: whether A and B are different; the P-value is doubled
Distribution of Test Statistics
T-test
• Assuming data values are sampled from normal distributions
  – In a matched pair experiment, assuming the difference between the effectiveness values is a sample from a normal distribution
• The null hypothesis: the mean of the distribution of differences is 0

$$t = \frac{\overline{B - A}}{\sigma_{B-A}}\,\sqrt{N}$$

  – $\overline{B - A}$ is the mean of the differences, $\sigma_{B-A}$ is the standard deviation of the differences

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$$
Example
$$\overline{B - A} = 21.4, \qquad \sigma_{B-A} = 29.1, \qquad t = 2.33$$

P-value = 0.02, significant at a level of α = 0.05: the null hypothesis can be rejected
Issues in T-test
• Data is assumed to be sampled from normal distributions
  – Generally inappropriate for effectiveness measures
  – However, experiments showed that the t-test produces very similar results to the randomization test, which does not assume any distribution (the most powerful nonparametric test)
• T-test assumes that the evaluation data is measured on an interval scale
  – Effectiveness measures are ordinal: the magnitudes of the differences are not significant
  – Use the Wilcoxon signed-rank test and the sign test, which make fewer assumptions about the effectiveness measure, but are less powerful
Wilcoxon Signed-Rank Test
• Assumption: the differences between the effectiveness values can be ranked, but the magnitude is not important

$$w = \sum_{i=1}^{N} R_i$$

  – $R_i$ is a signed rank, N is the number of non-zero differences
• Procedure
  – The differences are sorted by their absolute values in increasing order
  – Differences are assigned rank values (ties are assigned the average rank)
  – The rank values are given the sign of the original difference
• The null hypothesis: the sum of the positive ranks will be the same as the sum of the negative ranks
Example
• The non-zero differences in rank order of absolute value: 2, 9, 10, 24, 25, 25, 41, 60, 70
• The signed ranks: -1, +2, +3, -4, +5.5, +5.5, +7, +8, +9
• w = 35
• P-value = 0.025, significant at a level of α = 0.05: the null hypothesis can be rejected
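The ranking procedure can be sketched as follows. The function name `signed_ranks` is illustrative, and the signs of the differences are inferred from the signed ranks listed in the example (the differences of magnitude 2 and 24 are negative).

```python
def signed_ranks(differences):
    """Rank non-zero differences by absolute value (ties share the average rank),
    then give each rank the sign of its original difference."""
    ordered = sorted((d for d in differences if d != 0), key=abs)
    ranks = []
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1                      # ordered[i:j] is one group of ties
        avg_rank = (i + 1 + j) / 2      # average of ranks i+1 .. j
        ranks.extend(avg_rank if d > 0 else -avg_rank for d in ordered[i:j])
        i = j
    return ranks

differences = [-2, 9, 10, -24, 25, 25, 41, 60, 70]
ranks = signed_ranks(differences)
w = sum(ranks)
print(ranks, w)
```

The two tied differences of 25 share rank 5.5, and the signed ranks sum to w = 35 as in the example. Turning w into a P-value requires the distribution of the statistic under the null hypothesis, which this sketch does not cover.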
Sign Test
• Completely ignore the magnitude of the differences
  – In practice, we may require a 5-10% difference before two values are considered different
• The null hypothesis: P(B > A) = P(A > B) = ½
• Sum up the number of pairs where B > A
Example
• In 7 pairs out of 10, B > A
• P-value = 0.17: the probability that we observe at least 7 successes out of 10 trials where the probability of success is 0.5
• Cannot reject the null hypothesis
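The P-value in this example is a binomial tail probability, which a few lines of Python can reproduce. The function name is an illustrative choice; the one-sided reading (at least 7 successes) is the one that yields 0.17.

```python
from math import comb

def sign_test_p_value(successes, trials):
    """One-sided P-value: P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    favorable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favorable / 2 ** trials

p_value = sign_test_p_value(7, 10)
print(round(p_value, 3))
```

The result is 176/1024, about 0.17, which is above 0.05, so the null hypothesis cannot be rejected.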
Intuition – Bayesian Classification
• More hockey fans in Canada than in US
  – Which country is Tom, a hockey fan, from?
  – Predicting Canada has a better chance to be right
• Prior probability P(Canadian) = 5%: reflects the background knowledge that 5% of the total population is Canadian
• P(hockey fan | Canadian) = 30%: the probability that a Canadian is a hockey fan
• Posterior probability P(Canadian | hockey fan): the probability that a hockey fan is from Canada
Bayes Theorem
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

• Find the maximum a posteriori (MAP) hypothesis

$$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

  – Require background knowledge
  – Computational cost
Naïve Bayes Classifier
• Assumption: attributes are independent
• Given a tuple $(a_1, a_2, \ldots, a_n)$, predict its class as

$$C = \arg\max_{C_i} P(a_1, a_2, \ldots, a_n \mid C_i)\,P(C_i) = \arg\max_{C_i} P(C_i)\prod_{j} P(a_j \mid C_i)$$

  – $\arg\max_x f(x)$: the value of x that maximizes f(x)
  – Example: $\arg\max_{x \in \{1, 2, -3\}} x^2 = -3$
Example: Training Dataset
Data sample X = (Outlook=sunny, Temp=mild, Humid=high, Wind=weak). Will she play tennis?

Training data: the 14-row PlayTennis dataset shown on the Training Dataset slide (9 Yes, 5 No).

P(Yes|X) ∝ P(X|Yes) P(Yes) = (2/9 × 4/9 × 3/9 × 6/9) × 9/14 ≈ 0.014
P(No|X) ∝ P(X|No) P(No) = (3/5 × 2/5 × 4/5 × 2/5) × 5/14 ≈ 0.027

Counting from the table, the score for No is the larger of the two, so naïve Bayes predicts No.
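A minimal sketch, assuming the 14-row PlayTennis table is encoded as tuples (the names `ROWS` and `naive_bayes_score` are hypothetical), recomputes the two unnormalized scores P(X|C) P(C) for the data sample X by counting:

```python
ROWS = [  # (Outlook, Temp, Humid, Wind, PlayTennis)
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def naive_bayes_score(rows, x, label):
    """P(C) * prod_j P(a_j | C), estimated by relative frequencies (no smoothing)."""
    in_class = [r for r in rows if r[-1] == label]
    score = len(in_class) / len(rows)                      # prior P(C)
    for j, value in enumerate(x):
        matches = sum(1 for r in in_class if r[j] == value)
        score *= matches / len(in_class)                   # P(a_j | C)
    return score

x = ("Sunny", "Mild", "High", "Weak")
score_yes = naive_bayes_score(ROWS, x, "Yes")
score_no = naive_bayes_score(ROWS, x, "No")
prediction = "Yes" if score_yes > score_no else "No"
print(round(score_yes, 4), round(score_no, 4), prediction)
```

Counting from the table gives about 0.014 for Yes and about 0.027 for No. The scores are unnormalized; dividing each by their sum would give the posterior probabilities.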
Probability of Infrequent Values
• (outlook = Sunny, temp = high, humid = low, wind = weak)?
• P(humid = low) = 0: the value “low” never occurs in the training dataset, so the product of conditional probabilities is 0 for every class
Smoothing
• Suppose an attribute has n different values: $a_1, \ldots, a_n$
• Assume a small enough value ε > 0
• Let $P_i$ be the frequency of $a_i$: $P_i$ = (# tuples having $a_i$) / (total # of tuples)
• Estimate

$$P(a_i) = \epsilon + (1 - n\epsilon)\,P_i$$
Handling Continuous Attributes
• Discretization
• Probability density estimation
Density Estimation
• Let $\mu_{ij}$ and $\sigma_{ij}^2$ be the mean and variance of attribute $X_i$ over all samples of class $C_j$, respectively

$$P(X_i = x_i \mid C_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}}\exp\left(-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$
Characteristics of Naïve Bayes
• Robust to isolated noise points
  – Such points are averaged out in probability computation
• Insensitive to missing values
• Robust to irrelevant attributes
  – Distributions on such attributes are almost uniform
• Correlated attributes degrade the performance
Bayes Error Rate
• The error rate of the ideal naïve Bayes classifier

$$\mathrm{Err} = \int_{0}^{\hat{x}} P(\mathrm{Crocodile} \mid X)\,dX + \int_{\hat{x}}^{\infty} P(\mathrm{Alligator} \mid X)\,dX$$
Pros and Cons
• Pros
  – Easy to implement
  – Good results obtained in many cases
• Cons
  – A (too) strong assumption: independent attributes
• How to handle dependent/correlated attributes?
  – Bayesian belief networks
Bayesian Networks
• A Bayesian belief network allows a subset of the variables to be conditionally independent
• A graphical model of causal relationships
  – Represents dependency among the variables
  – Gives a specification of joint probability distribution

Example graph: X → Z, Y → Z, Y → P
• Nodes: random variables
• Links: dependency
• X and Y are the parents of Z, and Y is the parent of P
• No dependency between Z and P
• The graph has no loops or cycles
Bayesian Belief Network: Example
[Figure: network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9

$$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid \mathrm{Parents}(Z_i))$$
Training Bayesian Networks
• Given both the network structure and all variables observable: learn only the CPTs
• Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning
• Network structure unknown, all variables observable: search through the model space to reconstruct graph topology
• Unknown structure, all hidden variables: no good algorithms known for this purpose
Associative Classification
• Mine possible association rules (PR) in the form condset → c
  – Condset: a set of attribute-value pairs
  – c: class label
• Build classifier
  – Organize rules according to decreasing precedence based on confidence and support
• Classification
  – Use the first matching rule to classify an unknown case
Associative Classification Methods
• CBA (Classification By Association: Liu, Hsu & Ma, KDD’98)
  – Mine possible association rules in the form of
    • Cond-set (a set of attribute-value pairs) → class label
  – Build classifier: organize rules according to decreasing precedence based on confidence and then support
• CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
  – Classification: statistical analysis on multiple rules
CMAR – Model Generation
• Classification based on Multiple Association Rules
• Efficiency: uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset
• Rule pruning whenever a rule is inserted into the tree
  – Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
  – Prune rules where the rule antecedent and class are not positively correlated, based on a χ² test of statistical significance
CMAR – Classification
• Classification based on generated/pruned rules
• If only one rule satisfies tuple X, assign the class label of the rule
• If a rule set S satisfies X, CMAR
  – Divides S into groups according to class labels
  – Uses a weighted χ² measure to find the strongest group of rules, based on the statistical correlation of rules within a group
  – Assigns X the class label of the strongest group
Classification by Aggregating Emerging Patterns
• Emerging pattern (EP): a pattern frequent in one class of data but infrequent in others
  – Age<=30 is frequent in class “buys_computer=yes” and infrequent in class “buys_computer=no”
  – Rule: age<=30 → buys computer
• G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD’99
How to Mine Emerging Patterns?
• Border differential
  – Max-patterns in D1 w.r.t. min_sup=90%
  – Max-patterns in D2 w.r.t. min_sup=10%
  – X is a pattern covered by a max-pattern in D1 but not by a max-pattern in D2 → X is an emerging pattern
• Method
  – Mine max-patterns in D1 and D2, respectively
  – Compare the two sets of borders, find the “maximal” patterns that are frequent in D1 and infrequent in D2
Instance-based Methods
• Instance-based learning
  – Store training examples and delay the processing until a new instance must be classified (“lazy evaluation”)
• Typical approaches
  – K-nearest neighbor approach
    • Instances represented as points in a Euclidean space
  – Locally weighted regression
    • Construct local approximation
  – Case-based reasoning
    • Use symbolic representations and knowledge-based inference
The K-Nearest Neighbor Method
• Instances are points in an n-D space
• The k-nearest neighbors (KNN) in the Euclidean distance
  – Return the most common value among the k training examples nearest to the query point
• Discrete-/real-valued target functions

[Figure: a query point xq surrounded by + and − training examples]
KNN Methods
• For continuous-valued target functions, return the mean value of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm
  – Give greater weights to closer neighbors

$$w \equiv \frac{1}{d(x_q, x_i)^2}$$

• Robust to noisy data by averaging k-nearest neighbors
• Curse of dimensionality
  – Distance could be dominated by irrelevant attributes
  – Axes stretch or elimination of the least relevant attributes
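Both KNN variants fit in a short sketch. The function names and the toy data are assumptions for illustration; `math.dist` (Python 3.8+) supplies the Euclidean distance, and the weights follow w = 1 / d(xq, xi)².

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """Majority label among the k training points nearest to query."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def knn_regress_weighted(train, query, k=3):
    """Distance-weighted mean of the k nearest target values, with w = 1 / d^2."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    numerator = denominator = 0.0
    for point, value in neighbors:
        d = dist(point, query)
        if d == 0:
            return value                # exact match: avoid dividing by zero
        weight = 1.0 / d ** 2
        numerator += weight * value
        denominator += weight
    return numerator / denominator

# Hypothetical 2-D training points with class labels.
labeled = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
           ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-")]
label = knn_classify(labeled, (1.5, 1.5), k=3)

# Hypothetical 1-D training points with real-valued targets.
valued = [((0.0,), 0.0), ((1.0,), 1.0), ((3.0,), 3.0)]
estimate = knn_regress_weighted(valued, (2.0,), k=2)
print(label, estimate)
```

Nothing is learned ahead of time: all the work happens at query time, which is what makes the method lazy.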
Case-based Reasoning
• Lazy evaluation + analysis of similar instances
• Methodology
  – Instances represented by rich symbolic descriptions (e.g., function graphs)
  – Combine multiple retrieved cases
  – Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Lazy vs. Eager Learning
• Efficiency: lazy learning uses less training time but more predicting time
• Accuracy
  – Lazy method effectively uses a richer hypothesis space
  – Eager: must commit to a single hypothesis that covers the entire instance space
Artificial Neural Networks
• (To some extent) simulating biological neural networks
• Basic mechanisms
  – Perceptrons
  – Multilayer networks
• Essential algorithm: BACKPROPAGATION
Perceptrons
• Input: a vector of real values
• Calculate a linear combination of inputs
  – Output 1 if the result is positive, -1 otherwise

[Figure: inputs x1, …, xn with weights w1, …, wn, plus a constant input x0 = 1 with weight w0, feed a summation unit Σ]

$$o = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$$
Why Perceptrons?
• A perceptron is a hyperplane decision surface
• Perceptrons represent all primitive Boolean functions – AND, OR, …
(Figure: two scatter plots of + and - points, one linearly separable and one linearly inseparable)
AND: w0 = -0.8, w1 = w2 = 0.5
OR: w0 = -0.3, w1 = w2 = 0.5
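The AND and OR weights above can be checked directly. The sketch below assumes the Boolean inputs are encoded as 0 (false) and 1 (true), which the slide does not state, and uses a hypothetical helper named perceptron.

```python
def perceptron(weights, inputs):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1 (x0 = 1 is implicit)."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else -1

AND_WEIGHTS = [-0.8, 0.5, 0.5]  # w0, w1, w2 from the slide
OR_WEIGHTS = [-0.3, 0.5, 0.5]

# Print the truth table: inputs, AND output, OR output (1 = true, -1 = false).
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(AND_WEIGHTS, (x1, x2)), perceptron(OR_WEIGHTS, (x1, x2)))
```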
Jian Pei: Big Data Analytics -- Classification 90
Training Perceptrons
• Begin with random weights
• Iteratively apply the perceptron to each training example, modify the weights whenever it misclassifies an example
– wi ← wi + Δwi
– Δwi = η(t-o)xi
– t: the target output for the current example
– o: the output generated by the perceptron
– η: a positive constant called learning rate
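The training rule can be sketched in a few lines. This is an illustrative version under stated assumptions: the function names, the initial weight range of [-0.5, 0.5], the learning rate of 0.1, and the cap on the number of passes are choices made here, and the AND data set (inputs 0/1, targets -1/+1) is only an example.

```python
import random

def predict(w, x):
    """Perceptron output: 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1

def train_perceptron(examples, eta=0.1, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]  # random initial weights
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(w, x)
            if o != t:
                mistakes += 1
                w[0] += eta * (t - o)  # x0 = 1
                for i, xi in enumerate(x):
                    w[i + 1] += eta * (t - o) * xi  # delta wi = eta (t - o) xi
        if mistakes == 0:  # every example is classified correctly
            break
    return w

AND_EXAMPLES = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights = train_perceptron(AND_EXAMPLES)
print(weights, [predict(weights, x) for x, _ in AND_EXAMPLES])
```

Because AND is linearly separable, the loop stops after a finite number of weight changes; on linearly inseparable data it would run until max_epochs.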
Jian Pei: Big Data Analytics -- Classification 91
Why Does the Training Work?
• The training example is correctly classified
– (t-o) = 0 → Δwi = 0
• The target output is +1 and the perceptron outputs –1
– (t-o) = 2
– If xi > 0, increasing wi will increase the output
– If xi < 0, decreasing wi will increase the output
wi ← wi + Δwi
Δwi = η(t-o)xi
Jian Pei: Big Data Analytics -- Classification 92
The Sigmoid Unit
• Similar to perceptron, except for the output function
• σ: the sigmoid or logistic function
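(Figure: a sigmoid unit with inputs x0 = 1, x1, x2, ..., xn, weights w0, w1, w2, ..., wn, and a summation node)
$$\mathrm{net} = \sum_{i=0}^{n} w_i x_i, \qquad o = \sigma(\mathrm{net}) = \frac{1}{1 + e^{-\mathrm{net}}}$$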
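A sigmoid unit computes net as the weighted sum of its inputs and outputs o = σ(net) = 1 / (1 + e^(-net)). The sketch below uses hypothetical function names; the two-branch form of sigmoid is an implementation choice made here to avoid floating-point overflow for large negative net.

```python
import math

def sigmoid(net):
    """Logistic function 1 / (1 + e^(-net)), written to avoid overflow."""
    if net >= 0:
        return 1.0 / (1.0 + math.exp(-net))
    z = math.exp(net)
    return z / (1.0 + z)

def sigmoid_unit(weights, inputs):
    """weights[0] is w0 (paired with x0 = 1); the rest pair with the inputs."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(net)

print(sigmoid_unit([0.0, 1.0, -1.0], [2.0, 2.0]))  # net = 0
print(sigmoid_unit([-0.8, 0.5, 0.5], [1.0, 1.0]))
```

Unlike the perceptron's hard threshold, the output changes smoothly with net, which is what makes gradient-based training possible.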
Jian Pei: Big Data Analytics -- Classification 93
Multilayer Networks
• Each layer has some perceptrons/sigmoid units
• A unit connects to all units in the neighboring layers
Jian Pei: Big Data Analytics -- Classification 94
Backpropagation Algorithm
• Training a multilayer network
• Create a feed-forward network
• Initialize all network weights to small random numbers (e.g., -0.5 to 0.5)
• Until the termination condition is met, do
– For each training example
  • Propagate the input forward through the network
  • Propagate the errors backward through the network
  • Update each network weight
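The loop above can be sketched for a network with one hidden layer of sigmoid units. Everything specific here is an assumption made for illustration: the squared-error measure, the per-example (stochastic) updates, the fixed number of passes as the termination condition, the learning rate, and all function names. XOR is used as the example because a single perceptron cannot represent it; whether a given run learns it well depends on the seed and the number of passes.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(n_in, n_hidden, n_out, rng):
    """Small random weights in [-0.5, 0.5]; index 0 of each row is the bias weight."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    return hidden, output

def forward(net, x):
    """Propagate the input forward; return hidden and output activations."""
    hidden, output = net
    h = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in hidden]
    o = [sigmoid(w[0] + sum(wi * hi for wi, hi in zip(w[1:], h))) for w in output]
    return h, o

def backprop_step(net, x, t, eta):
    """One forward pass, one backward pass, and one weight update (in place)."""
    hidden, output = net
    h, o = forward(net, x)
    # Output error terms: o (1 - o) (t - o)
    delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Hidden error terms: h (1 - h) * sum over outputs of (weight * output error)
    delta_h = [hj * (1 - hj) * sum(output[k][j + 1] * delta_o[k] for k in range(len(output)))
               for j, hj in enumerate(h)]
    for k, w in enumerate(output):
        w[0] += eta * delta_o[k]
        for j, hj in enumerate(h):
            w[j + 1] += eta * delta_o[k] * hj
    for j, w in enumerate(hidden):
        w[0] += eta * delta_h[j]
        for i, xi in enumerate(x):
            w[i + 1] += eta * delta_h[j] * xi

def total_error(net, examples):
    """Half the sum of squared output errors over all examples."""
    return 0.5 * sum((tk - ok) ** 2 for x, t in examples
                     for tk, ok in zip(t, forward(net, x)[1]))

def train(examples, n_hidden=3, eta=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    net = init_network(len(examples[0][0]), n_hidden, len(examples[0][1]), rng)
    for _ in range(epochs):  # termination: a fixed number of iterations
        for x, t in examples:
            backprop_step(net, x, t, eta)
    return net

XOR = [((0.0, 0.0), (0.0,)), ((0.0, 1.0), (1.0,)), ((1.0, 0.0), (1.0,)), ((1.0, 1.0), (0.0,))]
trained = train(XOR)
print(total_error(trained, XOR))
```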
Jian Pei: Big Data Analytics -- Classification 95
Termination Conditions
• A fixed number of iterations
• Once the error on the training examples is below some threshold
• Once the error on a separate validation set meets some criterion
Jian Pei: Big Data Analytics -- Classification 96
When Are ANNs Good?
• The training data are noisy and complex
– Example: sensor data, image data
• ANNs also apply to problems where more symbolic representations are often used
– On such problems their capability is similar to that of decision trees
Jian Pei: Big Data Analytics -- Classification 97
Appropriate Problems for ANNs
• Many attributes
• Target function may be discrete- or real-valued, or a vector of multiple attributes
• Errors in training examples
• Long training time is acceptable
• Fast classification time is required
• Understandability is unimportant
Jian Pei: Big Data Analytics -- Classification 98
Support Vector Machines (SVM)
(Figure: separating hyperplanes with a small margin and with a large margin; the support vectors are marked)
Jian Pei: Big Data Analytics -- Classification 99
Linear SVM
• Given a set of points $x_i \in \mathbb{R}^n$ with label $y_i \in \{-1, 1\}$
• The SVM finds a hyperplane separating the positive and negative samples
Jian Pei: Big Data Analytics -- Classification 100
Separate Samples by Projection
• For linearly inseparable data, project the data to a higher-dimensional space where it is linearly separable
(Figure: 1-D points at -1, 0, +1 labeled + and -, and their 2-D projections (1,0), (0,0), (0,1))
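A small sketch of the projection idea follows. The feature map phi, the labels (+ for the two outer points, - for the middle one), and the separating line are all assumptions chosen here so that the projected points are (1,0), (0,0), and (0,1); the slide does not spell out its mapping. In 1-D no single threshold separates these labels, but in the projected 2-D space a line does.

```python
def phi(x):
    """Hypothetical feature map: (negative part of x, positive part of x)."""
    return (max(0.0, -x), max(0.0, x))

# (x, label) pairs; the labels are assumed for illustration.
points = [(-1.0, 1), (0.0, -1), (1.0, 1)]

def classify(x, w=(1.0, 1.0), b=-0.5):
    """Linear decision rule in the projected space: sign(w . phi(x) + b)."""
    z = phi(x)
    return 1 if w[0] * z[0] + w[1] * z[1] + b > 0 else -1

for x, label in points:
    print(x, phi(x), label, classify(x))
```

A kernel SVM achieves the same effect without computing phi explicitly; this sketch only shows why the projection helps.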
Jian Pei: Big Data Analytics -- Classification 101
Non-linear SVM
• Learn a hyperplane
• Use quadratic programming techniques
• Using kernels can learn very complex functions
Jian Pei: Big Data Analytics -- Classification 102
Non-linear SVM: An Example
Jian Pei: Big Data Analytics -- Classification 103
Errors in Classification
• Bias: the difference between the real class boundary and the decision boundary of a classification model
• Variance: variability in the training data set
• Intrinsic noise in the target class: the target class can be non-deterministic
– Instances with the same attribute values can have different class labels
Jian Pei: Big Data Analytics -- Classification 104
Bias
Figure from [Tan, Steinbach, Kumar]
Jian Pei: Big Data Analytics -- Classification 105
One or More?
• What if a medical doctor is not sure about a case?
– Joint diagnosis: using a group of doctors with different expertise
– Wisdom of the crowd is often more accurate
• All eager learning methods make predictions using a single classifier induced from training data
– A single classifier may have low confidence in some cases
• Ensemble methods: construct a set of base classifiers and take a vote on predictions in classification
Jian Pei: Big Data Analytics -- Classification 106
Ensemble Classifiers
(Figure: Step 1, create multiple data sets D1, D2, ..., Dt-1, Dt from the original training data D; Step 2, build multiple classifiers C1, C2, ..., Ct-1, Ct; Step 3, combine the classifiers into C*)
C*(x) = Vote(C1(x), …, Ck(x))
Figure from [Tan, Steinbach, Kumar]
Jian Pei: Big Data Analytics -- Classification 107
Why May Ensemble Method Work?
• Suppose there are two classes and each base classifier has an error rate of 35%
• What if we use 25 base classifiers?
– If all base classifiers are identical, the ensemble error rate is still 35%
– If base classifiers are independent, the ensemble makes a wrong prediction only if more than half of the base classifiers are wrong
$$\sum_{i=13}^{25} \binom{25}{i}\, 0.35^{i}\, 0.65^{25-i} = 0.06$$
Jian Pei: Big Data Analytics -- Classification 108
Ensemble Error Rate
Figure from [Tan, Steinbach, Kumar]
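The ensemble error rate for independent base classifiers is a binomial tail: the majority vote is wrong only when more than half of the base classifiers are wrong. A short sketch, with a hypothetical function name and assuming an odd number of classifiers so that ties cannot occur:

```python
from math import comb

def ensemble_error(base_error, n_classifiers=25):
    """Probability that a majority of independent base classifiers is wrong."""
    majority = n_classifiers // 2 + 1
    return sum(comb(n_classifiers, i)
               * base_error ** i
               * (1.0 - base_error) ** (n_classifiers - i)
               for i in range(majority, n_classifiers + 1))

# Sweep the base error rate to see where the ensemble helps.
for eps in (0.2, 0.35, 0.5, 0.6):
    print(eps, round(ensemble_error(eps), 4))
```

The ensemble beats its base classifiers only when the base error rate is below 0.5; above 0.5 the vote makes things worse, which is why each base classifier must do better than random guessing.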
Jian Pei: Big Data Analytics -- Classification 109
Ensemble Classifiers – When?
• The base classifiers should be independent of each other
• Each base classifier should do better than a classifier that performs random guessing
Jian Pei: Big Data Analytics -- Classification 110
How to Construct Ensemble?
• Manipulating the training set: derive multiple training sets and build a base classifier on each
• Manipulating the input features: use only a subset of features in a base classifier
• Manipulating the class labels: if there are many classes, randomly divide the classes into two subsets A and B for each base classifier and train it on the resulting two-class problem; for a test case, if a base classifier predicts subset A, all classes in A receive a vote
• Manipulating the learning algorithm, e.g., using different network configuration in ANN
Jian Pei: Big Data Analytics -- Classification 111
Bootstrap
• Given an original training set T, derive a training set T’ by repeatedly sampling uniformly with replacement
• If T has n tuples and n samples are drawn, each tuple has a probability p = 1 - (1 - 1/n)^n of being selected in T’
– When n → ∞, p → 1 - 1/e ≈ 0.632
• Use the tuples not in T’ as the test set
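A minimal sketch of the bootstrap, assuming that n samples are drawn from a set of n tuples; the function names are hypothetical. It shows the inclusion probability approaching 1 - 1/e and a split into a bootstrap training set and the left-out test set.

```python
import math
import random

def inclusion_probability(n):
    """Chance that a given tuple appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def bootstrap_split(tuples, rng):
    """Return (training sample drawn with replacement, tuples never drawn)."""
    n = len(tuples)
    picked = [rng.randrange(n) for _ in range(n)]
    chosen = set(picked)
    train = [tuples[i] for i in picked]
    test = [tuples[i] for i in range(n) if i not in chosen]
    return train, test

for n in (10, 100, 1000):
    print(n, round(inclusion_probability(n), 4))
print("limit", round(1.0 - 1.0 / math.e, 4))

train, test = bootstrap_split(list(range(20)), random.Random(0))
print(len(train), len(set(train)), len(test))
```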
Jian Pei: Big Data Analytics -- Classification 112
Bagging
• Run bootstrap k times to obtain k base classifiers
• A test instance is assigned to the class that receives the highest number of votes
• Strength: reduce the variance of base classifiers – good for unstable base classifiers
– Unstable classifiers: sensitive to minor perturbations in the training set, e.g., decision trees, associative classifiers, and ANN
• For stable classifiers (e.g., linear discriminant analysis and kNN classifiers), bagging may even degrade the performance since each bootstrap training set contains fewer distinct tuples than the original
• Less overfitting on noisy data
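Bagging can be sketched as a loop over bootstrap samples followed by a majority vote. The base learner below, a single-threshold rule on one numeric feature, is a hypothetical stand-in chosen to keep the example short; any learner that maps a sample to a classifier function could be passed in instead.

```python
import random
from collections import Counter

def learn_stump(sample):
    """Hypothetical base learner: best single-threshold rule on a 1-D feature."""
    best = None
    xs = sorted({x for x, _ in sample})
    candidates = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    for thr in candidates:
        for left, right in ((-1, 1), (1, -1)):
            errors = sum((left if x <= thr else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def bagging_fit(data, learn, k, seed=0):
    """Run bootstrap k times and train one base classifier per sample."""
    rng = random.Random(seed)
    n = len(data)
    return [learn([data[rng.randrange(n)] for _ in range(n)]) for _ in range(k)]

def bagging_predict(classifiers, x):
    """Assign x to the class that receives the highest number of votes."""
    votes = Counter(c(x) for c in classifiers)
    return votes.most_common(1)[0][0]

data = [(float(x), -1 if x <= 5 else 1) for x in range(1, 11)]
ensemble = bagging_fit(data, learn_stump, k=11)
print([bagging_predict(ensemble, x) for x, _ in data])
```

Using an odd k avoids tied votes in the two-class case.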
Jian Pei: Big Data Analytics -- Classification 113
Boosting
• Assign a weight to each training example
– Initially, each example is assigned a weight 1/n
• Weights can be used in one of the following ways
– Weights as a sampling distribution to draw a set of bootstrap samples from the original training set
– Weights used by a base classifier to learn a model biased towards heavier examples
• Adaptively change the weight at the end of each boosting round
– The weight of an example correctly classified decreases
– The weight of an example incorrectly classified increases
• Each round generates a base classifier
Jian Pei: Big Data Analytics -- Classification 114
Critical Design Choices in Boosting
• How are the weights of the training examples updated at the end of each boosting round?
• How are the predictions made by base classifiers combined?
Jian Pei: Big Data Analytics -- Classification 115
AdaBoost
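• Each base classifier carries an importance score related to its error rate
– Error rate: $\varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, I\big(C_i(x_j) \neq y_j\big)$
– wj: weight of training example j; I(p) = 1 if p is true, and 0 otherwise
– Importance score: $\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_i}{\varepsilon_i}\right)$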
Jian Pei: Big Data Analytics -- Classification 116
How Does Importance Score Work?
Jian Pei: Big Data Analytics -- Classification 117
Weight Adjustment in AdaBoost
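• Weight update at the end of boosting round j
$$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}$$
where $Z_j$ is the normalization factor, chosen so that $\sum_i w_i^{(j+1)} = 1$
– If any intermediate rounds generate an error rate more than 50%, the weights are reverted back to 1/n
• The ensemble error rate is bounded
$$e_{\mathrm{ensemble}} \le \prod_i \sqrt{\varepsilon_i (1 - \varepsilon_i)}$$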
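The AdaBoost rounds can be sketched as follows. Several things here are assumptions rather than the slides' method: the base learner is a hypothetical weighted threshold rule on one numeric feature, the weights are kept normalized to sum to 1 so the weighted error is computed as the sum of the weights of misclassified examples, the error is clamped away from 0 to keep the logarithm finite, and the final prediction is the sign of the importance-weighted vote, which is the usual AdaBoost combination rule.

```python
import math

def best_stump(xs, ys, w):
    """Hypothetical base learner: threshold rule with the lowest weighted error."""
    best = None
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    for thr in thresholds:
        for sign in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if (sign if x > thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n  # initially each example has weight 1/n
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = best_stump(xs, ys, w)
        if err > 0.5:  # revert the weights, as on the slide
            w = [1.0 / n] * n
            continue
        err = max(err, 1e-10)  # assumption: clamp to keep the log finite
        alpha = 0.5 * math.log((1.0 - err) / err)  # importance score
        ensemble.append((alpha, thr, sign))
        # exp(-alpha) for correctly classified examples, exp(alpha) otherwise
        w = [wi * math.exp(-alpha if (sign if x > thr else -sign) == y else alpha)
             for wi, x, y in zip(w, xs, ys)]
        z = sum(w)  # normalization factor
        w = [wi / z for wi in w]
    return ensemble, w

def predict(ensemble, x):
    """Sign of the importance-weighted vote of the base classifiers."""
    score = sum(alpha * (sign if x > thr else -sign) for alpha, thr, sign in ensemble)
    return 1 if score > 0 else -1

xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ys = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
ensemble, final_weights = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single threshold rule fits this data set, yet the weighted vote of three of them classifies every training example correctly.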
Jian Pei: Big Data Analytics -- Classification 118
Random Forests
Figure from [Tan, Steinbach, Kumar]
Jian Pei: Big Data Analytics -- Classification 119
Random Forests
• Using decision trees as base classifiers
• Each tree uses only a subset of features in classification
• Bagging can be used to generate training sets for decision trees
Jian Pei: Big Data Analytics -- Classification 120
Forest-RI (Random Input)
• Each tree is built using a random subset of features
• A tree is grown to its entirety without pruning
• The smaller the sets of features used by decision trees, the less correlated the trees
• The larger the sets of features used by decision trees, the more accurate the trees
• Tradeoff: $m = \log_2 d + 1$
– m: number of randomly selected features
– d: number of features in the training set
• If d is too small, it is hard to obtain independent feature sets
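A short sketch of the feature-subset choice in Forest-RI. Rounding log2 d down to an integer is an assumption made here, since the slide gives only m = log2 d + 1, and the function names are hypothetical.

```python
import math
import random

def forest_ri_subset_size(d):
    """m = log2(d) + 1, with log2(d) rounded down (an assumption)."""
    return int(math.log2(d)) + 1

def random_feature_subset(d, rng):
    """Pick m distinct feature indices out of d, uniformly at random."""
    return sorted(rng.sample(range(d), forest_ri_subset_size(d)))

rng = random.Random(0)
for d in (4, 16, 100):
    print(d, forest_ri_subset_size(d), random_feature_subset(d, rng))
```

For d = 4 the subset already contains 3 of the 4 features, which illustrates why independent feature sets are hard to obtain when d is small.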
Jian Pei: Big Data Analytics -- Classification 121
Forest-RC
• When the number of features in the training set is too small, use linear combinations of features to generate new features to build decision trees
• Randomly select a set of features L
• Linearly combine features in L using coefficients generated from a uniform distribution in the range of [-1, 1]
• At each node, m such randomly combined new features are generated, and the best of them is used to split the node
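The feature generation step of Forest-RC can be sketched as below. The function names and the representation of a combination as a list of (feature index, coefficient) pairs are assumptions; selecting the best candidate for a split needs an impurity measure and is left out.

```python
import random

def random_combination(d, size_l, rng):
    """Pick |L| feature indices and a coefficient in [-1, 1] for each."""
    chosen = rng.sample(range(d), size_l)
    return [(j, rng.uniform(-1.0, 1.0)) for j in chosen]

def apply_combination(combo, x):
    """Value of the new feature for instance x (a list of feature values)."""
    return sum(c * x[j] for j, c in combo)

def candidate_features(d, size_l, m, rng):
    """Generate the m randomly combined features considered at one node."""
    return [random_combination(d, size_l, rng) for _ in range(m)]

rng = random.Random(0)
candidates = candidate_features(d=4, size_l=2, m=3, rng=rng)
instance = [2.0, 9.0, 3.0, -1.0]
for combo in candidates:
    print(combo, apply_combination(combo, instance))
```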
Prediction and Time
• “Prediction is very difficult, especially about the future.” -- Niels Bohr (1885 - 1962)
• “An economist is an expert who will know tomorrow why the things he predicted yesterday didn’t happen today.” -- Laurence J. Peter (1919 - 1988)
• “If something anticipated arrives too late it finds us numb, wrung out from waiting, and we feel – nothing at all. The best things arrive on time.” -- Dorothy Gilman, A New Kind of Country, 1978
Jian Pei: Big Data Analytics -- Classification 122
Early Diagnosis
• A retrospective study of the clinical data of infants admitted to a neonatal intensive care unit found that infants who were diagnosed with sepsis had abnormal heartbeat time series patterns in the 24 hours preceding the diagnosis
• Monitoring the heartbeat time series data and classifying the time series data as early as possible may lead to early diagnosis and effective therapy
Jian Pei: Big Data Analytics -- Classification 123
Online Traffic Classification
• By only observing the first five packets of a TCP connection, the application associated with the traffic flow can be classified accurately
• The applications of online traffic can be identified without waiting for the TCP flow to end
Jian Pei: Big Data Analytics -- Classification 124
Objectives in Classification
• Objective function for optimization: quality – how well does the learned model approximate the latent mechanism?
– Accuracy
– Recall
– F-measure
– Many other measures
• Time is not a first-class citizen in the problem setting!
Jian Pei: Big Data Analytics -- Classification 125
Early Classification – Problem
• Data: time series / temporal sequences with class labels
• Task: construct a model that can produce class label prediction as early as possible
– How can we measure the earliness?
– Expected classification time (other measures are possible)
• Constraint: classification quality should be at a satisfactory level
– How can we constrain the quality? An important issue
Jian Pei: Big Data Analytics -- Classification 126
The Price on Time
• In the classical classification problem, time is not considered as a factor
• In early classification, time has a price — a model becomes more and more expensive as it consumes longer and longer prefixes
Jian Pei: Big Data Analytics -- Classification 127
(Figure: the complexity of the model concerned plotted against time, in early classification and in traditional classification)
Example
Jian Pei: Big Data Analytics -- Classification 128
(Figure: example time series from Class Star and Class Diamond, with Feature A and Feature B marked)
Challenges
• How to incorporate time in classification?
• How to avoid overfitting in time?
• How to make a balanced tradeoff between time and quality?
• Simplification
– All time series are normalized and aligned
– All time series have the same length
– For each time moment, we can get a snapshot of the distribution of the time series, in which each time series is a point
Jian Pei: Big Data Analytics -- Classification 129
Example
Jian Pei: Big Data Analytics -- Classification 130
(Figure: finding the earliest classification time along a time axis marked t1, t2, t3)
A Nearest Neighbor Idea
• Exploiting the extreme of local features: using 1NN to classify
• A time series can be used reliably in 1NN classification if its local neighborhood belongs to the same class
– For a time series s, the reverse 1NNs of s use s in classification: s* is a reverse 1NN of s if s is the 1NN of s*
– The local neighborhood of s can be captured by the reverse 1NNs of s
– The local neighborhood of s is pure if all time series in the neighborhood belong to the same class as s does
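The reverse 1NN neighborhood and its purity can be sketched as follows. The sketch assumes equal-length time series compared by Euclidean distance and uses hypothetical function names; a series with no reverse 1NNs counts as pure here, which is a choice made for this example.

```python
import math

def nearest_neighbor(i, series):
    """Index of the 1NN of series[i] under Euclidean distance (excluding itself)."""
    return min((j for j in range(len(series)) if j != i),
               key=lambda j: math.dist(series[i], series[j]))

def reverse_nearest_neighbors(series):
    """Map each index s to the indices s* whose 1NN is s."""
    rnn = {i: [] for i in range(len(series))}
    for i in range(len(series)):
        rnn[nearest_neighbor(i, series)].append(i)
    return rnn

def is_pure(i, labels, rnn):
    """True if every reverse 1NN of series i has the same class as series i."""
    return all(labels[j] == labels[i] for j in rnn[i])

series = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (3.0, 0.0), (1.4, 0.0)]
labels = ["A", "A", "B", "B", "B"]
rnn = reverse_nearest_neighbors(series)
for i in range(len(series)):
    print(i, labels[i], rnn[i], is_pure(i, labels, rnn))
```

To study earliness, the same functions can be applied to prefixes of the series, for example `[s[:t] for s in series]` for a prefix length t.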
Jian Pei: Big Data Analytics -- Classification 131
(Figure: two local neighborhoods of a time series s, one impure and one pure)
Minimum Prediction Length
• The length of the prefix that can be trusted to classify other time series as accurately as the full length
Jian Pei: Big Data Analytics -- Classification 132
(Figure: time series a, b, c, d of full length L = 50 with minimum prediction lengths MPL = 33 and MPL = 41; annotations: NN(s) = a[1,1] (< 33); NN(s) = d[1,41] (>= 41), red class)
Extracting Interpretable Features
• Using features can improve generality of classifiers and help to avoid overfitting
• Features can characterize the latent data generating mechanism
• In some applications, such as medical diagnosis, users want to know not only what (i.e., the class labels), but also why (i.e., the manifesting features explaining the decision process)
Jian Pei: Big Data Analytics -- Classification 133
Shapelets as Features
• A shapelet is a pair (s, δ): a time series segment s and a distance threshold δ
• Optimization objective: best information gain
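A sketch of how a shapelet (s, δ) can be scored. The distance from a time series to the segment is taken here as the smallest Euclidean distance over all windows of the same length, which is a common definition but an assumption with respect to these slides; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def shapelet_distance(series, segment):
    """Smallest Euclidean distance between the segment and any same-length window."""
    k = len(segment)
    return min(math.dist(series[i:i + k], segment)
               for i in range(len(series) - k + 1))

def matches(series, segment, delta):
    """A series matches the shapelet if its distance is within the threshold."""
    return shapelet_distance(series, segment) <= delta

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(dataset, segment, delta):
    """Entropy reduction from splitting the data set by shapelet match."""
    hit = [y for x, y in dataset if matches(x, segment, delta)]
    miss = [y for x, y in dataset if not matches(x, segment, delta)]
    n = len(dataset)
    after = (len(hit) / n) * entropy(hit) + (len(miss) / n) * entropy(miss)
    return entropy([y for _, y in dataset]) - after

dataset = [
    ([0, 0, 1, 2, 1, 0], "star"),
    ([0, 1, 2, 1, 0, 0], "star"),
    ([0, 0, 0, 0, 0, 0], "diamond"),
    ([1, 1, 1, 1, 1, 1], "diamond"),
]
print(information_gain(dataset, segment=[1, 2, 1], delta=0.5))
```

Searching for the best shapelet then means trying candidate segments and thresholds and keeping the pair with the highest gain.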
Jian Pei: Big Data Analytics -- Classification 134
(Figure: shapelet features in the example time series of Class Diamond and Class Star, with Feature A and Feature B marked)
Local Distinctive Shapelets
• Finding shapelets shared by some time series that belong to the same class but not by the other time series
• Finding shapelets that appear early in time series
• Rank features according to their precision, recall, and earliness
– Use a minimum threshold to avoid overfitting
– More sophisticated methods may be used
Jian Pei: Big Data Analytics -- Classification 135
Early Classification Is Accurate
Jian Pei: Big Data Analytics -- Classification 136
(Figure: bar chart of accuracy, on a scale of 0 to 100, on the data sets ECG, GunPoint, CBF, Syn-Con, Wafer, Olive, and Two-Patterns for the methods EDSC-CHE, EDSC-KDE, ECTS, and FULL 1NN)
Earliness
Jian Pei: Big Data Analytics -- Classification 137
(Figure: bar chart of the percentage of average prediction length, on a scale of 0 to 100, on the data sets ECG, GunPoint, CBF, Syn-Con, Wafer, Olive, and Two-Patterns for the methods EDSC-CHE, EDSC-KDE, ECTS, and FULL 1NN)
Features (Gun-Point Data Set)
Jian Pei: Big Data Analytics -- Classification 138
Summary
• Early classification is useful in a few important applications
– Medical and health informatics applications
– Intrusion detection
– Security and safety
– Some of our algorithms are being tested for astronaut tiredness prediction in a spaceship project
• Early prediction may be connected to many frontiers of data mining research
– Subspace feature extraction
– Concise and non-redundant feature representation
• The problem has not been thoroughly investigated
Jian Pei: Big Data Analytics -- Classification 139
Open Problems
• How to balance earliness and classification quality?
– Progressive and interactive early classification?
• How to “rescue” mis-classified samples over time?
– Re-classification?
• Theoretical foundation for early classification?
• Earliness aware data mining?
• A more general model for cost-sensitive learning?
Jian Pei: Big Data Analytics -- Classification 140