ML410C Projects in health informatics – Project and information management Data Mining
ML410C
Projects in health informatics – Project and information management
Data Mining
Last time…
• Why do we need data analysis?
• What is data mining?
• Examples where data mining has been useful
• Data mining and other areas of computer science and mathematics
• Some (basic) data mining tasks
The Knowledge Discovery Process
Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, “From Data Mining to Knowledge Discovery in Databases”, AI Magazine 17(3): 37-54 (1996)
CRISP-DM: CRoss Industry Standard Process for Data Mining
Shearer C., “The CRISP-DM model: the new blueprint for data mining”, Journal of Data Warehousing 5 (2000) 13-22 (see also www.crisp-dm.org)
Today
DAY        DATE        TIME         ROOM   TOPIC
Monday     2013-09-09  10:00-11:45  502    Introduction to data mining
Wednesday  2013-09-11  09:00-10:45  501    Decision trees, rules and forests
Friday     2013-09-13  10:00-11:45  Sal C  Evaluating predictive models and tools for data mining
• What is classification
• Overview of classification methods
• Decision trees
• Forests
Today
Predictive data mining
Our task
• Input: data representing objects that have been assigned labels
• Goal: accurately predict labels for new (previously unseen) objects
An example: email classification
Features (attributes) in columns, examples (observations) in rows:
Ex.  All caps  No. excl. marks  Missing date  No. digits in From:  Image fraction  Spam
e1   yes       0                no            3                    0               yes
e2   yes       3                no            0                    0.2             yes
e3   no        0                no            0                    1               no
e4   no        4                yes           4                    0.5             yes
e5   yes       0                yes           2                    0               no
e6   no        0                no            0                    0               no
Decision tree
Rules
Forests
[Figures: a decision tree, a rule set and a forest for the email data; leaves and rule conclusions are labeled Spam = yes or Spam = no]
Classification
• What is the class of the following e-mail?
– All caps: Yes
– No. excl. marks: 0
– Missing date: Yes
– No. digits in From: 4
– Image fraction: 0.3
Classification
• What is classification?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Classification by Naïve Bayes classifier
• Classification by Nearest Neighbor
• Classification by Bayesian Belief networks
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute
– uses the model for classifying new data
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Classification
• Credit approval
– A bank wants to classify its customers based on whether they are expected to pay back their approved loans
– The history of past customers is used to train the classifier
– The classifier provides rules, which identify potentially
reliable future customers
Why Classification? A motivating application
• Credit approval
– Classification rule:
• If age = “31...40” and income = high
• then credit_rating = excellent
– Future customers
• Paul: age = 35, income = high → excellent credit rating
• John: age = 20, income = medium → fair credit rating
Why Classification? A motivating application
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision
trees, or mathematical formulas
Classification—A Two-Step Process
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test samples is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that
are correctly classified by the model
• Test set is independent of training set, otherwise the accuracy estimate will be over-optimistic and overfitting will go unnoticed
Classification Process (1): Model Construction
Training Data
NAME  LDL     Glu  H/ATTACK
Mike  normal  3    no
Mary  high    8    yes
Bill  normal  7    yes
Jim   high    5    yes
Dave  normal  4    no
Classification Algorithms
Classifier (Model):
IF LDL = ‘high’ OR Glu > 6 mmol/l THEN Heart attack = ‘yes’
Classification Process (2): Use the Model in Prediction
Classifier
Testing Data
NAME     LDL     Glu  H/ATTACK
Tom      normal  5.5  no
Mellisa  low     5    no
George   high    7.5  yes
Joseph   high    5    yes
Accuracy = ?
Unseen Data
(Jeff, high, 7.5)
Heart attack?
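The two questions on this slide can be checked mechanically. The following is a minimal sketch, assuming the rule from the model construction slide (LDL = 'high' OR Glu > 6); the function names `predict` and `accuracy` are illustrative and not part of the lecture material.

```python
def predict(ldl, glu):
    """Rule from the slides: IF LDL = 'high' OR Glu > 6 THEN heart attack = 'yes'."""
    return "yes" if ldl == "high" or glu > 6 else "no"

# Testing data from the slide: (name, LDL, Glu, known label)
test_set = [
    ("Tom", "normal", 5.5, "no"),
    ("Mellisa", "low", 5.0, "no"),
    ("George", "high", 7.5, "yes"),
    ("Joseph", "high", 5.0, "yes"),
]

def accuracy(rows):
    """Fraction of test samples whose predicted label equals the known label."""
    correct = sum(predict(ldl, glu) == label for _, ldl, glu, label in rows)
    return correct / len(rows)

print("accuracy on the testing data:", accuracy(test_set))
print("prediction for (Jeff, high, 7.5):", predict("high", 7.5))
```

With only four test samples the accuracy estimate is very coarse; it illustrates the procedure rather than the quality of the rule.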
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues regarding classification and prediction: Evaluating Classification Methods
• Predictive accuracy
• Speed
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability:
– understanding and insight provided by the model
• Goodness of rules (quality)
– decision tree size
– compactness of classification rules
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Training Dataset
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Example
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit rating?
    excellent: no
    fair: yes
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Samples are partitioned recursively based on selected attributes
– Test (split) attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
– There are no samples left
Algorithm for Decision Tree Induction (pseudocode)
Algorithm GenDecTree(Sample S, Attlist A)
1. Create a node N
2. If all samples are of the same class C then label N with C; terminate;
3. If A is empty then label N with the most common class C in S (majority voting); terminate;
4. Select a ∈ A with the highest information gain; label N with a;
5. For each value v of a:
   a. Grow a branch from N with condition a=v;
   b. Let Sv be the subset of samples in S with a=v;
   c. If Sv is empty then attach a leaf labeled with the most common class in S;
   d. Else attach the node generated by GenDecTree(Sv, A-a)
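The GenDecTree pseudocode can be sketched in Python roughly as follows. This is one possible rendering, not the lecture's reference implementation: the tuple-and-dict tree representation, the `domains` argument (the possible values of each attribute) and all function names are assumptions made for illustration. The data is the buys_computer training set from the earlier slide.

```python
import math
from collections import Counter

COLS = ["age", "income", "student", "credit_rating", "buys_computer"]
ROWS = """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
"""
data = [dict(zip(COLS, line.split())) for line in ROWS.strip().splitlines()]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(samples, attr, target):
    remainder = 0.0
    for v in {s[attr] for s in samples}:
        part = [s[target] for s in samples if s[attr] == v]
        remainder += len(part) / len(samples) * entropy(part)
    return entropy([s[target] for s in samples]) - remainder

def gen_dec_tree(samples, attrs, target, domains):
    """Returns a class label (leaf) or a pair (attribute, {value: subtree})."""
    labels = [s[target] for s in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:          # step 2: all samples in one class
        return labels[0]
    if not attrs:                      # step 3: no attributes left
        return majority
    best = max(attrs, key=lambda a: info_gain(samples, a, target))  # step 4
    rest = [a for a in attrs if a != best]
    branches = {}
    for v in domains[best]:            # step 5: one branch per value
        subset = [s for s in samples if s[best] == v]
        branches[v] = (gen_dec_tree(subset, rest, target, domains)
                       if subset else majority)   # step 5c: empty subset
    return (best, branches)

def classify(tree, sample):
    """Follow the branches matching the sample until a leaf is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[sample[attr]]
    return tree

attrs = COLS[:-1]
domains = {a: sorted({s[a] for s in data}) for a in attrs}
tree = gen_dec_tree(data, attrs, "buys_computer", domains)
print(tree)
```

On this data the sketch splits first on age, then on student (for age <=30) and credit_rating (for age >40). Ties between attributes are broken by list order, which is an arbitrary choice.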
Attribute Selection Measure: Information Gain
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
- where Ci,D denotes the set of tuples in D that belong to class Ci
Expected information (entropy) needed to classify a tuple in D:
\[ Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \]
- where m is the number of classes
Attribute Selection Measure: Information Gain
Information needed (after using A to split D into v partitions) to classify D:
\[ Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times I(D_j) \]
- where I(Dj) is the entropy Info(Dj) of partition Dj
Information gained by branching on attribute A:
\[ Gain(A) = Info(D) - Info_A(D) \]
Attribute Selection: Information Gain
• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”
\[ Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \]
age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971
\[ Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694 \]
\[ Gain(age) = Info(D) - Info_{age}(D) = 0.246 \]
Similarly:
\[ Gain(income) = 0.029, \quad Gain(student) = 0.151, \quad Gain(credit\_rating) = 0.048 \]
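The gains for the buys_computer example can be recomputed directly from the definitions of Info(D), Info_A(D) and Gain(A). The sketch below is illustrative: the function names `info`, `info_after_split` and `gain` are chosen here, not taken from the slides, and "31..40" is written in ASCII.

```python
import math
from collections import Counter

COLUMNS = ["age", "income", "student", "credit_rating", "buys_computer"]
ROWS = """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
"""
data = [dict(zip(COLUMNS, line.split())) for line in ROWS.strip().splitlines()]

def info(labels):
    """Info(D): entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_after_split(rows, attribute, target="buys_computer"):
    """Info_A(D): entropy of each partition, weighted by partition size."""
    total = 0.0
    for value in {r[attribute] for r in rows}:
        part = [r[target] for r in rows if r[attribute] == value]
        total += len(part) / len(rows) * info(part)
    return total

def gain(rows, attribute, target="buys_computer"):
    """Gain(A) = Info(D) - Info_A(D)."""
    labels = [r[target] for r in rows]
    return info(labels) - info_after_split(rows, attribute, target)

for a in COLUMNS[:-1]:
    print(f"Gain({a}) = {gain(data, a):.3f}")
```

The printed values agree with the slide to within rounding of the third decimal; age has the highest gain and is therefore selected as the root test.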
Splitting the samples using age
age? <=30
income  student  credit_rating  buys_computer
high    no       fair           no
high    no       excellent      no
medium  no       fair           no
low     yes      fair           yes
medium  yes      excellent      yes
age? 31…40 (labeled yes)
income  student  credit_rating  buys_computer
high    no       fair           yes
low     yes      excellent      yes
medium  no       excellent      yes
high    yes      fair           yes
age? >40
income  student  credit_rating  buys_computer
medium  no       fair           yes
low     yes      fair           yes
low     yes      excellent      no
medium  yes      fair           yes
medium  no       excellent      no
Gini index
• If a data set D contains examples from n classes, the gini index, gini(D), is defined as
\[ gini(D) = 1 - \sum_{j=1}^{n} p_j^2 \]
- where pj is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index of the split, gini_A(D), is defined as
\[ gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2) \]
Gini index
• Reduction in impurity:
\[ \Delta gini(A) = gini(D) - gini_A(D) \]
• The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node
Gini index (CART, IBM IntelligentMiner)
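Example:
• D has 9 tuples in buys_computer = “yes” and 5 in “no”
\[ gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459 \]
• Suppose that attribute “income” partitions D into 10 records (D1: {low, medium}) and 4 records (D2: {high}).
• Then, with 7 “yes” and 3 “no” tuples in D1 and 2 of each in D2:
\[ gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = \frac{10}{14}(0.42) + \frac{4}{14}(0.5) = 0.443 \]
• The other binary splits give gini{low,high} = 0.458 and gini{medium,high} = 0.450, so {low, medium} versus {high} has the lowest gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values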
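The gini computations for the income attribute can be verified with a few lines of code. This is a minimal sketch; `gini` and `gini_split` are illustrative names, and the (income, buys_computer) pairs are copied from the training dataset slide.

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum over classes of (relative class frequency)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, left_values):
    """gini_A(D) for the binary split: attribute value in left_values vs. the rest."""
    left = [label for value, label in rows if value in left_values]
    right = [label for value, label in rows if value not in left_values]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# (income, buys_computer) for the 14 training tuples, in slide order
rows = [
    ("high", "no"), ("high", "no"), ("high", "yes"), ("medium", "yes"),
    ("low", "yes"), ("low", "no"), ("low", "yes"), ("medium", "no"),
    ("low", "yes"), ("medium", "yes"), ("medium", "yes"), ("medium", "yes"),
    ("high", "yes"), ("medium", "no"),
]

print(f"gini(D) = {gini([label for _, label in rows]):.3f}")
for subset in ({"low", "medium"}, {"low", "high"}, {"medium", "high"}):
    print(sorted(subset), f"{gini_split(rows, subset):.3f}")
```

Enumerating all binary groupings like this is only practical for attributes with few values; the sketch lists the three groupings of income by hand.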
Comparing Attribute Selection Measures
• The two measures, in general, return good results but
– Information gain:
• biased towards multivalued attributes
– Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions and purity in both partitions
Overfitting due to noise
Decision boundary is distorted by noise point
Overfitting due to insufficient samples
Why?
Overfitting due to insufficient samples
Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region
- Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
Occam’s Razor
• Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
• Therefore, one should include model complexity when evaluating a model
“entia non sunt multiplicanda praeter necessitatem”,
which translates to:
“entities should not be multiplied beyond necessity”.
Pros and Cons of decision trees
• Cons
– Cannot handle complicated relationships between features
– Simple decision boundaries
– Problems with lots of missing data
• Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle large number of features
Some well-known decision tree learning implementations
CART Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth
ID3 Quinlan JR (1986) Induction of decision trees. Machine Learning 1:81–106
C4.5 Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann
J48 Implementation of C4.5 in WEKA
Handling missing values
• Remove attributes with missing values
• Remove examples with missing values
• Assume most frequent value
• Assume most frequent value given a class
• Learn the distribution of a given attribute
• Find correlation between attributes
Handling missing values
Example  A1   …  Class
e1       yes     +
e2       no      +
e3       yes     -
e4       ?       -
Split on A1:
yes: e1 (w=1), e3 (w=1), e4 (w=2/3)
no: e2 (w=1), e4 (w=1/3)
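The weights in this example follow from "learning the distribution" of A1: an example with a missing value is sent down every branch, weighted by the fraction of known values that take that branch. A minimal sketch of this idea, assuming a hypothetical helper `distribute` and using `None` to encode the missing value:

```python
from collections import Counter
from fractions import Fraction

# (example, value of A1, class); None marks the missing value of e4
examples = [("e1", "yes", "+"), ("e2", "no", "+"), ("e3", "yes", "-"), ("e4", None, "-")]

def distribute(examples):
    """Assign each example to the branches of A1 with a weight.

    Examples with a known value go to one branch with weight 1.
    Examples with a missing value go to every branch, weighted by the
    relative frequency of that value among the known ones.
    """
    known = Counter(value for _, value, _ in examples if value is not None)
    total = sum(known.values())
    branches = {value: [] for value in known}
    for name, value, _ in examples:
        if value is not None:
            branches[value].append((name, Fraction(1)))
        else:
            for v, count in known.items():
                branches[v].append((name, Fraction(count, total)))
    return branches

branches = distribute(examples)
for value, members in branches.items():
    print(value, [(name, str(w)) for name, w in members])
```

Exact fractions are used here only so that the weights print as 2/3 and 1/3; floating-point weights would work equally well.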
k-nearest neighbor classifiers
[Figure: the neighborhood of a record X for (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
k-nearest neighbor classification
• Given a data record x find its k closest points
– Closeness: ?
• Determine the class of x based on the classes in the neighbor list
– Majority vote
– Weigh the vote according to distance
• e.g., weight factor, w = 1/d^2
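The two voting schemes can be compared on a toy example. The sketch below assumes Euclidean distance as the closeness measure (the slide leaves this open) and adds a small constant to d^2 to avoid division by zero; both choices, the function name `knn_classify` and the data points are illustrative assumptions.

```python
import math
from collections import defaultdict

def knn_classify(x, training, k=3, weighted=True):
    """Classify x from its k nearest training records.

    training is a list of (point, label) pairs. With weighted=True each
    neighbor votes with weight 1/d^2, otherwise every neighbor has one vote.
    """
    neighbors = sorted(training, key=lambda rec: math.dist(x, rec[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = math.dist(x, point)
        votes[label] += 1.0 / (d * d + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# One very close "+" record and two more distant "-" records (made-up data)
training = [((0.1, 0.0), "+"), ((2.0, 0.0), "-"), ((0.0, 2.0), "-"), ((9.0, 9.0), "+")]
x = (0.0, 0.0)

print("majority vote:  ", knn_classify(x, training, k=3, weighted=False))
print("distance-weighted:", knn_classify(x, training, k=3, weighted=True))
```

In this example the majority vote is decided by the two distant "-" records, while distance weighting lets the single close "+" record dominate.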
Characteristics of nearest-neighbor classifiers
• No model building (lazy learners)
– Lazy learners: computational time in classification
– Eager learners: computational time in model building
• Decision trees try to find global models, k-NN takes into account local information
• k-NN classifiers depend a lot on the choice of proximity measure
Condorcet’s jury theorem
• If each member of a jury is more likely to be right than wrong,
• then the majority of the jury, too, is more likely to be right than wrong
• and the probability that the right outcome is supported by a majority of the jury is a (swiftly) increasing function of the size of the jury,
• converging to 1 as the size of the jury tends to infinity
Condorcet, 1785
[Figure: Condorcet’s jury theorem; probability of error (y-axis, 0 to 0.3) plotted against the number of jury members (x-axis, 0 to 20)]
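The theorem can be illustrated numerically with the binomial distribution. The sketch below assumes that jury members vote independently and that each is right with the same probability p; the value p = 0.7 and the function name `majority_correct` are illustrative choices, not taken from the slides.

```python
from math import comb

def majority_correct(p, n):
    """Probability that a strict majority of n independent voters is right,
    when each voter is right with probability p."""
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

p = 0.7  # assumed individual accuracy
for n in (1, 3, 5, 11, 21):
    print(f"n = {n:2d}  probability of error = {1 - majority_correct(p, n):.4f}")
```

Odd jury sizes are used so that ties cannot occur. The independence assumption is the critical one: classifiers in a forest are trained on related data, so their errors are correlated and the gain in practice is smaller than this idealized calculation suggests.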
Random forests
Random forests (Breiman 2001) are generated by
combining two techniques:
• bagging (Breiman 1996)
• the random subspace method (Ho 1998)
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001
L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996
T. K. Ho. The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832-844, 1998
Bagging
A bootstrap replicate E’ of a set of examples E is created by randomly selecting n = |E| examples from E with replacement.
Ex. Other Bar Fri/Sat Hungry Guests Wait
e1 yes no no yes some yes
e2 yes no no yes full no
e3 no yes no no some yes
e4 yes no yes yes full yes
e5 yes no yes no none no
e6 no yes no yes some yes
Ex. Other Bar Fri/Sat Hungry Guests Wait
e2 yes no no yes full no
e2 yes no no yes full no
e3 no yes no no some yes
e4 yes no yes yes full yes
e4 yes no yes yes full yes
e6 no yes no yes some yes
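Creating a bootstrap replicate takes only a few lines. The sketch below is illustrative: the function name `bootstrap_replicate` and the fixed seed are assumptions, and the replicate it draws will generally differ from the one shown in the table above.

```python
import random

def bootstrap_replicate(examples, rng):
    """Draw n = |E| examples from E uniformly at random, with replacement."""
    n = len(examples)
    return [rng.choice(examples) for _ in range(n)]

E = ["e1", "e2", "e3", "e4", "e5", "e6"]
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
replicate = bootstrap_replicate(E, rng)

print("replicate:", sorted(replicate))
print("left out: ", sorted(set(E) - set(replicate)))
```

Because sampling is with replacement, some examples typically appear more than once and others not at all, as with e1 and e5 in the table. In bagging, one model is trained on each replicate and their predictions are combined by voting.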
Forests
• What is classification
• Overview of classification methods
• Decision trees
• Forests
Today
Next time
DAY        DATE        TIME         ROOM   TOPIC
Monday     2013-09-09  10:00-11:45  502    Introduction to data mining
Wednesday  2013-09-11  09:00-10:45  501    Decision trees, rules and forests
Friday     2013-09-13  10:00-11:45  Sal C  Evaluating predictive models and tools for data mining