Sunita Sarawagi, [email protected]
Data mining and Machine Learning
Data Mining
Data mining is the process of semi-automatically analyzing large databases to find useful patterns.
Prediction based on past history:
- Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history
- Predict whether a pattern of phone calling card usage is likely to be fraudulent
Some examples of prediction mechanisms:
- Classification: given a new item whose class is unknown, predict to which class it belongs
- Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value
Data Mining (Cont.)
Descriptive patterns:
- Associations: find books that are often bought by "similar" customers. If a new such customer buys one such book, suggest the others too. Associations may be used as a first step in detecting causation, e.g. an association between exposure to chemical X and cancer.
- Clusters: e.g. typhoid cases were clustered in an area surrounding a contaminated well. Detection of clusters remains important in detecting epidemics.
Data mining
Data: of various shapes and sizes. Patterns/models: of various shapes and sizes; an abstraction of the data into some understandable and useful form.
Basic structure of data:
- A set of instances/objects/cases/rows/points/examples
- Each instance: a fixed set of attributes/dimensions/columns, either continuous or categorical
Patterns:
- Express one attribute as a function of another: classification, regression
- Group together related instances: clustering, projection, factorization, itemset mining
Classification
Given old data about customers and payments, predict new applicant’s loan eligibility.
[Figure: the classification pipeline. Labeled data about previous customers (age, salary, profession, location, customer type) is fed to a classifier during training, producing a model in the form of decision rules such as "Salary > 5 L" and "Prof. = Exec" predicting Good/bad. During deployment, the model assigns a class label to a new customer's unlabeled data.]
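As a concrete illustration of the train-then-deploy loop in the figure, here is a minimal sketch assuming scikit-learn is available; the customer rows, salary figures, and is_exec encoding are made up for the example, not taken from the slides:

```python
# Minimal sketch of training on labeled data, then deploying on unlabeled data.
# Features: [age, salary, is_exec]; labels: "good"/"bad" (hypothetical data).
from sklearn.tree import DecisionTreeClassifier

X_train = [[35, 700000, 1],   # previous customers (labeled data)
           [50, 300000, 0],
           [28, 800000, 1],
           [40, 200000, 0]]
y_train = ["good", "bad", "good", "bad"]

clf = DecisionTreeClassifier()   # training phase: learn decision rules
clf.fit(X_train, y_train)

x_new = [[30, 600000, 1]]        # deployment: a new customer's unlabeled data
print(clf.predict(x_new))        # -> predicted class label
```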
Applications
- Ad placement in search engines
- Book recommendation
- Citation databases: Google Scholar, CiteSeer
- Resume organization and job matching
- Retail data mining
- Banking: loan/credit card approval; predict good customers based on old customers
- Customer relationship management: identify those who are likely to leave for a competitor
- Targeted marketing: identify likely responders to promotions
- Machine translation
- Speech and handwriting recognition
- Fraud detection: telecommunications, financial transactions; identify fraudulent events from an online stream of events
Applications (continued)
- Medicine: disease outcome, effectiveness of treatments; analyze patient disease history to find relationships between diseases
- Molecular/pharmaceutical: identify new drugs
- Scientific data analysis: identify new galaxies by searching for sub-clusters
- Image and vision: object recognition from images, removing noise from images, identifying scene breaks
The KDD process
- Problem formulation
- Data collection: subset the data (sampling might hurt if the data is highly skewed); feature selection (principal component analysis, heuristic search)
- Pre-processing: cleaning (name/address cleaning, reconciling terms with the same meaning (annual, yearly), duplicate removal, supplying missing values)
- Transformation: map complex objects, e.g. time series data, to features, e.g. frequency
- Choosing the mining task and mining method
- Result evaluation and visualization
Knowledge discovery is an iterative process.
Mining products
[Figure: architecture of a typical mining product. Data is extracted from a data warehouse via ODBC into preprocessing utilities (sampling, attribute transformation), then fed to the mining operations (scalable algorithms for association, classification, clustering, sequence mining), with visualization tools on top.]
Commercial tools: SAS Enterprise Miner, SPSS, IBM Intelligent Miner, Microsoft SQL Server Data Mining services, Oracle Data Mining (ODM).
Free: Weka, individual algorithms.
Mining operations
- Classification: regression, classification trees, neural networks, Bayesian learning, nearest neighbour, radial basis functions, support vector machines, meta learning methods (bagging, boosting)
- Clustering: hierarchical, EM, density based
- Sequence mining: time series similarity, temporal patterns
- Itemset mining: association rules, causality
- Sequential classification: graphical models, hidden Markov models
Classification methods
Goal: predict class Ci = f(x1, x2, ..., xn).
- Regression: linear or any other polynomial
- Decision tree classifier: divides the decision space into piecewise constant regions
- Neural networks: partition by non-linear boundaries
- Probabilistic/generative models
- Lazy learning methods: nearest neighbour
- Support vector machines: find the boundary that maximally separates the classes
Decision tree learning
Decision tree classifiers:
- Widely used learning method
- Easy to interpret: can be re-represented as if-then-else rules
- Approximate the function by piecewise constant regions
- Do not require any prior knowledge of the data distribution; work well on noisy data
Have been applied to: classifying medical patients by disease, equipment malfunction by cause, loan applicants by likelihood of payment, and lots and lots of other applications.
A decision tree is a tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
Decision trees
[Figure: an example decision tree. The root tests Salary < 1 M; one branch tests Prof = teaching, the other tests Age < 30; the leaves are labeled Good or Bad.]
Training Dataset
age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
This follows an example from Quinlan’s ID3
Output: A Decision Tree for “buys_computer”
[Figure: decision tree for "buys_computer". Root: age?. Branch <=30 leads to student? (no → no, yes → yes). Branch 30..40 leads directly to yes. Branch >40 leads to credit rating? (excellent → no, fair → yes).]
Weather Data: Play or not Play?
Outlook Temperature Humidity Windy Play?
sunny hot high false No
sunny hot high true No
overcast hot high false Yes
rain mild high false Yes
rain cool normal false Yes
rain cool normal true No
overcast cool normal true Yes
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No
Note: Outlook is the weather forecast; no relation to the Microsoft email program.
Example Tree for "Play?"
[Figure: root Outlook. Branch sunny leads to Humidity (high → No, normal → Yes). Branch overcast leads to Yes. Branch rain leads to Windy (true → No, false → Yes).]
Topics to be covered
Tree construction:
- Basic tree learning algorithm
- Measures of predictive ability
- High performance decision tree construction: SPRINT
Tree pruning:
- Why prune
- Methods of pruning
Other issues:
- Handling missing data
- Continuous class labels
- Effect of training size

Tree learning algorithms: ID3 (Quinlan 1986), its successor C4.5 (Quinlan 1993), CART, SLIQ (Mehta et al.), SPRINT (Shafer et al.)
Basic algorithm for tree building
Greedy top-down construction:

Gen_Tree(node, data)
  if node should be made a leaf: stop
  find the best attribute and the best split on that attribute
  partition data on the split condition
  for each child j of node: Gen_Tree(node_j, data_j)
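A runnable Python sketch of this greedy loop, under simplifying assumptions (attributes stored in dicts of strings, binary equality splits, entropy as the impurity measure, majority-vote leaves); a teaching simplification, not the full algorithm from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gen_tree(data):                      # data: list of (attribute_dict, label)
    labels = [y for _, y in data]
    if len(set(labels)) == 1:            # make node a leaf? yes -> stop
        return labels[0]
    best = None
    for attr in data[0][0]:              # find best attribute and best split
        for val in {x[attr] for x, _ in data}:
            parts = [[(x, y) for x, y in data if (x[attr] == val) == side]
                     for side in (True, False)]
            if not parts[0] or not parts[1]:
                continue
            score = sum(len(p) * entropy([y for _, y in p]) for p in parts) / len(data)
            if best is None or score < best[0]:
                best = (score, attr, val, parts)
    if best is None:                     # no attribute separates -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    _, attr, val, (eq, neq) = best
    # partition data on the split condition and recurse on each child
    return {attr + "==" + str(val): gen_tree(eq),
            attr + "!=" + str(val): gen_tree(neq)}
```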
Split criteria
Select the attribute that is best for classification. Intuitively, pick the one that best separates instances of different classes.
Quantifying this intuition requires a measure of the impurity of an arbitrary set S consisting of K classes. A good measure is:
- smallest when S consists of only one class,
- highest when all classes are present in equal number,
- and should allow computation in multiple stages.
Measures of impurity
Entropy: $Entropy(S) = -\sum_{i=1}^{k} p_i \log p_i$
Gini: $Gini(S) = 1 - \sum_{i=1}^{k} p_i^2$
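Both measures are a few lines of code; a sketch assuming the class probabilities p_1..p_k are already computed:

```python
import math

def entropy(p):                    # p: list of class probabilities summing to 1
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)   # 0*log 0 = 0

def gini(p):
    return 1.0 - sum(pi * pi for pi in p)

print(entropy([0.5, 0.5]))  # 1.0: highest when classes are in equal number
print(gini([1.0]))          # 0.0: smallest when only one class is present
```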
Information gain
Information gain on partitioning S into r subsets = impurity of S minus the sum of the weighted impurities of the subsets:
$Gain(S, S_1 \ldots S_r) = Entropy(S) - \sum_{j=1}^{r} \frac{|S_j|}{|S|}\, Entropy(S_j)$
[Figure: impurity as a function of p1 for a two-class set: entropy peaks at 1 when p1 = 0.5; Gini peaks at 0.5 when p1 = 0.5.]
Information gain: example
K = 2, |S| = 100, p1 = 0.6, p2 = 0.4
E(S) = -0.6 log(0.6) - 0.4 log(0.4) = 0.29
Split into S1, S2:
|S1| = 70, p1 = 0.8, p2 = 0.2, so E(S1) = -0.8 log(0.8) - 0.2 log(0.2) = 0.21
|S2| = 30, p1 = 0.13, p2 = 0.87, so E(S2) = -0.13 log(0.13) - 0.87 log(0.87) = 0.16
Information gain: E(S) - (0.7 E(S1) + 0.3 E(S2)) = 0.1
(Logarithms here are base 10.)
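The slide's numbers can be reproduced directly; they come out right only with base-10 logarithms, and the slide truncates rather than rounds:

```python
import math

def entropy10(ps):                      # the slide uses base-10 logs here
    return -sum(p * math.log10(p) for p in ps if p > 0)

E_S  = entropy10([0.6, 0.4])            # 0.292 (slide: 0.29)
E_S1 = entropy10([0.8, 0.2])            # 0.217 (slide: 0.21)
E_S2 = entropy10([0.13, 0.87])          # 0.168 (slide: 0.16)
gain = E_S - (0.7 * E_S1 + 0.3 * E_S2)  # 0.090 (slide: ~0.1)
print(E_S, E_S1, E_S2, gain)
```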
Weather Data: Play or not Play? (the same table as above, repeated)
Which attribute to select?
witten&eibe
Example: attribute "Outlook"
"Outlook" = "Sunny": info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits
"Outlook" = "Overcast": info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits
"Outlook" = "Rainy": info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits
(Note: log(0) is not defined, but we evaluate 0 log(0) as zero.)
Expected information for the attribute:
info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
witten&eibe
Computing the information gain
Information gain = (information before split) - (information after split):
gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits
Information gain for the attributes from the weather data:
gain("Outlook") = 0.247 bits
gain("Temperature") = 0.029 bits
gain("Humidity") = 0.152 bits
gain("Windy") = 0.048 bits
witten&eibe
Continuing to split (within the "Sunny" branch):
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits
So Humidity is chosen next.
witten&eibe
The final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes. Splitting stops when the data cannot be split any further.
witten&eibe
Preventing overfitting
A tree T overfits if there is another tree T' that gives higher error on the training data yet lower error on unseen data. An overfitted tree does not generalize to unseen instances. Overfitting happens when the data contains noise or irrelevant attributes and the training size is small. It can reduce accuracy drastically: by 10-25%, as reported in Mingers' 1989 Machine Learning article.
Example of over-fitting with binary data. Compare error rates measured on the learning data and on a large test set:
- The learning-set error R(T) always decreases as the tree grows (Q: why?)
- The test-set error Rts(T) first declines, then increases (Q: why?)
- Overfitting is the result of too much reliance on the learning-set error R(T)
- It can lead to disasters when the tree is applied to new data
Training data vs. test data error rates (digit recognition dataset, from the CART book):

No. terminal nodes   R(T)   Rts(T)
71                   .00    .42
63                   .00    .40
58                   .03    .39
40                   .10    .32
34                   .12    .32
19                   .20    .31
**10                 .29    .30
 9                   .32    .34
 7                   .41    .47
 6                   .46    .54
 5                   .53    .61
 2                   .75    .82
 1                   .86    .91

(** marks the tree size with the lowest test error.)
Overfitting example
Consider the case where a single attribute xj is adequate for classification, but with an error of 20%. Now add lots of other noise attributes that enable zero error during training. During testing, this detailed tree will have an expected error of (0.8 × 0.2 + 0.2 × 0.8) = 32%, whereas the pruned tree with only a single split on xj will have an error of only 20%.
Approaches to prevent overfitting
Two approaches:
1. Stop growing the tree beyond a certain point. Tricky, since even when the information gain of an attribute is zero it might still be useful (XOR example).
2. First over-fit, then post-prune (more widely used). Tree building is divided into a growth phase and a prune phase.
Criteria for finding the correct final tree size, three options:
- Cross validation with separate test data
- Statistical bounds: use all data for training but apply a statistical test to decide the right size (a cross-validation dataset may be used to set the threshold)
- Use some criterion function to choose the best size, e.g. the minimum description length (MDL) criterion
Cross validation
Partition the dataset into two disjoint parts:
1. A training set used for building the tree.
2. A validation set used for pruning the tree. Rule of thumb: 2/3rds training, 1/3rd validation.
Evaluate the tree on the validation set and, at each leaf and internal node, keep a count of the correctly labeled data. Starting bottom-up, prune nodes whose error is less than that of their children.
What if the training data set size is limited? Use n-fold cross validation: partition the training data into n parts D1, D2, ..., Dn. Train n classifiers, using D - Di as training data and Di as the test set. Pick the average. (How?)
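A minimal sketch of the n-fold scheme; `train` and `error` stand for whatever classifier-specific routines are in use (hypothetical names, not a library API):

```python
# n-fold cross validation: train on D - Di, evaluate on Di, average the errors.
def cross_validate(data, n, train, error):
    folds = [data[i::n] for i in range(n)]          # partition into D1..Dn
    errs = []
    for i in range(n):
        held_out = folds[i]                         # Di as the test set
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(training)                     # train on D - Di
        errs.append(error(model, held_out))
    return sum(errs) / n                            # "pick the average"
```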
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = "<=30" AND credit_rating = "fair" THEN buys_computer = "no"
Rule-based pruning
Tree-based pruning limits the kind of pruning possible: if a node is pruned, all subtrees under it have to be pruned. Rule-based pruning instead:
- For each leaf of the tree, extract a rule using a conjunction of all tests up to the root.
- On the validation set, independently prune tests from each rule to get the highest accuracy for that rule.
- Sort the rules by decreasing accuracy.
Regression trees
Decision trees with continuous class labels: regression trees approximate the function with piecewise constant regions.
Split criteria for regression trees:
- Predicted value for a set S = the average of all values in S
- Error: the sum of squared errors of each member of S from the predicted average
- Pick the split with the smallest average error
Splits on categorical attributes: can this be done better than for discrete class labels? Homework.
Other types of trees
- Multi-way trees on low-cardinality categorical data
- Multiple splits on continuous attributes [Fayyad 93, Multi-interval discretization of continuous attributes]
- Multi-attribute tests on nodes to handle correlated attributes: multivariate linear splits [Oblique trees, Murthy 94]
Issues
- Methods of handling missing values: assume the majority value, or take the most probable path
- Allowing varying costs for different attributes
Pros and Cons of decision trees
• Cons
– Not effective for very high dimensional data, where information about the class is spread in small amounts over many correlated features (example: words in text classification)
– Not robust to the dropping of important features, even when correlated substitutes exist in the data
• Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Intuitive
From Jiawei Han's slides
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function can be discrete- or real-valued. For a discrete-valued target, k-NN returns the most common value among the k training examples nearest to xq.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: a query point xq among positive (+) and negative (−) training points, and the Voronoi cells they induce.]
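A minimal k-NN sketch for a discrete-valued target, using Euclidean distance and a majority vote as described above:

```python
import math
from collections import Counter

def knn_predict(train, xq, k=3):
    # train: list of (point, label); point: tuple of numbers
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(train, key=lambda xy: dist(xy[0], xq))[:k]   # k closest
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict([((0, 0), "-"), ((1, 0), "-"), ((5, 5), "+"), ((6, 5), "+")],
                  (5, 4), k=3))   # -> "+"
```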
Other lazy learning methods
- Locally weighted regression: learn a new regression equation by weighting each training instance based on its distance from the new instance
- Radial basis functions
• Cons
– Slow during application
– No feature selection
– Notion of proximity is vague
• Pros
+ Fast training
Bayesian learning
Assume a probability model on the generation of the data. Apply Bayes' theorem to find the most likely class:
$c = \arg\max_{c_j} p(c_j \mid d) = \arg\max_{c_j} \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$
Naïve Bayes: assume the attributes are conditionally independent given the class value:
$c = \arg\max_{c_j} \frac{p(c_j)}{p(d)} \prod_{i=1}^{n} p(a_i \mid c_j)$
- Probabilities are easy to learn by counting, in one pass over the data
- Useful in some domains, e.g. text
- Numeric attributes must be discretized
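The one-pass counting view of naïve Bayes can be sketched directly; this toy version uses raw relative frequencies with no smoothing, so unseen attribute values get probability zero:

```python
from collections import Counter, defaultdict

def train_nb(rows):  # rows: list of (attribute_tuple, class_label)
    class_counts = Counter(c for _, c in rows)
    attr_counts = defaultdict(Counter)        # (class, position) -> value counts
    for attrs, c in rows:                     # one pass over the data
        for i, a in enumerate(attrs):
            attr_counts[(c, i)][a] += 1
    n = len(rows)
    def predict(attrs):
        def score(c):
            p = class_counts[c] / n           # p(c_j)
            for i, a in enumerate(attrs):     # prod_i p(a_i | c_j), no smoothing
                p *= attr_counts[(c, i)][a] / class_counts[c]
            return p
        return max(class_counts, key=score)
    return predict

predict = train_nb([(("sunny", "high"), "No"), (("sunny", "normal"), "Yes"),
                    (("overcast", "high"), "Yes"), (("rain", "normal"), "Yes")])
print(predict(("sunny", "high")))   # -> "No" (tiny toy example)
```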
Bayesian belief networks
- Find the joint probability over a set of variables, making use of conditional independence whenever known
- Learning the parameters is hard when there are hidden units: use gradient descent / EM algorithms
- Learning the structure of the network is harder
[Figure: an example network over variables a, b, c, d, e with a conditional probability table; variable e is independent of d given b.]
Neural networks
Useful for learning complex data like handwriting, speech and image recognition.
[Figure: decision boundaries learned by linear regression, by a classification tree (axis-parallel, piecewise constant), and by a neural network (smooth non-linear).]
Pros and Cons of Neural Networks
• Cons
– Slow training time
– Hard to interpret
– Hard to implement: trial and error for choosing the number of nodes
• Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features
Conclusion: use neural nets only if decision trees / nearest neighbour fail.
Linear discriminants
Problem setting: a binary classification problem with points in d dimensions, and training data of n vectors with predictions, of the form (x1, y1), ..., (xn, yn), where each y takes the value 1 or -1. The goal is to learn a function of the form:
F(x) = w·x + b = w1x1 + w2x2 + ... + wdxd + b
Linear regression
Developed for the case of real-valued y: y = f(x) = w·x + w0. Rewrite as y = w·x with each x padded with a 1.
Error: $E(w) = \sum_i (y_i - w \cdot x_i)^2$
Minimize the error by differentiating with respect to w. The minimum is reached at w = (X'X)^{-1} X'Y.
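The closed form is one line with numpy; a sketch on a tiny made-up dataset where y = 1 + 2x:

```python
# Least squares via the normal equations: w = (X'X)^{-1} X'Y.
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # each x padded with a 1
Y = np.array([3.0, 5.0, 7.0])                        # y = 1 + 2x
w = np.linalg.inv(X.T @ X) @ X.T @ Y
print(w)   # ~[1. 2.], i.e. w0 = 1, w1 = 2
```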
Fisher's linear discriminant
Find the hyperplane (w, b) on which the projection of the data is maximally separated.
Cost function (separation): $\frac{(\mu_1 - \mu_2)^2}{p_1 \sigma_1^2 + p_2 \sigma_2^2}$, where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the projected points w·x + b for all points x in class i, and $p_i$ is the fraction of points in class i.
The linear discriminant maximizing this separation is given by w = S^{-1}(m1 - m2), where m1 and m2 are the means of the x values in each class and S is the covariance matrix of the data; the threshold b is set at the projection of the mid-point (m1 + m2)/2 of the two means onto the linear discriminant.
Fisher's discriminant
[Figure: data projected onto two candidate directions; the chosen direction maximizes the separation between the averages of the projected red and black points.]
Shortcomings
The perceptron is ill-posed: several values of w might yield the same zero error on the training data.
Support vector machines
Binary classifiers that find the hyperplane providing the maximum margin between the vectors of the two classes.
[Figure: two classes of points in the (fi, fj) plane, separated by a maximum-margin hyperplane.]
Support vector machines
Separators with larger margin will have smaller generalization error
Geometry of SVMs
[Figure: for the separating hyperplane w·x + b = 0 in the (fi, fj) plane: the distance of the plane from the origin is -b/||w||, the distance of a point x from the plane is (w·x + b)/||w||, and the margin on each side of the plane is 1/||w|| when |w·x + b| = 1 at the support vectors.]
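The quantities in the figure, computed for a concrete (made-up) hyperplane, assuming numpy:

```python
import numpy as np

w = np.array([3.0, 4.0]); b = -5.0          # hyperplane w.x + b = 0, ||w|| = 5
x = np.array([2.0, 1.0])

print(abs(b) / np.linalg.norm(w))           # distance of the plane from the origin
print((w @ x + b) / np.linalg.norm(w))      # signed distance of x from the plane
print(1.0 / np.linalg.norm(w))              # margin per side when |w.x + b| = 1
```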
Linear separators
Most complex real-world applications require more than linear separators. One way to get around the problem is to represent the data in a transformed coordinate space in which linear separators can be learnt.
Example: f(m1, m2, r) = C m1 m2 / r^2 is not linear in (m1, m2, r), but it is linear in (ln m1, ln m2, ln r), since ln f = ln C + ln m1 + ln m2 - 2 ln r.
Support Vector Machines
Extendable to:
- Non-separable problems (Cortes & Vapnik, 1995)
- Non-linear classifiers (Boser et al., 1992)
Good generalization performance: OCR (Boser et al.), vision (Poggio et al.), text classification (Joachims).
Requires tuning: which kernel, what parameters?
Several freely available packages: SVMTorch.
Locally Weighted Regression
Learn a new regression equation by weighting each training instance based on its distance from the new instance:
- Construct an explicit approximation to f over a local region surrounding the query instance xq.
- The target function f is approximated near xq using the linear function:
$\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$
- Instead of the global squared error $E(D) = \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2$, find the wi so as to minimize a local squared error with a distance-decreasing weight K:
$E(x_q) = \frac{1}{2} \sum_{x \,\in\, k \text{ nearest neighbors of } x_q} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))$
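A numpy sketch of the idea; the Gaussian kernel used for K is an illustrative choice (the slide only requires a distance-decreasing weight), and all points are weighted rather than only the k nearest:

```python
import numpy as np

def lwr_predict(X, y, xq, tau=1.0):
    Xp = np.hstack([np.ones((len(X), 1)), X])            # pad each x with a 1
    xqp = np.hstack([1.0, xq])
    d2 = ((X - xq) ** 2).sum(axis=1)                     # squared distances to xq
    K = np.exp(-d2 / (2 * tau ** 2))                     # distance-decreasing weights
    W = np.diag(K)
    w = np.linalg.solve(Xp.T @ W @ Xp, Xp.T @ W @ y)     # weighted least squares
    return xqp @ w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])                       # y = x^2
print(lwr_predict(X, y, np.array([1.5]), tau=0.5))       # a locally linear fit
```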
Feature subset selection
- Embedded: inside the mining algorithm
- Filter: select features in advance
- Wrapper: generate candidate feature subsets and test them on the black-box mining algorithm; high cost, but provides better, algorithm-dependent features
Meta learning methods
No single classifier is good in all cases, and it is difficult to evaluate the conditions in advance. Meta learning combines the effects of multiple classifiers:
- Voting: sum up the votes of the component classifiers
- Combiners: learn a new classifier on the outcomes of the previous ones
- Boosting: staged classifiers
Disadvantage: interpretation is hard. Knowledge probing: learn a single classifier to mimic the meta classifier.
Clustering or Unsupervised learning
Applications:
- Customer segmentation, e.g. for targeted marketing: group/cluster existing customers based on the time series of their payment history such that similar customers fall in the same cluster; identify micro-markets and develop policies for each
- Collaborative filtering: group based on common items purchased
- Image tiling
- Text clustering, e.g. scatter/gather
- Compression
Distance functions
- Numeric data: Euclidean, Manhattan distances. Minkowski metric: $\left[\sum_i |x_i - y_i|^m\right]^{1/m}$; larger m gives higher weight to larger distances
- Categorical data: 0/1 to indicate presence/absence. Euclidean distance: equal weightage to 1-1 and 0-0 matches. Hamming distance (number of dissimilarities). Jaccard coefficient: number of 1-1 matches / number of positions with at least one 1 (0-0 matches are not important)
- Data-dependent measures: the similarity of A and B depends on their co-occurrence with C
- Combined numeric and categorical data: weighted normalized distance
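The three basic measures in plain Python (here `jaccard` computes the similarity coefficient described above):

```python
def minkowski(x, y, m):
    return sum(abs(xi - yi) ** m for xi, yi in zip(x, y)) ** (1.0 / m)

def hamming(x, y):                 # number of disagreeing 0/1 positions
    return sum(xi != yi for xi, yi in zip(x, y))

def jaccard(x, y):                 # 1-1 matches / positions with at least one 1
    ones_both = sum(xi and yi for xi, yi in zip(x, y))
    ones_any  = sum(xi or yi for xi, yi in zip(x, y))
    return ones_both / ones_any    # 0-0 matches are ignored

print(minkowski([0, 0], [3, 4], 2))        # 5.0 (Euclidean)
print(hamming([1, 0, 1], [1, 1, 0]))       # 2
print(jaccard([1, 0, 1], [1, 1, 0]))       # 1/3
```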
Distance functions on high dimensional data
Examples: time series, text, images. Euclidean measures make all points equally far. Options:
- Reduce the number of dimensions: choose a subset of the original features using random projections or feature selection techniques, or transform the original features using statistical methods like Principal Component Analysis
- Define domain-specific similarity measures: e.g. for images, define features like the number of objects or a color histogram; for time series, define shape-based measures
- Define non-distance-based (model-based) clustering methods
Clustering methods
- Hierarchical clustering: agglomerative vs. divisive; single link vs. complete link
- Partitional clustering: distance-based (K-means), model-based (EM), density-based
A Dendrogram Shows How the Clusters are Merged Hierarchically
Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
Example: five objects a, b, c, d, e with pairwise distances:

      a   b   c   d   e
  a   0
  b   9   0
  c   3   7   0
  d   6   5   9   0
  e  11  10   8   2   0

[Figure: dendrograms over a, b, c, d, e. Agglomerative clustering runs from step 0 (five singletons) to step 4 (one cluster), merging the closest pairs first, e.g. (d, e) and (a, c); divisive clustering traverses the same tree from step 4 down to step 0. Two further dendrograms contrast the single-link and complete-link results.]
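A single-link agglomerative sketch on exactly this distance matrix; at each step it merges the two clusters whose closest members are nearest, printing the merge sequence of the dendrogram:

```python
D = {("a","b"): 9, ("a","c"): 3, ("a","d"): 6, ("a","e"): 11,
     ("b","c"): 7, ("b","d"): 5, ("b","e"): 10,
     ("c","d"): 9, ("c","e"): 8, ("d","e"): 2}

def dist(p, q):
    return D[(p, q)] if (p, q) in D else D[(q, p)]

clusters = [{"a"}, {"b"}, {"c"}, {"d"}, {"e"}]
while len(clusters) > 1:
    # pick the pair of clusters with the smallest single-link distance
    i, j = min(((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
               key=lambda ij: min(dist(p, q) for p in clusters[ij[0]]
                                             for q in clusters[ij[1]]))
    clusters[i] |= clusters.pop(j)       # merge; (d,e) first, then (a,c), ...
    print(clusters)
```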
Pros and Cons
- Single link: confused by near overlap; chaining effect
- Complete link: unnecessary splits of elongated point clouds; sensitive to outliers
Several other hierarchical methods are known.
Partitional methods: K-means
Criterion: minimize the sum of squared distances, either between each point and the centroid of its cluster, or between each pair of points in the cluster.
Algorithm:
1. Select an initial partition with K clusters: random, the first K points, or K well-separated points
2. Repeat until stabilization:
   - Assign each point to the closest cluster center
   - Generate new cluster centers
   - Adjust clusters by merging/splitting
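A minimal k-means sketch (1-D points, "first K" initialization, no merge/split adjustment):

```python
def kmeans(points, k, iters=100):
    centers = points[:k]                          # "first K" initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assign to closest center
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]   # generate new cluster centers
        if new == centers:                        # stabilization reached
            return clusters
        centers = new
    return clusters

print(kmeans([1.0, 2.0, 1.5, 10.0, 11.0, 10.5], k=2))
# -> [[1.0, 2.0, 1.5], [10.0, 11.0, 10.5]]
```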
Association rules
Given a set T of groups of items, for example a set of baskets of items purchased:

T:
  milk, cereal
  tea, milk
  tea, rice, bread
  cereal

Goal: find all rules on itemsets of the form a --> b such that
- the support of a and b is greater than a user threshold s, and
- the conditional probability (confidence) of b given a is greater than a user threshold c.
Example: Milk --> bread. A lot of work has been done on scalable algorithms.
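Support and confidence on the example baskets above, in a few lines of Python:

```python
baskets = [{"milk", "cereal"}, {"tea", "milk"},
           {"tea", "rice", "bread"}, {"cereal"}]

def support(itemset):                # fraction of baskets containing the itemset
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(a, b):                # conditional probability of b given a
    return support(a | b) / support(a)

print(support({"milk"}))                  # 0.5
print(confidence({"milk"}, {"cereal"}))   # 0.5: half the milk baskets have cereal
```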
Variants
- High confidence may not imply high correlation. Use correlations: find the expected support and treat large departures from it as interesting (Brin et al., a limited attempt; more complete work exists in the statistical literature on contingency tables)
- Still too many rules; need to prune...
- Association does not imply causality, as it does in Bayesian networks
Applications of fast itemset counting
Find correlated events:
- Applications in medicine: find redundant tests
- Cross-selling in retail, banking
- Improving the predictive capability of classifiers that assume attribute independence
- New similarity measures for categorical attributes [Mannila et al, KDD 98]
Temporal mining
Several large data domains are inherently temporal:
- Stock prices
- Monitoring data: patient monitors, manufacturing processes, performance logs
- Transaction data
Lots of prior work from signal processing, statistics, speech recognition.
Temporal mining
- Finding significant patterns along time
- Similarity matches and clustering
- Rules along time series: a drop in kerosene prices --> an increase in bronchitis cases
- Classification on time series data: customers with high variance in balance are likely to default; speed fluctuations with significant third-order ARMA coefficients are probably from drunk drivers
- Detecting drift in models along time
Spatial scan statistics
(Paper in reading list)
Mining market
- Around 20 to 30 mining tool vendors
- Major tool players: Clementine, IBM's Intelligent Miner, SGI's MineSet, SAS's Enterprise Miner. All offer pretty much the same set of tools
- Many embedded products: fraud detection, electronic commerce applications, health care, customer relationship management (Epiphany)
Summary
What data mining is, and an overview of the various operations:
- Classification: regression, nearest neighbour, neural networks, Bayesian methods
- Clustering: distance based (k-means), distribution based (EM)
- Itemset counting
There are several operations: the challenge is choosing the right operation for the problem.
Resources
http://www.kdnuggets.com
SIGKDD: http://www.acm.org/sigkdd