Sunita Sarawagi, [email protected]
Data mining and Machine Learning
Data Mining
Data mining is the process of semi-automatically analyzing large databases to find useful patterns.
Prediction based on past history:
- Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history
- Predict whether a pattern of phone calling card usage is likely to be fraudulent
Some examples of prediction mechanisms:
- Classification: given a new item whose class is unknown, predict to which class it belongs
- Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value
Data Mining (Cont.)
Descriptive patterns:
- Associations: find books that are often bought by "similar" customers. If a new such customer buys one such book, suggest the others too. Associations may be used as a first step in detecting causation, e.g. an association between exposure to chemical X and cancer.
- Clusters: e.g. typhoid cases were clustered in an area surrounding a contaminated well. Detection of clusters remains important in detecting epidemics.
Data mining
Data: of various shapes and sizes. Patterns/models: of various shapes and sizes; an abstraction of the data into some understandable and useful form.
Basic structure of data:
- A set of instances/objects/cases/rows/points/examples
- Each instance: a fixed set of attributes/dimensions/columns, either continuous or categorical
Patterns:
- Express one attribute as a function of another: classification, regression
- Group together related instances: clustering, projection, factorization, itemset mining
Classification
Given old data about customers and payments, predict new applicant’s loan eligibility.
[Figure: the classification pipeline. Labeled data about previous customers (age, salary, profession, location, customer type) is fed to a classifier during training, producing a model in the form of decision rules such as "Salary > 5 L" and "Prof. = Exec" predicting Good/bad. During deployment, the model assigns a class label to a new customer's unlabeled data.]
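As a concrete illustration of the train-then-deploy loop in the figure, here is a minimal sketch assuming scikit-learn is available; the customer rows, salary figures, and is_exec encoding are made up for the example, not taken from the slides:

```python
# Minimal sketch of training on labeled data, then deploying on unlabeled data.
# Features: [age, salary, is_exec]; labels: "good"/"bad" (hypothetical data).
from sklearn.tree import DecisionTreeClassifier

X_train = [[35, 700000, 1],   # previous customers (labeled data)
           [50, 300000, 0],
           [28, 800000, 1],
           [40, 200000, 0]]
y_train = ["good", "bad", "good", "bad"]

clf = DecisionTreeClassifier()   # training phase: learn decision rules
clf.fit(X_train, y_train)

x_new = [[30, 600000, 1]]        # deployment: a new customer's unlabeled data
print(clf.predict(x_new))        # -> predicted class label
```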
Applications
- Ad placement in search engines
- Book recommendation
- Citation databases: Google Scholar, CiteSeer
- Resume organization and job matching
- Retail data mining
- Banking: loan/credit card approval; predict good customers based on old customers
- Customer relationship management: identify those who are likely to leave for a competitor
- Targeted marketing: identify likely responders to promotions
- Machine translation
- Speech and handwriting recognition
- Fraud detection: telecommunications, financial transactions; identify fraudulent events from an online stream of events
Applications (continued)
- Medicine: disease outcome, effectiveness of treatments; analyze patient disease history to find relationships between diseases
- Molecular/pharmaceutical: identify new drugs
- Scientific data analysis: identify new galaxies by searching for sub-clusters
- Image and vision: object recognition from images, removing noise from images, identifying scene breaks
The KDD process
- Problem formulation
- Data collection: subset the data (sampling might hurt if the data is highly skewed); feature selection (principal component analysis, heuristic search)
- Pre-processing: cleaning (name/address cleaning, reconciling terms with the same meaning (annual, yearly), duplicate removal, supplying missing values)
- Transformation: map complex objects, e.g. time series data, to features, e.g. frequency
- Choosing the mining task and mining method
- Result evaluation and visualization
Knowledge discovery is an iterative process.
Mining products
[Figure: architecture of a typical mining product. Data is extracted from a data warehouse via ODBC into preprocessing utilities (sampling, attribute transformation), then fed to the mining operations (scalable algorithms for association, classification, clustering, sequence mining), with visualization tools on top.]
Commercial tools: SAS Enterprise Miner, SPSS, IBM Intelligent Miner, Microsoft SQL Server Data Mining services, Oracle Data Mining (ODM).
Free: Weka, individual algorithms.
Mining operations
- Classification: regression, classification trees, neural networks, Bayesian learning, nearest neighbour, radial basis functions, support vector machines, meta learning methods (bagging, boosting)
- Clustering: hierarchical, EM, density based
- Sequence mining: time series similarity, temporal patterns
- Itemset mining: association rules, causality
- Sequential classification: graphical models, hidden Markov models
Classification methods
Goal: predict class Ci = f(x1, x2, ..., xn).
- Regression: linear or any other polynomial
- Decision tree classifier: divides the decision space into piecewise constant regions
- Neural networks: partition by non-linear boundaries
- Probabilistic/generative models
- Lazy learning methods: nearest neighbour
- Support vector machines: find the boundary that maximally separates the classes
Decision tree learning
Decision tree classifiers:
- Widely used learning method
- Easy to interpret: can be re-represented as if-then-else rules
- Approximate the function by piecewise constant regions
- Do not require any prior knowledge of the data distribution; work well on noisy data
Have been applied to: classifying medical patients by disease, equipment malfunction by cause, loan applicants by likelihood of payment, and lots and lots of other applications.
A decision tree is a tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
Decision trees
[Figure: an example decision tree. The root tests Salary < 1 M; one branch tests Prof = teaching, the other tests Age < 30; the leaves are labeled Good or Bad.]
Training Dataset
age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
This follows an example from Quinlan’s ID3
Output: A Decision Tree for “buys_computer”
[Figure: decision tree for "buys_computer". Root: age?. Branch <=30 leads to student? (no → no, yes → yes). Branch 30..40 leads directly to yes. Branch >40 leads to credit rating? (excellent → no, fair → yes).]
Weather Data: Play or not Play?
Outlook Temperature Humidity Windy Play?
sunny hot high false No
sunny hot high true No
overcast hot high false Yes
rain mild high false Yes
rain cool normal false Yes
rain cool normal true No
overcast cool normal true Yes
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No
Note: Outlook is the weather forecast; no relation to the Microsoft email program.
Example Tree for "Play?"
[Figure: root Outlook. Branch sunny leads to Humidity (high → No, normal → Yes). Branch overcast leads to Yes. Branch rain leads to Windy (true → No, false → Yes).]
Topics to be covered
Tree construction:
- Basic tree learning algorithm
- Measures of predictive ability
- High performance decision tree construction: SPRINT
Tree pruning:
- Why prune
- Methods of pruning
Other issues:
- Handling missing data
- Continuous class labels
- Effect of training size

Tree learning algorithms: ID3 (Quinlan 1986), its successor C4.5 (Quinlan 1993), CART, SLIQ (Mehta et al.), SPRINT (Shafer et al.)
Basic algorithm for tree building
Greedy top-down construction:

Gen_Tree(node, data)
  if node should be made a leaf: stop
  find the best attribute and the best split on that attribute
  partition data on the split condition
  for each child j of node: Gen_Tree(node_j, data_j)
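A runnable Python sketch of this greedy loop, under simplifying assumptions (attributes stored in dicts of strings, binary equality splits, entropy as the impurity measure, majority-vote leaves); a teaching simplification, not the full algorithm from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gen_tree(data):                      # data: list of (attribute_dict, label)
    labels = [y for _, y in data]
    if len(set(labels)) == 1:            # make node a leaf? yes -> stop
        return labels[0]
    best = None
    for attr in data[0][0]:              # find best attribute and best split
        for val in {x[attr] for x, _ in data}:
            parts = [[(x, y) for x, y in data if (x[attr] == val) == side]
                     for side in (True, False)]
            if not parts[0] or not parts[1]:
                continue
            score = sum(len(p) * entropy([y for _, y in p]) for p in parts) / len(data)
            if best is None or score < best[0]:
                best = (score, attr, val, parts)
    if best is None:                     # no attribute separates -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    _, attr, val, (eq, neq) = best
    # partition data on the split condition and recurse on each child
    return {attr + "==" + str(val): gen_tree(eq),
            attr + "!=" + str(val): gen_tree(neq)}
```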
Split criteria
Select the attribute that is best for classification. Intuitively, pick the one that best separates instances of different classes.
Quantifying this intuition requires a measure of the impurity of an arbitrary set S consisting of K classes. A good measure is:
- smallest when S consists of only one class,
- highest when all classes are present in equal number,
- and should allow computation in multiple stages.
Measures of impurity
Entropy: $Entropy(S) = -\sum_{i=1}^{k} p_i \log p_i$
Gini: $Gini(S) = 1 - \sum_{i=1}^{k} p_i^2$
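Both measures are a few lines of code; a sketch assuming the class probabilities p_1..p_k are already computed:

```python
import math

def entropy(p):                    # p: list of class probabilities summing to 1
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)   # 0*log 0 = 0

def gini(p):
    return 1.0 - sum(pi * pi for pi in p)

print(entropy([0.5, 0.5]))  # 1.0: highest when classes are in equal number
print(gini([1.0]))          # 0.0: smallest when only one class is present
```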
Information gain
Information gain on partitioning S into r subsets = impurity of S minus the sum of the weighted impurities of the subsets:
$Gain(S, S_1 \ldots S_r) = Entropy(S) - \sum_{j=1}^{r} \frac{|S_j|}{|S|}\, Entropy(S_j)$
[Figure: impurity as a function of p1 for a two-class set: entropy peaks at 1 when p1 = 0.5; Gini peaks at 0.5 when p1 = 0.5.]
Information gain: example
K = 2, |S| = 100, p1 = 0.6, p2 = 0.4
E(S) = -0.6 log(0.6) - 0.4 log(0.4) = 0.29
Split into S1, S2:
|S1| = 70, p1 = 0.8, p2 = 0.2, so E(S1) = -0.8 log(0.8) - 0.2 log(0.2) = 0.21
|S2| = 30, p1 = 0.13, p2 = 0.87, so E(S2) = -0.13 log(0.13) - 0.87 log(0.87) = 0.16
Information gain: E(S) - (0.7 E(S1) + 0.3 E(S2)) = 0.1
(Logarithms here are base 10.)
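The slide's numbers can be reproduced directly; they come out right only with base-10 logarithms, and the slide truncates rather than rounds:

```python
import math

def entropy10(ps):                      # the slide uses base-10 logs here
    return -sum(p * math.log10(p) for p in ps if p > 0)

E_S  = entropy10([0.6, 0.4])            # 0.292 (slide: 0.29)
E_S1 = entropy10([0.8, 0.2])            # 0.217 (slide: 0.21)
E_S2 = entropy10([0.13, 0.87])          # 0.168 (slide: 0.16)
gain = E_S - (0.7 * E_S1 + 0.3 * E_S2)  # 0.090 (slide: ~0.1)
print(E_S, E_S1, E_S2, gain)
```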
Weather Data: Play or not Play? (the same table as above, repeated)
Which attribute to select?
witten&eibe
Example: attribute "Outlook"
"Outlook" = "Sunny": info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits
"Outlook" = "Overcast": info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits
"Outlook" = "Rainy": info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits
(Note: log(0) is not defined, but we evaluate 0 log(0) as zero.)
Expected information for the attribute:
info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
witten&eibe
Computing the information gain
Information gain = (information before split) - (information after split):
gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits
Information gain for the attributes from the weather data:
gain("Outlook") = 0.247 bits
gain("Temperature") = 0.029 bits
gain("Humidity") = 0.152 bits
gain("Windy") = 0.048 bits
witten&eibe
Continuing to split (within the "Sunny" branch):
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits
So Humidity is chosen next.
witten&eibe
The final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes. Splitting stops when the data cannot be split any further.
witten&eibe
Preventing overfitting
A tree T overfits if there is another tree T' that gives higher error on the training data yet lower error on unseen data. An overfitted tree does not generalize to unseen instances. Overfitting happens when the data contains noise or irrelevant attributes and the training size is small. It can reduce accuracy drastically: by 10-25%, as reported in Mingers' 1989 Machine Learning article.
Example of over-fitting with binary data. Compare error rates measured on the learning data and on a large test set:
- The learning-set error R(T) always decreases as the tree grows (Q: why?)
- The test-set error Rts(T) first declines, then increases (Q: why?)
- Overfitting is the result of too much reliance on the learning-set error R(T)
- It can lead to disasters when the tree is applied to new data
Training data vs. test data error rates (digit recognition dataset, from the CART book):

No. terminal nodes   R(T)   Rts(T)
71                   .00    .42
63                   .00    .40
58                   .03    .39
40                   .10    .32
34                   .12    .32
19                   .20    .31
**10                 .29    .30
 9                   .32    .34
 7                   .41    .47
 6                   .46    .54
 5                   .53    .61
 2                   .75    .82
 1                   .86    .91

(** marks the tree size with the lowest test error.)
Overfitting example
Consider the case where a single attribute xj is adequate for classification, but with an error of 20%. Now add lots of other noise attributes that enable zero error during training. During testing, this detailed tree will have an expected error of (0.8 × 0.2 + 0.2 × 0.8) = 32%, whereas the pruned tree with only a single split on xj will have an error of only 20%.
Approaches to prevent overfitting
Two approaches:
1. Stop growing the tree beyond a certain point. Tricky, since even when the information gain of an attribute is zero it might still be useful (XOR example).
2. First over-fit, then post-prune (more widely used). Tree building is divided into a growth phase and a prune phase.
Criteria for finding the correct final tree size, three options:
- Cross validation with separate test data
- Statistical bounds: use all data for training but apply a statistical test to decide the right size (a cross-validation dataset may be used to set the threshold)
- Use some criterion function to choose the best size, e.g. the minimum description length (MDL) criterion
Cross validation
Partition the dataset into two disjoint parts:
1. A training set used for building the tree.
2. A validation set used for pruning the tree. Rule of thumb: 2/3rds training, 1/3rd validation.
Evaluate the tree on the validation set and, at each leaf and internal node, keep a count of the correctly labeled data. Starting bottom-up, prune nodes whose error is less than that of their children.
What if the training data set size is limited? Use n-fold cross validation: partition the training data into n parts D1, D2, ..., Dn. Train n classifiers, using D - Di as training data and Di as the test set. Pick the average. (How?)
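A minimal sketch of the n-fold scheme; `train` and `error` stand for whatever classifier-specific routines are in use (hypothetical names, not a library API):

```python
# n-fold cross validation: train on D - Di, evaluate on Di, average the errors.
def cross_validate(data, n, train, error):
    folds = [data[i::n] for i in range(n)]          # partition into D1..Dn
    errs = []
    for i in range(n):
        held_out = folds[i]                         # Di as the test set
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(training)                     # train on D - Di
        errs.append(error(model, held_out))
    return sum(errs) / n                            # "pick the average"
```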
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = "<=30" AND credit_rating = "fair" THEN buys_computer = "no"
Rule-based pruning
Tree-based pruning limits the kind of pruning possible: if a node is pruned, all subtrees under it have to be pruned. Rule-based pruning instead:
- For each leaf of the tree, extract a rule using a conjunction of all tests up to the root.
- On the validation set, independently prune tests from each rule to get the highest accuracy for that rule.
- Sort the rules by decreasing accuracy.
Regression trees
Decision trees with continuous class labels: regression trees approximate the function with piecewise constant regions.
Split criteria for regression trees:
- Predicted value for a set S = the average of all values in S
- Error: the sum of squared errors of each member of S from the predicted average
- Pick the split with the smallest average error
Splits on categorical attributes: can this be done better than for discrete class labels? Homework.
Other types of trees
- Multi-way trees on low-cardinality categorical data
- Multiple splits on continuous attributes [Fayyad 93, Multi-interval discretization of continuous attributes]
- Multi-attribute tests on nodes to handle correlated attributes: multivariate linear splits [Oblique trees, Murthy 94]
Issues
- Methods of handling missing values: assume the majority value, or take the most probable path
- Allowing varying costs for different attributes
Pros and Cons of decision trees
• Cons
– Not effective for very high dimensional data, where information about the class is spread in small amounts over many correlated features (example: words in text classification)
– Not robust to the dropping of important features, even when correlated substitutes exist in the data
• Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Intuitive
From Jiawei Han's slides
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function can be discrete- or real-valued. For a discrete-valued target, k-NN returns the most common value among the k training examples nearest to xq.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: a query point xq among positive (+) and negative (−) training points, and the Voronoi cells they induce.]
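A minimal k-NN sketch for a discrete-valued target, using Euclidean distance and a majority vote as described above:

```python
import math
from collections import Counter

def knn_predict(train, xq, k=3):
    # train: list of (point, label); point: tuple of numbers
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(train, key=lambda xy: dist(xy[0], xq))[:k]   # k closest
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict([((0, 0), "-"), ((1, 0), "-"), ((5, 5), "+"), ((6, 5), "+")],
                  (5, 4), k=3))   # -> "+"
```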
Other lazy learning methods
- Locally weighted regression: learn a new regression equation by weighting each training instance based on its distance from the new instance
- Radial basis functions
• Cons
– Slow during application
– No feature selection
– Notion of proximity is vague
• Pros
+ Fast training
Bayesian learning
Assume a probability model on the generation of the data. Apply Bayes' theorem to find the most likely class:
$c = \arg\max_{c_j} p(c_j \mid d) = \arg\max_{c_j} \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$
Naïve Bayes: assume the attributes are conditionally independent given the class value:
$c = \arg\max_{c_j} \frac{p(c_j)}{p(d)} \prod_{i=1}^{n} p(a_i \mid c_j)$
- Probabilities are easy to learn by counting, in one pass over the data
- Useful in some domains, e.g. text
- Numeric attributes must be discretized
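The one-pass counting view of naïve Bayes can be sketched directly; this toy version uses raw relative frequencies with no smoothing, so unseen attribute values get probability zero:

```python
from collections import Counter, defaultdict

def train_nb(rows):  # rows: list of (attribute_tuple, class_label)
    class_counts = Counter(c for _, c in rows)
    attr_counts = defaultdict(Counter)        # (class, position) -> value counts
    for attrs, c in rows:                     # one pass over the data
        for i, a in enumerate(attrs):
            attr_counts[(c, i)][a] += 1
    n = len(rows)
    def predict(attrs):
        def score(c):
            p = class_counts[c] / n           # p(c_j)
            for i, a in enumerate(attrs):     # prod_i p(a_i | c_j), no smoothing
                p *= attr_counts[(c, i)][a] / class_counts[c]
            return p
        return max(class_counts, key=score)
    return predict

predict = train_nb([(("sunny", "high"), "No"), (("sunny", "normal"), "Yes"),
                    (("overcast", "high"), "Yes"), (("rain", "normal"), "Yes")])
print(predict(("sunny", "high")))   # -> "No" (tiny toy example)
```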
Bayesian belief networks
- Find the joint probability over a set of variables, making use of conditional independence whenever known
- Learning the parameters is hard when there are hidden units: use gradient descent / EM algorithms
- Learning the structure of the network is harder
[Figure: an example network over variables a, b, c, d, e with a conditional probability table; variable e is independent of d given b.]
Neural networks
Useful for learning complex data like handwriting, speech and image recognition.
[Figure: decision boundaries learned by linear regression, by a classification tree (axis-parallel, piecewise constant), and by a neural network (smooth non-linear).]
Pros and Cons of Neural Networks
• Cons
– Slow training time
– Hard to interpret
– Hard to implement: trial and error for choosing the number of nodes
• Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features
Conclusion: use neural nets only if decision trees / nearest neighbour fail.
Linear discriminants
Problem setting: a binary classification problem with points in d dimensions, and training data of n vectors with predictions, of the form (x1, y1), ..., (xn, yn), where each y takes the value 1 or -1. The goal is to learn a function of the form:
F(x) = w·x + b = w1x1 + w2x2 + ... + wdxd + b
Linear regression
Developed for the case of real-valued y: y = f(x) = w·x + w0. Rewrite as y = w·x with each x padded with a 1.
Error: $E(w) = \sum_i (y_i - w \cdot x_i)^2$
Minimize the error by differentiating with respect to w. The minimum is reached at w = (X'X)^{-1} X'Y.
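The closed form is one line with numpy; a sketch on a tiny made-up dataset where y = 1 + 2x:

```python
# Least squares via the normal equations: w = (X'X)^{-1} X'Y.
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # each x padded with a 1
Y = np.array([3.0, 5.0, 7.0])                        # y = 1 + 2x
w = np.linalg.inv(X.T @ X) @ X.T @ Y
print(w)   # ~[1. 2.], i.e. w0 = 1, w1 = 2
```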
Fisher's linear discriminant
Find the hyperplane (w, b) on which the projection of the data is maximally separated.
Cost function (separation): $\frac{(\mu_1 - \mu_2)^2}{p_1 \sigma_1^2 + p_2 \sigma_2^2}$, where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the projected points w·x + b for all points x in class i, and $p_i$ is the fraction of points in class i.
The linear discriminant maximizing this separation is given by w = S^{-1}(m1 - m2), where m1 and m2 are the means of the x values in each class and S is the covariance matrix of the data; the threshold b is set at the projection of the mid-point (m1 + m2)/2 of the two means onto the linear discriminant.
Fisher's discriminant
[Figure: data projected onto two candidate directions; the chosen direction maximizes the separation between the averages of the projected red and black points.]
Shortcomings
The perceptron is ill-posed: several values of w might yield the same zero error on the training data.
Support vector machines
Binary classifiers that find the hyperplane providing the maximum margin between the vectors of the two classes.
[Figure: two classes of points in the (fi, fj) plane, separated by a maximum-margin hyperplane.]
Support vector machines
Separators with larger margin will have smaller generalization error
Geometry of SVMs
[Figure: for the separating hyperplane w·x + b = 0 in the (fi, fj) plane: the distance of the plane from the origin is -b/||w||, the distance of a point x from the plane is (w·x + b)/||w||, and the margin on each side of the plane is 1/||w|| when |w·x + b| = 1 at the support vectors.]
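The quantities in the figure, computed for a concrete (made-up) hyperplane, assuming numpy:

```python
import numpy as np

w = np.array([3.0, 4.0]); b = -5.0          # hyperplane w.x + b = 0, ||w|| = 5
x = np.array([2.0, 1.0])

print(abs(b) / np.linalg.norm(w))           # distance of the plane from the origin
print((w @ x + b) / np.linalg.norm(w))      # signed distance of x from the plane
print(1.0 / np.linalg.norm(w))              # margin per side when |w.x + b| = 1
```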
Linear separators
Most complex real-world applications require more than linear separators. One way to get around the problem is to represent the data in a transformed coordinate space in which linear separators can be learnt.
Example: f(m1, m2, r) = C m1 m2 / r^2 is not linear in (m1, m2, r), but it is linear in (ln m1, ln m2, ln r), since ln f = ln C + ln m1 + ln m2 - 2 ln r.
Support Vector Machines
Extendable to:
- Non-separable problems (Cortes & Vapnik, 1995)
- Non-linear classifiers (Boser et al., 1992)
Good generalization performance: OCR (Boser et al.), vision (Poggio et al.), text classification (Joachims).
Requires tuning: which kernel, what parameters?
Several freely available packages: SVMTorch.
Locally Weighted Regression
Learn a new regression equation by weighting each training instance based on its distance from the new instance:
- Construct an explicit approximation to f over a local region surrounding the query instance xq.
- The target function f is approximated near xq using the linear function:
$\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$
- Instead of the global squared error $E(D) = \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2$, find the wi so as to minimize a local squared error with a distance-decreasing weight K:
$E(x_q) = \frac{1}{2} \sum_{x \,\in\, k \text{ nearest neighbors of } x_q} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))$
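A numpy sketch of the idea; the Gaussian kernel used for K is an illustrative choice (the slide only requires a distance-decreasing weight), and all points are weighted rather than only the k nearest:

```python
import numpy as np

def lwr_predict(X, y, xq, tau=1.0):
    Xp = np.hstack([np.ones((len(X), 1)), X])            # pad each x with a 1
    xqp = np.hstack([1.0, xq])
    d2 = ((X - xq) ** 2).sum(axis=1)                     # squared distances to xq
    K = np.exp(-d2 / (2 * tau ** 2))                     # distance-decreasing weights
    W = np.diag(K)
    w = np.linalg.solve(Xp.T @ W @ Xp, Xp.T @ W @ y)     # weighted least squares
    return xqp @ w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])                       # y = x^2
print(lwr_predict(X, y, np.array([1.5]), tau=0.5))       # a locally linear fit
```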
Feature subset selection
- Embedded: inside the mining algorithm
- Filter: select features in advance
- Wrapper: generate candidate feature subsets and test them on the black-box mining algorithm; high cost, but provides better, algorithm-dependent features
Meta learning methods
No single classifier is good in all cases, and it is difficult to evaluate the conditions in advance. Meta learning combines the effects of multiple classifiers:
- Voting: sum up the votes of the component classifiers
- Combiners: learn a new classifier on the outcomes of the previous ones
- Boosting: staged classifiers
Disadvantage: interpretation is hard. Knowledge probing: learn a single classifier to mimic the meta classifier.
Clustering or Unsupervised learning
Applications:
- Customer segmentation, e.g. for targeted marketing: group/cluster existing customers based on the time series of their payment history such that similar customers fall in the same cluster; identify micro-markets and develop policies for each
- Collaborative filtering: group based on common items purchased
- Image tiling
- Text clustering, e.g. scatter/gather
- Compression
Distance functions
- Numeric data: Euclidean, Manhattan distances. Minkowski metric: $\left[\sum_i |x_i - y_i|^m\right]^{1/m}$; larger m gives higher weight to larger distances
- Categorical data: 0/1 to indicate presence/absence. Euclidean distance: equal weightage to 1-1 and 0-0 matches. Hamming distance (number of dissimilarities). Jaccard coefficient: number of 1-1 matches / number of positions with at least one 1 (0-0 matches are not important)
- Data-dependent measures: the similarity of A and B depends on their co-occurrence with C
- Combined numeric and categorical data: weighted normalized distance
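The three basic measures in plain Python (here `jaccard` computes the similarity coefficient described above):

```python
def minkowski(x, y, m):
    return sum(abs(xi - yi) ** m for xi, yi in zip(x, y)) ** (1.0 / m)

def hamming(x, y):                 # number of disagreeing 0/1 positions
    return sum(xi != yi for xi, yi in zip(x, y))

def jaccard(x, y):                 # 1-1 matches / positions with at least one 1
    ones_both = sum(xi and yi for xi, yi in zip(x, y))
    ones_any  = sum(xi or yi for xi, yi in zip(x, y))
    return ones_both / ones_any    # 0-0 matches are ignored

print(minkowski([0, 0], [3, 4], 2))        # 5.0 (Euclidean)
print(hamming([1, 0, 1], [1, 1, 0]))       # 2
print(jaccard([1, 0, 1], [1, 1, 0]))       # 1/3
```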
Distance functions on high dimensional data
Examples: time series, text, images. Euclidean measures make all points equally far. Options:
- Reduce the number of dimensions: choose a subset of the original features using random projections or feature selection techniques, or transform the original features using statistical methods like Principal Component Analysis
- Define domain-specific similarity measures: e.g. for images, define features like the number of objects or a color histogram; for time series, define shape-based measures
- Define non-distance-based (model-based) clustering methods
Clustering methods
- Hierarchical clustering: agglomerative vs. divisive; single link vs. complete link
- Partitional clustering: distance-based (K-means), model-based (EM), density-based
A Dendrogram Shows How the Clusters are Merged Hierarchically
Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
Example: five objects a, b, c, d, e with pairwise distances:

      a   b   c   d   e
  a   0
  b   9   0
  c   3   7   0
  d   6   5   9   0
  e  11  10   8   2   0

[Figure: dendrograms over a, b, c, d, e. Agglomerative clustering runs from step 0 (five singletons) to step 4 (one cluster), merging the closest pairs first, e.g. (d, e) and (a, c); divisive clustering traverses the same tree from step 4 down to step 0. Two further dendrograms contrast the single-link and complete-link results.]
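A single-link agglomerative sketch on exactly this distance matrix; at each step it merges the two clusters whose closest members are nearest, printing the merge sequence of the dendrogram:

```python
D = {("a","b"): 9, ("a","c"): 3, ("a","d"): 6, ("a","e"): 11,
     ("b","c"): 7, ("b","d"): 5, ("b","e"): 10,
     ("c","d"): 9, ("c","e"): 8, ("d","e"): 2}

def dist(p, q):
    return D[(p, q)] if (p, q) in D else D[(q, p)]

clusters = [{"a"}, {"b"}, {"c"}, {"d"}, {"e"}]
while len(clusters) > 1:
    # pick the pair of clusters with the smallest single-link distance
    i, j = min(((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
               key=lambda ij: min(dist(p, q) for p in clusters[ij[0]]
                                             for q in clusters[ij[1]]))
    clusters[i] |= clusters.pop(j)       # merge; (d,e) first, then (a,c), ...
    print(clusters)
```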
Pros and Cons
- Single link: confused by near overlap; chaining effect
- Complete link: unnecessary splits of elongated point clouds; sensitive to outliers
Several other hierarchical methods are known.
Partitional methods: K-means
Criterion: minimize the sum of squared distances, either between each point and the centroid of its cluster, or between each pair of points in the cluster.
Algorithm:
1. Select an initial partition with K clusters: random, the first K points, or K well-separated points
2. Repeat until stabilization:
   - Assign each point to the closest cluster center
   - Generate new cluster centers
   - Adjust clusters by merging/splitting
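A minimal k-means sketch (1-D points, "first K" initialization, no merge/split adjustment):

```python
def kmeans(points, k, iters=100):
    centers = points[:k]                          # "first K" initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assign to closest center
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]   # generate new cluster centers
        if new == centers:                        # stabilization reached
            return clusters
        centers = new
    return clusters

print(kmeans([1.0, 2.0, 1.5, 10.0, 11.0, 10.5], k=2))
# -> [[1.0, 2.0, 1.5], [10.0, 11.0, 10.5]]
```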
Association rules
Given a set T of groups of items, for example a set of baskets of items purchased:

T:
  milk, cereal
  tea, milk
  tea, rice, bread
  cereal

Goal: find all rules on itemsets of the form a --> b such that
- the support of a and b is greater than a user threshold s, and
- the conditional probability (confidence) of b given a is greater than a user threshold c.
Example: Milk --> bread. A lot of work has been done on scalable algorithms.
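Support and confidence on the example baskets above, in a few lines of Python:

```python
baskets = [{"milk", "cereal"}, {"tea", "milk"},
           {"tea", "rice", "bread"}, {"cereal"}]

def support(itemset):                # fraction of baskets containing the itemset
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(a, b):                # conditional probability of b given a
    return support(a | b) / support(a)

print(support({"milk"}))                  # 0.5
print(confidence({"milk"}, {"cereal"}))   # 0.5: half the milk baskets have cereal
```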
Variants
- High confidence may not imply high correlation. Use correlations: find the expected support and treat large departures from it as interesting (Brin et al., a limited attempt; more complete work exists in the statistical literature on contingency tables)
- Still too many rules; need to prune...
- Association does not imply causality, as it does in Bayesian networks
Applications of fast itemset counting
Find correlated events:
- Applications in medicine: find redundant tests
- Cross-selling in retail, banking
- Improving the predictive capability of classifiers that assume attribute independence
- New similarity measures for categorical attributes [Mannila et al, KDD 98]
Temporal mining
Several large data domains are inherently temporal:
- Stock prices
- Monitoring data: patient monitors, manufacturing processes, performance logs
- Transaction data
Lots of prior work from signal processing, statistics, speech recognition.
Temporal mining
- Finding significant patterns along time
- Similarity matches and clustering
- Rules along time series: a drop in kerosene prices --> an increase in bronchitis cases
- Classification on time series data: customers with high variance in balance are likely to default; speed fluctuations with significant third-order ARMA coefficients are probably from drunk drivers
- Detecting drift in models along time
Spatial scan statistics
(Paper in reading list)
Mining market
- Around 20 to 30 mining tool vendors
- Major tool players: Clementine, IBM's Intelligent Miner, SGI's MineSet, SAS's Enterprise Miner. All offer pretty much the same set of tools
- Many embedded products: fraud detection, electronic commerce applications, health care, customer relationship management (Epiphany)
Summary
What data mining is, and an overview of the various operations:
- Classification: regression, nearest neighbour, neural networks, Bayesian methods
- Clustering: distance based (k-means), distribution based (EM)
- Itemset counting
There are several operations: the challenge is choosing the right operation for the problem.
Resources
http://www.kdnuggets.com
SIGKDD: http://www.acm.org/sigkdd