Data Mining Algorithms
Prof. S. Sudarshan CSE Dept, IIT Bombay
Most Slides Courtesy Prof. Sunita Sarawagi
School of IT, IIT Bombay
Overview
Decision Tree classification algorithms
Clustering algorithms
Challenges
Resources
Decision Tree Classifiers
Decision tree classifiers
Widely used learning method.
Easy to interpret: can be re-represented as if-then-else rules.
Approximates the function by piecewise constant regions.
Does not require any prior knowledge of the data distribution; works well on noisy data.
Has been applied to: classifying medical patients by disease, equipment malfunction by cause, and loan applicants by likelihood of payment.
Setting
Given old data about customers and payments, predict a new applicant's loan eligibility.
Attributes: Age, Salary, Profession, Location, Customer type.
[Figure: previous customers feed a classifier, which produces decision rules such as "Salary > 5 L" and "Prof. = Exec"; the rules map a new applicant's data to Good/bad.]
Decision trees
Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
[Figure: example tree with internal nodes testing Salary < 1 M, Prof = teaching, and Age < 30, and leaves labeled Good/Bad.]
One plausible reading of the figure, written as if-then-else rules, is sketched below.
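The exact branch layout is hard to recover from the slide, so the orientation of the tests below is an assumption; the point is only that the tree translates directly into nested if-then-else rules:

```python
def classify(salary, profession, age):
    """One plausible reading of the example tree as if-then-else rules."""
    if salary >= 1_000_000:       # "Salary < 1 M" is false
        return "Good"
    if profession == "teaching":  # "Prof = teaching"
        return "Bad"
    return "Bad" if age < 30 else "Good"  # "Age < 30"
```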
Topics to be covered
Tree construction: basic tree learning algorithm; measures of predictive ability; high-performance decision tree construction (SPRINT).
Tree pruning: why prune; methods of pruning.
Other issues: handling missing data; continuous class labels; effect of training size.
Tree learning algorithms
ID3 (Quinlan 1986) and its successor C4.5 (Quinlan 1993); SLIQ (Mehta et al.); SPRINT (Shafer et al.)
Basic algorithm for tree building
Greedy top-down construction.

Gen_Tree(node, data):
    if node can be made a leaf: stop
    find the best attribute and the best split on that attribute (selection criteria)
    partition the data on the split condition
    for each child j of node: Gen_Tree(node_j, data_j)
Split criteria
Select the attribute that is best for classification.
Intuitively, pick the one that best separates instances of different classes.
Quantifying that intuition means measuring separability:
first define the impurity of an arbitrary set S consisting of K classes.
Impurity Measures
Information entropy:
$$Entropy(S) = -\sum_{i=1}^{k} p_i \log p_i$$
Zero when S consists of only one class; maximal (1 for two classes with base-2 logs) when all classes occur in equal numbers.
Another measure of impurity, the Gini index:
$$Gini(S) = 1 - \sum_{i=1}^{k} p_i^2$$
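Both measures can be computed directly from the class distribution; a minimal sketch, assuming base-2 logarithms (so the entropy of two equally frequent classes is exactly 1):

```python
import math

def entropy(p):
    """Entropy of a class distribution p = [p_1, ..., p_k]."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    """Gini impurity of the same distribution."""
    return 1 - sum(pi * pi for pi in p)

print(entropy([1.0]))       # 0.0 -- a pure set
print(entropy([0.5, 0.5]))  # 1.0 -- two classes in equal number
print(gini([0.5, 0.5]))     # 0.5
```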
Split criteria
K classes; a set S of instances is partitioned into r subsets S_1, ..., S_r. Subset S_j has fraction p_ij of its instances in class i.
Information entropy of the partition:
$$Entropy(S_1,\dots,S_r) = -\sum_{j=1}^{r}\frac{|S_j|}{|S|}\sum_{i=1}^{k} p_{ij}\log p_{ij}$$
Gini index of the partition:
$$Gini(S_1,\dots,S_r) = \sum_{j=1}^{r}\frac{|S_j|}{|S|}\left(1-\sum_{i=1}^{k} p_{ij}^2\right)$$
[Plot: impurity as a function of class fraction for r = 1, k = 2, showing the Gini curve.]
Information gain
Information gain on partitioning S into r subsets = impurity(S) minus the weighted sum of the impurity of each subset:
$$Gain(S; S_1,\dots,S_r) = Entropy(S) - \sum_{j=1}^{r}\frac{|S_j|}{|S|}\,Entropy(S_j)$$
Information gain: example
S: K = 2, |S| = 100, p1 = 0.6, p2 = 0.4
E(S) = -0.6 log 0.6 - 0.4 log 0.4 = 0.29 (base-10 logs)
Partition S into S1 and S2:
|S1| = 70, p1 = 0.8, p2 = 0.2; E(S1) = -0.8 log 0.8 - 0.2 log 0.2 = 0.21
|S2| = 30, p1 = 0.13, p2 = 0.87; E(S2) = -0.13 log 0.13 - 0.87 log 0.87 = 0.16
Information gain: E(S) - (0.7 E(S1) + 0.3 E(S2)) = 0.1
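The arithmetic can be checked mechanically; a small sketch using base-10 logarithms, which is what reproduces the slide's numbers (up to rounding):

```python
import math

def H(p):
    """Entropy with base-10 logs, matching the slide's arithmetic."""
    return -sum(pi * math.log10(pi) for pi in p if pi > 0)

E_S, E_S1, E_S2 = H([0.6, 0.4]), H([0.8, 0.2]), H([0.13, 0.87])
print(round(E_S, 2), round(E_S1, 2), round(E_S2, 2))  # 0.29 0.22 0.17 (roughly the slide's values)
print(round(E_S - (0.7 * E_S1 + 0.3 * E_S2), 2))      # 0.09, i.e. about 0.1
```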
Meta learning methods
No single classifier is good in all cases, and it is difficult to evaluate in advance which conditions favor which classifier.
Meta learning: combine the effects of several classifiers.
Voting: sum up the votes of the component classifiers.
Combiners: learn a new classifier on the outcomes of previous ones.
Boosting: staged classifiers.
Disadvantage: interpretation is hard.
Knowledge probing: learn a single classifier to mimic the meta classifier.
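Voting is the simplest combiner; a minimal sketch, assuming each component classifier is a function from an instance to a label (the model names in the usage comment are hypothetical):

```python
from collections import Counter

def vote(classifiers, x):
    """Majority vote over component classifiers."""
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]

# e.g. vote([tree.predict, knn.predict, nb.predict], x)
```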
SPRINT (Scalable PaRallelizable INduction of decision Trees)
Decision-tree classifier for data mining.
Design goals:
Able to handle large disk-resident training sets
No restrictions on training-set size
Easily parallelizable
Example
Example data:
Age  Car Type  Risk
42   family    Low
18   truck     High
57   sports    High
21   sports    High
28   family    Low
72   truck     Low

Resulting tree: Age < 25 → High; otherwise, CarType in {sports} → High, else Low.
Building tree
GrowTree(TrainingData D):
    Partition(D)

Partition(Data D):
    if all points in D belong to the same class then return
    for each attribute A do evaluate splits on attribute A
    use the best split found to partition D into D1 and D2
    Partition(D1)
    Partition(D2)
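A runnable sketch of this skeleton, assuming numeric attributes and a naive exhaustive split search (SPRINT, described next, replaces this search with pre-sorted attribute lists):

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows):
    """rows: list of (features, label). Returns the (attr_index, threshold)
    pair with the lowest weighted Gini over the resulting partition."""
    n, best = len(rows), (float("inf"), None, None)
    for attr in range(len(rows[0][0])):
        for x, _ in rows:
            left  = [lab for f, lab in rows if f[attr] <  x[attr]]
            right = [lab for f, lab in rows if f[attr] >= x[attr]]
            if not left or not right:
                continue
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best[0]:
                best = (score, attr, x[attr])
    return best[1], best[2]

def grow_tree(rows):
    labels = [lab for _, lab in rows]
    if len(set(labels)) == 1:          # all points in one class: make a leaf
        return labels[0]
    attr, thresh = best_split(rows)
    if attr is None:                   # no useful split: majority-class leaf
        return max(set(labels), key=labels.count)
    left  = [r for r in rows if r[0][attr] <  thresh]
    right = [r for r in rows if r[0][attr] >= thresh]
    return {"attr": attr, "thresh": thresh,
            "left": grow_tree(left), "right": grow_tree(right)}
```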
Data Setup: Attribute Lists
One list for each attribute. Entries in an attribute list consist of: attribute value, class value, record id.
Lists for continuous attributes are kept in sorted order. Lists may be disk-resident. Each leaf node has its own set of attribute lists representing the training examples belonging to that leaf.

Example list (sorted on Age):
Age  Risk  RID
17   High  1
20   High  5
23   High  0
32   Low   4
43   High  2
68   Low   3
Attribute Lists: Example
Input data:
Age  Car Type  Risk
23   family    High
17   sports    High
43   sports    High
68   family    Low
32   truck     Low
20   family    High

Initial attribute lists for the root node (the Age list is sorted on value; the Car Type list stays in record order):

Age  Risk  RID
17   High  1
20   High  5
23   High  0
32   Low   4
43   High  2
68   Low   3

Car Type  Risk  RID
family    High  0
sports    High  1
sports    High  2
family    Low   3
truck     Low   4
family    High  5
Evaluating Split Points
Gini index: if data D contains examples from c classes,
$$Gini(D) = 1 - \sum_{j=1}^{c} p_j^2$$
where p_j is the relative frequency of class j in D.
If D is split into D1 and D2 with n1 and n2 tuples respectively (n = n1 + n2):
$$Gini_{split}(D) = \frac{n_1}{n}\,Gini(D_1) + \frac{n_2}{n}\,Gini(D_2)$$
Note: only class frequencies are needed to compute the index.
Finding Split Points
For each attribute A, evaluate splits on attribute A using its attribute list. Keep the split with the lowest Gini index.
Finding Split Points: Continuous Attrib.
Consider splits of the form: value(A) < x. Example: Age < 17.
Evaluate this split form for every value in the attribute list.
To evaluate splits on attribute A for a given tree node:
    initialize the class histogram of the left child to zeroes;
    initialize the class histogram of the right child to the same as its parent;
    for each record in the attribute list do:
        evaluate the splitting index for value(A) < record.value;
        using the class label of the record, update the class histograms;
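A sketch of that scan, assuming the attribute list is already sorted on value; records move from the right histogram to the left one as the cursor advances:

```python
from collections import Counter

def best_numeric_split(attr_list, classes=("High", "Low")):
    """attr_list: (value, label, rid) tuples, sorted on value.
    One pass, maintaining class histograms of both children."""
    def gini(hist, m):
        return 1 - sum((hist[c] / m) ** 2 for c in classes)

    right = Counter(label for _, label, _ in attr_list)  # right child starts as the parent
    left, n = Counter(), len(attr_list)
    best = (float("inf"), None)
    for value, label, _ in attr_list:
        n_left = sum(left.values())
        if n_left > 0:  # the split "A < value" is undefined while one side is empty
            score = (n_left / n) * gini(left, n_left) \
                  + ((n - n_left) / n) * gini(right, n - n_left)
            best = min(best, (score, value))
        left[label] += 1   # cursor moves past this record:
        right[label] -= 1  # it now belongs to the left child
    return best            # (lowest Gini, threshold); (0.222, 32) on the Age list below
```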
Finding Split Points: Continuous Attrib.
Attribute list (Age):
Age  Risk  RID
23   High  0
17   High  1
43   High  2
68   Low   3
32   Low   4
20   High  5

State of the class histograms as the cursor scans the values in sorted order; each child's histogram is a (High, Low) count:

Cursor position   Left child  Right child  Gini
0: Age < 17       (0, 0)      (4, 2)       undef
1: Age < 20       (1, 0)      (3, 2)       0.4
3: Age < 32       (3, 0)      (1, 2)       0.222
6: end of list    (4, 2)      (0, 0)       undef
Finding Split Points: Categorical Attrib.
Consider splits of the form: value(A) ∈ {x1, x2, ..., xn}. Example: CarType ∈ {family, sports}.
Evaluate this split form for subsets of domain(A).
To evaluate splits on attribute A for a given tree node:
    initialize the class/value matrix of the node to zeroes;
    for each record in the attribute list do:
        increment the appropriate count in the matrix;
    evaluate the splitting index for various subsets using the constructed matrix;
Finding Split Points: Categorical Attrib.
Attribute list (Car Type):
Car Type  Risk  RID
family    High  0
sports    High  1
sports    High  2
family    Low   3
truck     Low   4
family    High  5

Class/value matrix:
         High  Low
family   2     1
sports   2     0
truck    0     1

Candidate splits, with (High, Low) histograms for the left and right child:
CarType in {family}: left (2, 1), right (2, 1) → Gini = 0.444
CarType in {sports}: left (2, 0), right (2, 2) → Gini = 0.333
CarType in {truck}:  left (0, 1), right (4, 1) → Gini = 0.267
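A sketch that builds the class/value matrix in one pass and then scores candidate subsets, reproducing the Gini values above:

```python
from collections import defaultdict
from itertools import combinations

def best_categorical_split(attr_list):
    """attr_list: (value, label, rid) tuples for a categorical attribute."""
    matrix = defaultdict(lambda: defaultdict(int))   # value -> class -> count
    for value, label, _ in attr_list:
        matrix[value][label] += 1
    n = len(attr_list)
    classes = {label for _, label, _ in attr_list}
    domain = sorted(matrix)

    def gini(hist):
        m = sum(hist.values())
        return 1 - sum((hist[c] / m) ** 2 for c in classes)

    best = (float("inf"), None)
    for size in range(1, len(domain)):               # proper, non-empty subsets
        for subset in combinations(domain, size):
            left = {c: sum(matrix[v][c] for v in subset) for c in classes}
            right = {c: sum(matrix[v][c] for v in domain if v not in subset)
                     for c in classes}
            score = (sum(left.values()) / n) * gini(left) \
                  + (sum(right.values()) / n) * gini(right)
            best = min(best, (score, subset))
    return best  # Gini 0.267 for the {truck} split (or its complement) above
```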
Performing the Splits
The attribute lists of every node must be divided among the two children.
To split the attribute lists of a given node:
    for the list of the attribute used to split this node:
        use the split test to divide the records; collect the record ids;
    build a hash table from the collected ids;
    for the remaining attribute lists:
        use the hash table to divide each list;
    build class histograms for each new leaf;
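A sketch of this step, with the hash table as a Python dict from record id to child; the attribute-list layout follows the (value, label, rid) tuples used earlier:

```python
def split_node_lists(lists, split_attr, test):
    """lists: dict attr -> list of (value, label, rid) entries."""
    # 1. Divide the splitting attribute's own list; collect record ids.
    side = {rid: ("L" if test(value) else "R")
            for value, _, rid in lists[split_attr]}   # the hash table
    # 2. Probe it to divide every attribute list, preserving sort order.
    left  = {a: [e for e in entries if side[e[2]] == "L"]
             for a, entries in lists.items()}
    right = {a: [e for e in entries if side[e[2]] == "R"]
             for a, entries in lists.items()}
    return left, right

# e.g. split_node_lists({"Age": age_list, "CarType": car_list},
#                       "Age", lambda v: v < 32)
```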
Performing the Splits: Example
Split test: Age < 32

Parent Age list:
Age  Risk  RID
17   High  1
20   High  5
23   High  0
32   Low   4
43   High  2
68   Low   3

Parent Car Type list:
Car Type  Risk  RID
family    High  0
sports    High  1
sports    High  2
family    Low   3
truck     Low   4
family    High  5

Hash table (RID → child): 0 Left, 1 Left, 2 Right, 3 Right, 4 Right, 5 Left

Left child Age list:
Age  Risk  RID
17   High  1
20   High  5
23   High  0

Left child Car Type list:
Car Type  Risk  RID
family    High  0
sports    High  1
family    High  5

Right child Age list:
Age  Risk  RID
32   Low   4
43   High  2
68   Low   3

Right child Car Type list:
Car Type  Risk  RID
sports    High  2
family    Low   3
truck     Low   4
Sprint: summary
Each node of the decision tree classifier requires examining possible splits on each value of each attribute.
After choosing a split attribute, all data must be partitioned into its subsets.
This search needs to be efficient.
Evaluating splits on numeric attributes: sort on attribute value, incrementally evaluate the Gini index.
Splits on categorical attributes: for each subset, find the Gini index and choose the best; for large sets, use a greedy method.
Approaches to prevent overfitting
1. Stop growing the tree beyond a certain point.
2. First over-fit, then post-prune (more widely used). Tree building is divided into two phases: a growth phase and a prune phase.
It is hard to decide when to stop growing the tree, so the second approach is more widely used.
Criteria for finding the correct final tree size:
Cross validation with separate test data.
Use all data for training, but apply a statistical test to decide the right size.
Use a criterion function to choose the best size, for example the minimum description length (MDL) criterion.

Cross validation approach:
Partition the dataset into two disjoint parts:
1. Training set, used for building the tree.
2. Validation set, used for pruning the tree.
Build the tree using the training set. Evaluate the tree on the validation set, and at each leaf and internal node keep a count of correctly labeled data.
Starting bottom-up, prune nodes whose error is less than that of their children.
Cross validation..
A large validation set is needed to smooth out the over-fittings of the training data. Rule of thumb: one-third.
What if the training data set size is limited? Generate many different partitions of the data.
n-fold cross validation: partition the training data into n parts D1, D2, ..., Dn. Train n classifiers, using D − Di as the training set and Di as the test set. Average the results.
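A minimal sketch of the partitioning and averaging; the evaluate argument, a train-and-score function, is hypothetical:

```python
def n_fold_partitions(records, n):
    """Split records into n disjoint parts D_1, ..., D_n."""
    return [records[i::n] for i in range(n)]

def cross_validate(records, n, evaluate):
    """Train on D - D_i, test on D_i, and average the n scores."""
    folds = n_fold_partitions(records, n)
    scores = [evaluate([r for j, f in enumerate(folds) if j != i for r in f],
                       folds[i])
              for i in range(n)]
    return sum(scores) / n
```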
Rule-based pruning
Tree-based pruning limits the kind of pruning possible: if a node is pruned, all subtrees under it have to be pruned.
Rule-based: for each leaf of the tree, extract a rule using a conjunction of all tests up to the root.
On the validation set, independently prune tests from each rule to get the highest accuracy for that rule.
Sort the rules by decreasing accuracy.
MDL-based pruning
Idea: a branch of the tree is over-fitted if the training examples that fall under it can be explicitly enumerated (with their classes) in less space than the tree occupies.
Prune a branch if it is over-fitted.
Philosophy: use the tree that minimizes the description length of the training data.
Regression trees
Decision trees with continuous class labels: regression trees approximate the function with piecewise constant regions.
Split criteria for regression trees:
Predicted value for a set S = the average of all values in S.
Error = the sum of the squared error of each member of S from the predicted average.
Pick the split with the smallest total squared error.
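A short sketch of this criterion: the prediction for a set is its mean, and a candidate split is scored by the summed squared error of its two sides:

```python
def sse(values):
    """Squared error of a set from its predicted value (the mean)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def regression_split_error(left_values, right_values):
    """Score of a candidate split; pick the split minimizing this."""
    return sse(left_values) + sse(right_values)
```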
Issues
Multiple splits on continuous attributes [Fayyad 93, multi-interval discretization of continuous attributes].
Multi-attribute tests on nodes to handle correlated attributes: multivariate linear splits [oblique trees, Murthy 94].
Methods of handling missing values: assume the majority value; take the most probable path.
Allowing varying costs for different attributes.
Pros and Cons of decision trees
� Cons Cannot handle complicated relationship between features simple decision boundaries problems with lots of missing data
� Pros+ Reasonable training time+ Fast application+ Easy to interpret+ Easy to implement+ Can handle large number of features
More information: http://www.recursive-partitioning.com/
Clustering or Unsupervised learning
Distance functions
Numeric data: Euclidean and Manhattan distances.
Minkowski metric: $d(x,y) = \left(\sum_i |x_i - y_i|^m\right)^{1/m}$; larger m gives higher weight to larger distances.
Categorical data: 0/1 to indicate presence/absence.
Euclidean distance: equal weight to 1-1 and 0-0 matches.
Hamming distance: the number of dissimilarities.
Jaccard coefficient: number of 1-1 matches divided by the number of positions where either vector has a 1 (0-0 matches are not important).
Combined numeric and categorical data: weighted normalized distance.
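Sketches of the two less standard measures mentioned here, assuming 0/1 vectors for Jaccard:

```python
def minkowski(x, y, m):
    """Minkowski metric; m = 1 gives Manhattan, m = 2 Euclidean."""
    return sum(abs(xi - yi) ** m for xi, yi in zip(x, y)) ** (1 / m)

def jaccard(a, b):
    """Jaccard coefficient on 0/1 vectors: 1-1 matches over positions
    with at least one 1 (0-0 matches carry no weight)."""
    both   = sum(1 for ai, bi in zip(a, b) if ai == bi == 1)
    either = sum(1 for ai, bi in zip(a, b) if ai == 1 or bi == 1)
    return both / either if either else 1.0
```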
Distance functions on high dimensional data
Examples: time series, text, images. Euclidean measures make all points equally far apart.
Reduce the number of dimensions:
choose a subset of the original features, using random projections or feature selection techniques;
transform the original features, using statistical methods like Principal Component Analysis.
Define domain-specific similarity measures: e.g., for images define features like number of objects or color histogram; for time series define shape-based measures.
Or define non-distance-based (model-based) clustering methods.
Clustering methods
Hierarchical clustering: agglomerative vs. divisive; single link vs. complete link.
Partitional clustering: distance-based (K-means), model-based (EM), density-based.
Partitional methods: K-means
Criterion: minimize the sum of squared distances, either between each point and the centroid of its cluster, or between each pair of points in the cluster.
Algorithm:
Select an initial partition with K clusters: random, the first K, or K well-separated points.
Repeat until stabilization:
    Assign each point to the closest cluster center
    Generate new cluster centers
    Adjust clusters by merging/splitting
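A compact sketch of the loop, assuming random initialization and omitting the merge/split adjustment step for brevity:

```python
import random

def kmeans(points, k, iters=100):
    """Basic K-means over a list of equal-length numeric tuples."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:   # assign each point to the closest cluster center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        new_centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
                       for j, cl in enumerate(clusters)]   # generate new centers
        if new_centers == centers:   # stabilized
            break
        centers = new_centers
    return centers, clusters
```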
Properties
May not reach the global optimum. Converges fast in practice; convergence is guaranteed for certain forms of the optimization function.
Complexity: O(KndI), where I is the number of iterations, n the number of points, d the number of dimensions, and K the number of clusters.
Database research on scalable algorithms:
Birch: one/two passes of the data, keeping an R-tree-like index in memory [Sigmod 96].
Model based clustering
Assume the data is generated from K probability distributions, typically Gaussian.
A soft, probabilistic version of K-means clustering.
Need to find the distribution parameters: the EM algorithm.
EM Algorithm
Initialize K cluster centers. Iterate between two steps.
Expectation step: assign points to clusters.
$$P(c_k \mid d_i) = \frac{\Pr(c_k)\,\Pr(d_i \mid c_k)}{\Pr(d_i)}, \qquad \Pr(d_i \mid c_k) = N(d_i;\ \mu_k, \Sigma_k)$$
Maximization step: estimate the model parameters, e.g. the means:
$$\mu_k = \frac{\sum_{i=1}^{m} d_i\,P(c_k \mid d_i)}{\sum_{i=1}^{m} P(c_k \mid d_i)}$$
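A one-dimensional sketch of these two steps, assuming equal fixed priors Pr(c_k) and scalar variances; a full implementation would also re-estimate the priors and use covariance matrices:

```python
import math, random

def em_gmm_1d(data, k, iters=50):
    """EM for a mixture of K one-dimensional Gaussians."""
    mu = random.sample(data, k)
    var = [1.0] * k
    for _ in range(iters):
        # E step: responsibilities P(c_k | d_i) from Pr(d_i | c_k), equal priors
        resp = []
        for d in data:
            w = [math.exp(-(d - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in range(k)]
            s = sum(w) or 1e-300
            resp.append([wj / s for wj in w])
        # M step: responsibility-weighted means and variances
        for j in range(k):
            w = [r[j] for r in resp]
            tot = sum(w) or 1e-300
            mu[j] = sum(wi * d for wi, d in zip(w, data)) / tot
            var[j] = sum(wi * (d - mu[j]) ** 2 for wi, d in zip(w, data)) / tot + 1e-9
    return mu, var
```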
Properties
May not reach the global optimum. Converges fast in practice; convergence is guaranteed for certain forms of the optimization function.
Complexity: O(KndI), where I is the number of iterations, n the number of points, d the number of dimensions, and K the number of clusters.
Scalable clustering algorithms
Birch: one/two passes of the data, keeping an R-tree-like index in memory [Sigmod 96].
Fayyad and Bradley: sample repetitively and update a summary of clusters stored in memory (K-means and EM) [KDD 98].
Dasgupta 99: recent theoretical breakthrough; finds Gaussian clusters with guaranteed performance, using random projections.
To Learn More
Books
Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999.
Usama Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
Tom Mitchell, Machine Learning, McGraw-Hill.
Software
Public domain:
Weka 3: data mining algorithms in Java (http://www.cs.waikato.ac.nz/~ml/weka); classification, regression.
MLC++: data mining tools in C++; mainly classification. Free for universities; try convincing IBM to give it for free!
Datasets: follow links from www.kdnuggets.com to the UC Irvine site.
Resources
http://www.kdnuggets.com: great site with links to software, datasets, etc. Be sure to visit it.
http://www.cs.bham.ac.uk/~anp/TheDataMine.html
OLAP: http://altaplana.com/olap/
SIGKDD: http://www.acm.org/sigkdd
Data Mining and Knowledge Discovery journal: http://www.research.microsoft.com/research/datamine/
Communications of the ACM, Special Issue on Data Mining, Nov 1996.
Resources at IITB
http://www.cse.iitb.ernet.in/~dbms: IITB DB group home page.
http://www.it.iitb.ernet.in/~sunita/it642: Data Warehousing and Data Mining, a course offered by Prof. Sunita Sarawagi at IITB.