Transcript of Data Mining in Large Databases
(Contributing Slides by Gregory Piatetsky-Shapiro and
Rajeev Rastogi, and Kyuseok Shim, Lucent Bell Laboratories)
Overview
Introduction, Association Rules, Classification, Clustering
Background
Corporations have huge databases containing a wealth of information
Business databases potentially constitute a goldmine of valuable business information
Very little functionality in database systems to support data mining applications
Data mining: The efficient discovery of previously unknown patterns in large databases
Applications
Fraud Detection, Loan and Credit Approval, Market Basket Analysis, Customer Segmentation, Financial Applications, E-Commerce, Decision Support, Web Search
Data Mining Techniques
Association Rules, Sequential Patterns, Classification, Clustering, Similar Time Sequences, Similar Images, Outlier Discovery, Text/Web Mining
Examples of Patterns
Association rules: 98% of people who purchase diapers buy beer
Classification: People with age less than 25 and salary > 40k drive sports cars
Similar time sequences: Stocks of companies A and B perform similarly
Outlier detection: Residential customers with businesses at home
Association Rules
Given: A database of customer transactions; each transaction is a set of items
Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
Any number of items may appear in the consequent or antecedent of a rule
It is possible to specify constraints on rules (e.g., find only rules involving expensive imported products)
Association Rules
Sample applications: Market basket analysis; attached mailing in direct marketing; fraud detection for medical insurance; department store floor/shelf planning
Confidence and Support
A rule must have some minimum user-specified confidence:
1 & 2 => 3 has 90% confidence if, when a customer bought 1 and 2, in 90% of cases the customer also bought 3.
A rule must have some minimum user-specified support:
1 & 2 => 3 should hold in some minimum percentage of transactions to have business value.
Example

Transaction Id  Purchased Items
1               {1, 2, 3}
2               {1, 4}
3               {1, 3}
4               {2, 5, 6}

For minimum support = 50% and minimum confidence = 50%, we have the following rules:
1 => 3 with 50% support and 66% confidence
3 => 1 with 50% support and 100% confidence
Problem Decomposition
1. Find all sets of items that have minimum support (use the Apriori algorithm)
2. Use the frequent itemsets to generate the desired rules (generation is straightforward)
Problem Decomposition - Example
TID  Items
1    {1, 2, 3}
2    {1, 3}
3    {1, 4}
4    {2, 5, 6}

Frequent Itemset  Support
{1}               75%
{2}               50%
{3}               50%
{1, 3}            50%

For minimum support = 50% and minimum confidence = 50%
For the rule 1 => 3:
  Support = Support({1, 3}) = 50%
  Confidence = Support({1, 3}) / Support({1}) = 66%
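To make the support and confidence arithmetic above concrete, here is a minimal Python sketch; the transactions are the four from the example, and the helper names are illustrative rather than taken from the slides.

transactions = [{1, 2, 3}, {1, 4}, {1, 3}, {2, 5, 6}]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Support of the whole rule divided by support of the antecedent
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({1, 3}, transactions))       # 0.5   -> 50% support for 1 => 3
print(confidence({1}, {3}, transactions))  # 0.666 -> 66% confidence for 1 => 3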
The Apriori Algorithm
Fk : set of frequent itemsets of size k
Ck : set of candidate itemsets of size k

F1 = {frequent 1-itemsets}
for (k = 1; Fk is not empty; k++) do {
    Ck+1 = new candidates generated from Fk
    foreach transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t
    Fk+1 = candidates in Ck+1 with minimum support
}
Answer = union over k of Fk
Key Observation
Every subset of a frequent itemset is also frequent
=> a candidate itemset in Ck+1 can be pruned if even one of its subsets is not contained in Fk
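The loop and the pruning rule above can be sketched in Python roughly as follows; this is a minimal illustration, assuming the transactions are given as sets of items, and the function and variable names are not from the original slides.

from itertools import combinations

def apriori(transactions, min_support):
    # Returns {frozenset(itemset): support count} for all frequent itemsets
    n = len(transactions)
    min_count = min_support * n

    # F1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    answer = dict(frequent)

    k = 1
    while frequent:
        # Candidate generation: join Fk with itself, keeping size-(k+1) sets,
        # and prune any candidate with a k-subset that is not frequent
        keys = list(frequent)
        candidates = set()
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                if len(union) == k + 1 and all(frozenset(sub) in frequent
                                               for sub in combinations(union, k)):
                    candidates.add(union)
        # One pass over the database to count the surviving candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        answer.update(frequent)
        k += 1
    return answer

# Transactions from the "Problem Decomposition - Example" slide, 50% minimum support
print(apriori([{1, 2, 3}, {1, 4}, {1, 3}, {2, 5, 6}], 0.5))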
Apriori - Example
Database D (TID: Items):
1: {1, 3, 4}
2: {2, 3, 5}
3: {1, 2, 3, 5}
4: {2, 5}

Scan D to count C1 (Itemset: Support): {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
F1 (candidates meeting minimum support): {2}: 3, {3}: 3, {5}: 3
C2 generated from F1: {2, 3}, {2, 5}, {3, 5}
Scan D to count C2: {2, 3}: 2, {2, 5}: 3, {3, 5}: 2
F2: {2, 5}: 3
Sequential Patterns
Given: A sequence of customer transactions; each transaction is a set of items
Find all maximal sequential patterns supported by more than a user-specified percentage of customers
Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction
Classification
Given: a database of tuples, each assigned a class label
Develop a model/profile for each class
Example profile (good credit): (25 <= age <= 40 and income > 40k) or (married = YES)
Sample applications: Credit card approval (good, bad); bank locations (good, fair, poor); treatment effectiveness (good, fair, poor)
Decision Tree
An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g., Color = red.
A leaf node represents a class label or a class label distribution.
At each node, one attribute is chosen to separate the training examples into classes as cleanly as possible.
A new case is classified by following a matching path to a leaf node.
Decision Trees
Outlook   Temperature  Humidity  Windy  Play?
sunny hot high false No
sunny hot high true No
overcast hot high false Yes
rain mild high false Yes
rain cool normal false Yes
rain cool normal true No
overcast cool normal true Yes
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No
Example Tree
Outlook
  sunny -> Humidity
    high -> No
    normal -> Yes
  overcast -> Yes
  rain -> Windy
    true -> No
    false -> Yes
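As a rough illustration of how a new case follows a matching path to a leaf, the example tree can be encoded as nested Python structures; the encoding below is a sketch, not something given in the slides.

# The example tree: internal nodes are (attribute, branches), leaves are class labels
tree = ("Outlook", {
    "sunny":    ("Humidity", {"high": "No", "normal": "Yes"}),
    "overcast": "Yes",
    "rain":     ("Windy",    {"true": "No", "false": "Yes"}),
})

def classify(node, case):
    # Follow the branch matching the case's attribute value until a leaf is reached
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[case[attribute]]
    return node

new_case = {"Outlook": "sunny", "Humidity": "normal", "Windy": "true"}
print(classify(tree, new_case))   # Yes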
Decision Tree Algorithms
Building phase: recursively split nodes using the best splitting attribute for the node
Pruning phase: prune leaf nodes recursively to prevent over-fitting; a smaller, imperfect decision tree generally achieves better accuracy
Attribute Selection
Which is the best attribute? The one that results in the smallest tree
Heuristic: choose the attribute that produces the "purest" nodes
Popular impurity criterion: information gain, which increases with the average purity of the subsets that an attribute produces
Strategy: choose the attribute that results in the greatest information gain
Which attribute to select?
Computing information
Information is measured in bits
Given a probability distribution, the info required to predict an event is the distribution’s entropy
Entropy gives the information required in bits (this can involve fractions of bits!)
Formula for computing the entropy:
entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn
Example: attribute “Outlook”
“Outlook” = “Sunny”: info([2,3]) = entropy(2/5, 3/5) = -(2/5) log(2/5) - (3/5) log(3/5) = 0.971 bits
“Outlook” = “Overcast”: info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits
“Outlook” = “Rainy”: info([3,2]) = entropy(3/5, 2/5) = -(3/5) log(3/5) - (2/5) log(2/5) = 0.971 bits
Expected information for the attribute: info([2,3], [4,0], [3,2]) = (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.693 bits
Computing the information gain
Information gain = (information before split) - (information after split)
Information gain for the attributes of the weather data:
gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits
gain("Temperature") = 0.029 bits
gain("Humidity") = 0.152 bits
gain("Windy") = 0.048 bits
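The same numbers can be reproduced with a short Python sketch; the data below is the 14-row weather table from the earlier slide, reduced to the (Outlook, Play?) columns, and the function names are illustrative.

from math import log2

data = [("sunny", "No"), ("sunny", "No"), ("overcast", "Yes"), ("rain", "Yes"),
        ("rain", "Yes"), ("rain", "No"), ("overcast", "Yes"), ("sunny", "No"),
        ("sunny", "Yes"), ("rain", "Yes"), ("sunny", "Yes"), ("overcast", "Yes"),
        ("overcast", "Yes"), ("rain", "No")]

def entropy(labels):
    # entropy(p1, ..., pn) = -sum(pi * log2(pi)) over the class proportions
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gain(data):
    # Information gain = info before the split - expected info after the split
    labels = [y for _, y in data]
    before = entropy(labels)                     # info([9,5]) = 0.940 bits
    after = 0.0
    for v in set(x for x, _ in data):
        subset = [y for x, y in data if x == v]
        after += len(subset) / len(data) * entropy(subset)
    return before - after

print(round(gain(data), 3))   # 0.247 bits for "Outlook"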
Continuing to split
Information gain for the remaining attributes within the "Outlook" = "Sunny" branch:
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits
The final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes.
Splitting stops when the data can't be split any further.
Decision Trees
Pros: fast execution time; generated rules are easy for humans to interpret; scale well for large data sets; can handle high-dimensional data
Cons: cannot capture correlations among attributes; consider only axis-parallel cuts
Clustering
Given: data points and the number of desired clusters K
Group the data points into K clusters so that points within a cluster are more similar to each other than to points in other clusters
Sample applications: customer segmentation; market basket customer analysis; attached mailing in direct marketing; clustering companies with similar growth
Traditional Algorithms
Partitional algorithms
Enumerate K partitions optimizing some criterion
Example: square-error criterion
E = sum over i = 1..k of sum over p in Ci of |p - mi|^2, where mi is the mean of cluster Ci
K-means Algorithm
Assign initial means
Assign each point to the cluster with the closest mean
Compute the new mean for each cluster
Iterate until the criterion function converges
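Below is a minimal Python sketch of this loop for 2-D points, together with the square-error criterion from the previous slide; the random seeding and the convergence test are illustrative choices, not prescribed by the slides.

import random

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def centroid(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def kmeans(points, k, iterations=100):
    means = random.sample(points, k)                 # assign initial means
    clusters = []
    for _ in range(iterations):
        # Assign each point to the cluster with the closest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, means[i]))].append(p)
        # Compute the new mean for each cluster
        new_means = [centroid(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:                       # criterion converged
            break
        means = new_means
    return means, clusters

def square_error(means, clusters):
    # E = sum over clusters Ci of sum over p in Ci of |p - mi|^2
    return sum(dist2(p, m) for m, c in zip(means, clusters) for p in c)

means, clusters = kmeans([(1, 1), (1, 2), (8, 8), (9, 8), (5, 1)], 2)
print(square_error(means, clusters))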
K-means example, step 1: Pick 3 initial cluster centers k1, k2, k3 (randomly)
K-means example, step 2: Assign each point to the closest cluster center
K-means example, step 3: Move each cluster center to the mean of its cluster
K-means example, step 4: Reassign the points that are now closest to a different cluster center (Q: which points are reassigned? A: three points)
K-means example, step 4b: Re-compute the cluster means
K-means example, step 5: Move the cluster centers to the cluster means
Discussion
The result can vary significantly depending on the initial choice of seeds
The algorithm can get trapped in a local minimum
Example: a poor placement of the initial cluster centers among the instances can leave the algorithm in a suboptimal clustering
To increase the chance of finding the global optimum: restart with different random seeds
K-means clustering summary
Advantages: simple and understandable; items are automatically assigned to clusters
Disadvantages: must pick the number of clusters beforehand; all items are forced into a cluster; too sensitive to outliers
Traditional Algorithms
Hierarchical clustering
Nested partitions organized in a tree structure
Agglomerative Hierarchical Algorithms
The most widely used hierarchical clustering algorithms
Initially, each point is a distinct cluster
Repeatedly merge the closest clusters until the number of clusters becomes K
Closest:
dmean(Ci, Cj) = |mi - mj|
dmin(Ci, Cj) = min over p in Ci, q in Cj of |p - q|
Likewise for dave(Ci, Cj) and dmax(Ci, Cj)
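A minimal Python sketch of the merge loop follows, using dmin (single-link) as the closeness measure; the cubic-time search over cluster pairs is for clarity, not efficiency, and the names are illustrative.

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def dmin(ci, cj):
    # Smallest point-to-point distance between the two clusters
    return min(dist2(p, q) for p in ci for q in cj)

def agglomerative(points, k):
    clusters = [[p] for p in points]   # initially each point is a distinct cluster
    while len(clusters) > k:
        # Find the pair of closest clusters and merge them
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: dmin(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (9, 0)], 3))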
Similar Time Sequences
Given: A set of time-series sequences
Find: all sequences similar to the query sequence; all pairs of similar sequences
Whole matching vs. subsequence matching
Sample applications: financial markets; scientific databases; medical diagnosis
Whole Sequence Matching
Basic Idea
Extract k features from every sequence
Every sequence is then represented as a point in k-dimensional space
Use a multi-dimensional index to store and search these points
Spatial indices do not work well for high-dimensional data
Similar Time Sequences
Take Euclidean distance as the similarity measure
Obtain the Discrete Fourier Transform (DFT) coefficients of each sequence in the database
Build a multi-dimensional index using the first few Fourier coefficients
Use the index to retrieve sequences that are at most a given distance ε away from the query sequence
Post-processing: compute the actual distance between the sequences in the time domain
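A rough Python sketch of this filter-and-refine idea is shown below; a real system would place the truncated coefficient vectors in a multi-dimensional index (e.g., an R-tree) rather than scanning a list, and the parameter names are illustrative.

import cmath

def dft(seq, k):
    # First k DFT coefficients, normalized so distances are preserved (Parseval)
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / n ** 0.5
            for f in range(k)]

def feature_dist2(a, b):
    return sum(abs(x - y) ** 2 for x, y in zip(a, b))

def euclidean2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def similar(query, database, eps, k=3):
    qf = dft(query, k)
    # Filter: distance in the truncated feature space never overestimates the
    # true distance, so no qualifying sequence is missed (false positives only)
    candidates = [s for s in database if feature_dist2(qf, dft(s, k)) <= eps ** 2]
    # Post-processing: compute the actual distance in the time domain
    return [s for s in candidates if euclidean2(query, s) <= eps ** 2]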
Outlier Discovery
Given: data points and the number of outliers (= n) to find
Find the top n outlier points: outliers are considerably dissimilar from the remainder of the data
Sample applications: credit card fraud detection; telecom fraud detection; medical analysis
Statistical Approaches
Model the underlying distribution that generates the dataset (e.g., a normal distribution)
Use discordancy tests, which depend on: the data distribution; the distribution parameters (e.g., mean, variance); the number of expected outliers
Drawbacks: most tests are for a single attribute; in many cases the data distribution may not be known
Distance-based Outliers
For a fraction p and a distance d, a point o is an outlier if at least a fraction p of the points lie at a distance greater than d from o
General enough to model statistical outlier tests
Nested-loop and cell-based algorithms have been developed
They scale reasonably well for large datasets
The cell-based algorithm does not scale well to high dimensions
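To make the definition above concrete, here is a minimal Python sketch of the nested-loop test; the point data and parameter values are illustrative.

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distance_outliers(points, p, d):
    # o is an outlier if at least a fraction p of all points lie farther than d from o
    outliers = []
    for o in points:
        far = sum(1 for q in points if dist2(o, q) > d * d)
        if far >= p * len(points):
            outliers.append(o)
    return outliers

print(distance_outliers([(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)], p=0.8, d=3))   # [(10, 10)]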