Introduction to Data Mining
Transcript of Introduction to Data Mining
An Introduction to Data Mining
Understanding Data Mining

- Data is growing at a phenomenal rate, almost doubling every year.
- Data is kept in files, but mostly in relational databases.
- Large operational databases typically serve OLTP applications, e.g. banking.
- This leaves a large pool of past historical data. Can we do something with it?
- Goal: find meaningful and interesting information in the data.
- Can standard SQL do it? No; we need different approaches and algorithms.
- Data mining is the answer.
Why Now?

- Data is being produced.
- Data is being warehoused (data warehouses).
- The computing power is available and affordable.
- The competitive pressures are strong.
- Commercial products are available.
Data Mining

- Credit ratings / targeted marketing: given a database of 100,000 names, which persons are the least likely to default on their credit cards? Identify likely responders to sales promotions.
- Fraud detection: which types of transactions are likely to be fraudulent, given the demographics and transaction history of a particular customer?
- Customer relationship management: which customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data mining helps extract such information.
What is Data Mining?

Extracting or mining knowledge from large data sets to find patterns that are:

- valid: hold on new data with some certainty
- novel: non-obvious to the system
- useful: it should be possible to act on them
- understandable: humans should be able to interpret the pattern

Also known as Knowledge Discovery in Databases (KDD).

Examples:

- Which items are purchased together in a retail store?
- Fraudulent usage of credit cards: detect purchases of extremely large amounts compared to regular purchases.
Applications

- Banking: loan/credit card approval. Predict good customers based on old customers.
- Customer relationship management: identify those who are likely to leave for a competitor.
- Targeted marketing: identify likely responders to promotions.
- Fraud detection (telecommunications, financial transactions): identify fraudulent events from an online stream of events.
- Manufacturing and production: automatically adjust knobs when process parameters change.
Applications (continued)

- Medicine: disease outcome, effectiveness of treatments. Analyze patient disease histories to find relationships between diseases.
- Molecular/pharmaceutical: identify new drugs.
- Scientific data analysis: identify new galaxies by searching for sub-clusters.
- Web site/store design and promotion: find the affinity of visitors to pages and modify the layout.
Data Mining versus KDD

- Knowledge Discovery in Databases (KDD) is the process of finding useful information and patterns in data.
- Data mining is the use of algorithms to extract the information and patterns within the KDD process.
- The two terms are often used interchangeably.
- KDD is a process that involves several steps.
The KDD Process

- Problem formulation
- Data collection
- Data cleaning: remove noise and inconsistent data
- Data integration: combine data from multiple sources
- Data selection: select data relevant to the mining task
- Data transformation: transform data (summarize, aggregate, or consolidate) into a form appropriate for mining
- Data mining: find interesting patterns
- Pattern evaluation: identify the truly interesting patterns
- Presentation: result evaluation and visualization (GUI)

Knowledge discovery is an iterative process.
Mining on What Kind of Data?

Large data volumes in:

- Relational databases
- Data warehouses
- Flat files
- The Web
- Transaction databases
- Object-oriented or object-relational databases
Data Mining Works with Warehouse Data

- Data warehousing provides the enterprise with a memory.
- Data mining provides the enterprise with intelligence.
Data Mining Algorithms

- Data mining involves different algorithms to accomplish different tasks.
- All these algorithms attempt to fit a model to the data.
- The algorithms examine the data and determine the model that is closest to the characteristics of the data being examined.
Some Basic Data Mining Tasks

Predictive: predict values of data using known results found from different (historical) data

- Regression
- Classification
- Time series analysis

Descriptive: identify patterns or relationships in the data

- Clustering / similarity matching
- Association rules and variants
- Summarization
- Sequence discovery
Supervised Learning vs. Unsupervised Learning

Supervised learning (classification)

- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
- New data is classified based on the training set.

Unsupervised learning (clustering)

- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Classification

- Maps data into predefined classes or groups.
- Example: given old data about customers and their payments, predict a new applicant's loan eligibility.

Previous customers, described by attributes such as Age, Salary, Profession, Location, and Customer type, are fed to a classifier, which learns decision rules (e.g., Salary > 5 L, Prof. = Exec). The rules are then applied to a new applicant's data to label the applicant good or bad.
Classification

Example I: general classification. Credit card companies must determine whether to authorize credit card purchases. Each purchase is placed in one of four categories:

- Authorize
- Ask for further identification before authorization
- Do not authorize
- Do not authorize, and contact the police
Classification

Example II: pattern recognition. An airport security screening station is used to determine whether passengers are potential criminals or terrorists.

- The face of each passenger is scanned and its basic pattern (distance between the eyes, size and shape of the mouth, shape of the head, etc.) is identified.
- The pattern is compared to the entries in a database to see whether it matches any patterns of known offenders.
Classification vs. Prediction

Classification

- Predicts categorical class labels (discrete or nominal).
- Constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data.

Prediction

- Models continuous-valued functions, i.e., predicts unknown or missing values.

Typical applications: credit approval, target marketing, medical diagnosis, fraud detection.
Model Construction (Process I)

Training data:

| Name | Rank           | Years | Tenured |
|------|----------------|-------|---------|
| Mike | Assistant Prof | 3     | no      |
| Mary | Assistant Prof | 7     | yes     |
| Bill | Professor      | 2     | yes     |
| Jim  | Associate Prof | 7     | yes     |
| Dave | Assistant Prof | 6     | no      |
| Anne | Associate Prof | 3     | no      |

A classification algorithm learns a classifier (model) such as:

    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Use the Model in Prediction (Process II)

Testing data:

| Name    | Rank           | Years | Tenured |
|---------|----------------|-------|---------|
| Tom     | Assistant Prof | 2     | no      |
| Merlisa | Associate Prof | 7     | no      |
| George  | Professor      | 5     | yes     |
| Joseph  | Assistant Prof | 7     | yes     |

The classifier is applied to unseen data, e.g. (Jeff, Professor, 4): tenured?
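The two-process flow above can be sketched in a few lines. This is a minimal illustration, not a full learner: the rule and both data sets come from the slides, and the function name is made up.

```python
# Process II sketch: apply the rule learned on the training slide
# (IF rank = 'professor' OR years > 6 THEN tenured = 'yes')
# to the testing data. Capitalization follows the tables.

test_data = [
    # (name, rank, years, actual tenured label)
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def predict_tenured(rank, years):
    """The rule from the model-construction slide."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Count how often the rule agrees with the held-out labels.
correct = sum(predict_tenured(rank, years) == actual
              for _, rank, years, actual in test_data)
print(correct, "of", len(test_data))    # -> 3 of 4 (Merlisa, 7 years but not tenured, is missed)

# The unseen instance from the slide:
print(predict_tenured("Professor", 4))  # Jeff -> "yes"
```

Note that the rule misclassifies Merlisa, which is why models are evaluated on testing data before use.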
Another example (figure)
Association Rules and Market Basket Analysis
What is Market Basket Analysis?

Customer analysis: market basket analysis uses the information about what a customer purchases to give us insight into who they are and why they make certain purchases.

Product analysis: market basket analysis gives us insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to purchase.
Market Basket Example

- Is soda typically purchased with bananas? Does the brand of soda make a difference?
- Where should detergents be placed in the store to maximize their sales?
- Are window-cleaning products purchased when detergents and orange juice are bought together?
- How do the demographics of the neighborhood affect what customers are buying?
Association Rules

There has been a considerable amount of research in the area of market basket analysis. Its appeal comes from the clarity and utility of its results, which are expressed in the form of association rules.

Given:

- a database of transactions,
- where each transaction contains a set of items,

find all rules X -> Y that correlate the presence of one set of items X with another set of items Y.

Example: when a customer buys bread and butter, they buy milk 85% of the time.
Results: Useful, Trivial, or Inexplicable?

While association rules are easy to understand, they are not always useful.

- Useful: on Fridays, convenience store customers often purchase diapers and beer together.
- Trivial: customers who purchase maintenance agreements are very likely to purchase large appliances.
- Inexplicable: when a new superstore opens, one of the most commonly sold items is light bulbs.
How Does It Work?

Grocery point-of-sale transactions:

| Customer | Items                              |
|----------|------------------------------------|
| 1        | Orange juice, soda                 |
| 2        | Milk, orange juice, window cleaner |
| 3        | Orange juice, detergent            |
| 4        | Orange juice, detergent, soda      |
| 5        | Window cleaner, soda               |

Co-occurrence of products:

|                | OJ | Window cleaner | Milk | Soda | Detergent |
|----------------|----|----------------|------|------|-----------|
| OJ             | 4  | 1              | 1    | 2    | 1         |
| Window cleaner | 1  | 2              | 1    | 1    | 0         |
| Milk           | 1  | 1              | 1    | 0    | 0         |
| Soda           | 2  | 1              | 0    | 3    | 1         |
| Detergent      | 1  | 0              | 0    | 1    | 2         |
How Does It Work? (continued)

The co-occurrence table contains some simple patterns:

- Orange juice and soda are more likely to be purchased together than any other two items.
- Detergent is never purchased with window cleaner or milk.
- Milk is never purchased with soda or detergent.

These simple observations are examples of associations and may suggest a formal rule like: IF a customer purchases soda, THEN the customer also purchases orange juice.
How Good Are the Rules?

- In the data, two of the five transactions include both soda and orange juice. These two transactions support the rule, so the support for the rule is two out of five, or 40%.
- Confidence asks how often the rule holds among the transactions that contain soda. Soda appears in three transactions, and two of those also contain orange juice, so the rule "IF soda, THEN orange juice" has a confidence of 2/3, or about 67%.
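The support and confidence arithmetic above can be checked directly against the five grocery transactions. This is a small sketch; the function names are illustrative, not a standard API.

```python
# Support/confidence check for the rule "soda -> orange juice",
# using the transactions from the "How Does It Work?" slide.

transactions = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing `lhs`, the fraction also containing `rhs`."""
    return support(lhs | rhs) / support(lhs)

soda, oj = {"soda"}, {"orange juice"}
print(support(soda | oj))    # 0.4 -> 40% support
print(confidence(soda, oj))  # 2/3 -> two of the three soda transactions contain OJ
```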
Confidence and Support: How Good Are the Rules?

- A rule must have some minimum user-specified confidence. The rule 1 & 2 -> 3 has 90% confidence if, when a customer bought items 1 and 2, in 90% of the cases the customer also bought item 3.
- A rule must have some minimum user-specified support. The rule 1 & 2 -> 3 should hold in some minimum percentage of transactions to have value.
Confidence and Support

| Transaction ID | Items      |
|----------------|------------|
| 1              | {1, 2, 3}  |
| 2              | {1, 3}     |
| 3              | {1, 4}     |
| 4              | {2, 5, 6}  |

| Frequent one-item set | Support |
|-----------------------|---------|
| {1}                   | 75%     |
| {2}                   | 50%     |
| {3}                   | 50%     |
| {4}                   | 25%     |

| Frequent two-item set | Support |
|-----------------------|---------|
| {1, 2}                | 25%     |
| {1, 3}                | 50%     |
| {1, 4}                | 25%     |
| {2, 3}                | 25%     |

For minimum support = 50% (2 transactions) and minimum confidence = 50%, consider the rule 1 -> 3:

- Support = Support({1, 3}) = 50%
- Confidence(1 -> 3) = Support({1, 3}) / Support({1}) = 66%
- Confidence(3 -> 1) = Support({1, 3}) / Support({3}) = 100%
Association Examples

- Find all rules that have "Diet Coke" as a result. These rules may help plan what the store should do to boost the sales of Diet Coke.
- Find all rules that have "Yogurt" in the condition. These rules may help determine what products may be impacted if the store discontinues selling yogurt.
- Find all rules that have "Brats" in the condition and "mustard" in the result. These rules may help in determining the additional items that have to be sold together to make it highly likely that mustard will also be sold.
- Find the best k rules that have "Yogurt" in the result.
Example: Minimum Support Pruning / Rule Generation

| Transaction ID | Items        |
|----------------|--------------|
| 1              | {1, 3, 4}    |
| 2              | {2, 3, 5}    |
| 3              | {1, 2, 3, 5} |
| 4              | {2, 5}       |

Scan the database and find the level of support for each item:

| Itemset | Support |
|---------|---------|
| {1}     | 2       |
| {2}     | 3       |
| {3}     | 3       |
| {4}     | 1       |
| {5}     | 3       |

Pruning with a minimum support of 3 leaves {2}, {3}, {5}, each with support 3. Find the pairings of the surviving items, scan the database again, and find each pair's level of support:

| Itemset | Support |
|---------|---------|
| {2, 3}  | 2       |
| {2, 5}  | 3       |
| {3, 5}  | 2       |

Only {2, 5}, with support 3, meets the minimum support. The two rules with the highest support for a two-item set are 2 -> 5 and 5 -> 2.
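The level-wise pruning above can be sketched as follows: count single items, drop those below the minimum support, then count only the pairs built from the survivors. This is an illustrative Apriori-style fragment using the slide's data, not a production implementation.

```python
# Minimum-support pruning on the slide's four transactions.
from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_support = 3  # absolute transaction count, as on the slide

def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

# Level 1: frequent single items.
items = sorted(set().union(*transactions))
frequent1 = [i for i in items if count({i}) >= min_support]
print(frequent1)      # [2, 3, 5]

# Level 2: candidate pairs from the survivors only, pruned by support.
frequent2 = [set(p) for p in combinations(frequent1, 2)
             if count(set(p)) >= min_support]
print(frequent2)      # [{2, 5}]
```

Building level-k candidates only from level-(k-1) survivors is the key pruning idea: any superset of an infrequent itemset is itself infrequent.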
Other Association Rule Applications

- Quantitative association rules: Age[35..40] and Married[Yes] -> NumCars[2]
- Association rules with constraints: find all association rules where the prices of items are > 100 dollars.
- Temporal association rules: Diaper -> Beer (1% support, 80% confidence) may become Diaper -> Beer (20% support) between 7:00 and 9:00 PM on weekdays.
- Optimized association rules: given a rule (l < A < u) and X -> Y, find values for l and u such that the support is greater than a certain threshold and the support and confidence are maximized. Example: Checking Balance [$30,000 .. $50,000] -> Certificate of Deposit (CD) = Yes
Classification by Decision Tree Learning

- A classic machine learning / data mining problem.
- Develop rules for when a transaction belongs to a class, based on its attribute values.
- Smaller decision trees are better.
- ID3 is one particular algorithm.
A Database (Training Dataset)

| Age   | Income | Student | Credit_Rating | Buys_Computer |
|-------|--------|---------|---------------|---------------|
| <=30  | High   | No      | Fair          | No            |
| <=30  | High   | No      | Excellent     | No            |
| 31…40 | High   | No      | Fair          | Yes           |
| >40   | Medium | No      | Fair          | Yes           |
| >40   | Low    | Yes     | Fair          | Yes           |
| >40   | Low    | Yes     | Excellent     | No            |
| 31…40 | Low    | Yes     | Excellent     | Yes           |
| <=30  | Medium | No      | Fair          | No            |
| <=30  | Low    | Yes     | Fair          | Yes           |
| >40   | Medium | Yes     | Fair          | Yes           |
| <=30  | Medium | Yes     | Excellent     | Yes           |
| 31…40 | Medium | No      | Excellent     | Yes           |
| 31…40 | High   | Yes     | Fair          | Yes           |
| >40   | Medium | No      | Excellent     | No            |
Output: A Decision Tree

- Age <= 30: test Student
  - Student = no: No
  - Student = yes: Yes
- Age 31..40: Yes
- Age > 40: test Credit rating
  - Credit rating = excellent: No
  - Credit rating = fair: Yes
Algorithm: Decision Tree Induction

Basic algorithm (a greedy algorithm):

- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning:

- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf).
- There are no samples left.
Different Possibilities for Partitioning Tuples Based on Splitting Criterion (figure)
Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain.
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
- Expected information (entropy) needed to classify a tuple in D:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

- Information needed (after using A to split D into v partitions) to classify D:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

- Information gained by branching on attribute A:

$$Gain(A) = Info(D) - Info_A(D)$$
Attribute Selection: Information Gain

Class P: buys_computer = "Yes" (9 tuples); class N: buys_computer = "No" (5 tuples). For the training dataset above:

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

For the attribute age:

| age   | p_i | n_i | I(p_i, n_i) |
|-------|-----|-----|-------------|
| <=30  | 2   | 3   | 0.971       |
| 31…40 | 4   | 0   | 0           |
| >40   | 3   | 2   | 0.971       |

Here the term (5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
Attribute age has the highest information gain.
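The information-gain numbers above can be recomputed from the buys_computer table with a short script. This is a sketch using only the standard library; the slide's 0.246 and 0.151 are the same quantities truncated rather than rounded.

```python
# Recompute Info(D) and Gain(A) for each attribute of the
# buys_computer training table.
from math import log2

# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(rows):
    """Info(D): expected bits needed to classify a tuple in `rows`."""
    labels = [r[-1] for r in rows]
    probs = [labels.count(c) / len(labels) for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def gain(rows, col):
    """Gain(A) = Info(D) - Info_A(D) for the attribute in column `col`."""
    values = set(r[col] for r in rows)
    info_a = sum(len(part) / len(rows) * entropy(part)
                 for v in values
                 for part in [[r for r in rows if r[col] == v]])
    return entropy(rows) - info_a

for name, col in [("age", 0), ("income", 1), ("student", 2), ("credit_rating", 3)]:
    print(name, round(gain(data, col), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
```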
Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute. We must determine the best split point for A:

- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
- The point with the minimum expected information requirement for A is selected as the split point for A.

Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
Gain Ratio for Attribute Selection (C4.5)

- The information gain measure is biased towards attributes with a large number of values.
- C4.5 (a successor of ID3) uses the gain ratio to overcome the problem (a normalization of information gain):

$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)$$

$$GainRatio(A) = Gain(A) / SplitInfo_A(D)$$

Example, for income (4 low, 6 medium, 4 high):

$$SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557$$

so gain_ratio(income) = 0.029 / 1.557 = 0.019. The attribute with the maximum gain ratio is selected as the splitting attribute.
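The split-info arithmetic can be checked numerically. A small sketch, assuming the partition sizes 4/6/4 for income and the Gain(income) = 0.029 figure from the earlier slide:

```python
# SplitInfo and GainRatio for income (4 "low", 6 "medium", 4 "high").
from math import log2

def split_info(part_sizes):
    """SplitInfo_A(D) for a partition of D into the given sizes."""
    total = sum(part_sizes)
    return -sum(s / total * log2(s / total) for s in part_sizes)

si = split_info([4, 6, 4])
print(round(si, 3))          # 1.557
print(round(0.029 / si, 3))  # GainRatio(income) ~ 0.019
```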
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the gini index gini(D) is defined as

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

where p_j is the relative frequency of class j in D. If D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

$$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$

Reduction in impurity:

$$\Delta gini(A) = gini(D) - gini_A(D)$$

The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node. (All possible splitting points must be enumerated for each attribute.)
Gini Index (CART, IBM IntelligentMiner): Example

D has 9 tuples with buys_computer = "yes" and 5 with "no":

$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$

Suppose the attribute income partitions D into D1 = {low, medium} with 10 tuples and D2 = {high} with 4 tuples:

$$gini_{income \in \{low,\,medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443$$

This is the lowest of the three binary splits on income ({low, high}: 0.458; {medium, high}: 0.450) and is thus the best.

- All attributes are assumed continuous-valued.
- Other tools, e.g. clustering, may be needed to get the possible split values.
- The measure can be modified for categorical attributes.
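These gini values can be verified with a few lines, assuming the class counts implied by the buys_computer table (9 yes / 5 no overall; {low, medium} holds 7 yes / 3 no and {high} holds 2 yes / 2 no):

```python
# Numeric check of the gini calculations on this slide.

def gini(counts):
    """gini = 1 - sum(p_j^2) for the class counts in one node."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(groups):
    """Weighted gini of a partition; `groups` is a list of class-count lists."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

print(round(gini([9, 5]), 3))                  # gini(D) = 0.459
# income split into {low, medium} (7 yes, 3 no) vs {high} (2 yes, 2 no):
print(round(gini_split([[7, 3], [2, 2]]), 3))  # 0.443
```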
Comparing Attribute Selection Measures

The three measures, in general, return good results, but:

- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions.
Regression

- Used to map a data item to a real-valued prediction variable.
- Regression involves learning the function that does this mapping.
- Fits some known type of function: linear or some other polynomial.
- Goal: predict a class C_i = f(x1, x2, ..., xn); e.g. linear regression: a*x1 + b*x2 + c = C_i.
Regression

Example: a person wishes to reach a certain level of savings before retirement. He wants to predict his savings value based on its current value and several past values. He uses a simple linear regression formula to predict this value by fitting past behavior to a linear function, and then uses this function to predict the value at any point in time.
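The savings example boils down to ordinary least squares on past values. A minimal sketch; the yearly balances below are made up for illustration:

```python
# Fit y = a*x + b to past yearly savings and extrapolate.

years = [0, 1, 2, 3, 4]
savings = [10.0, 12.5, 14.0, 16.5, 19.0]  # hypothetical balances

n = len(years)
mean_x = sum(years) / n
mean_y = sum(savings) / n

# Least-squares slope and intercept.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, savings)) \
    / sum((x - mean_x) ** 2 for x in years)
b = mean_y - a * mean_x

print(round(a, 2), round(b, 2))  # 2.2 10.0 -> yearly growth and starting level
print(round(a * 10 + b, 2))      # 32.0 -> predicted savings at year 10
```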
Time Series Analysis

- The value of an attribute is examined as it varies over time.
- Evenly spaced time points: daily, weekly, hourly, etc.
- Three basic functions in time series analysis:
  - Distance measures are used to determine the similarity between time series.
  - The structure of the line is examined to determine its behavior.
  - Historical time series plots are used to predict future values.
- Application: stock market analysis, e.g. whether or not to buy a stock.
Clustering: Unsupervised Learning

- Similar to classification, except that the groups are not predefined; they are defined by the data alone.
- The most similar data are grouped into the same cluster.
- Dissimilar data should be in different clusters.
Clustering Examples

- Segment a customer database based on similar buying patterns.
- Group houses in a town into neighborhoods based on similar features.
- Identify new plant species.
- Identify similar Web usage patterns.
Clustering

- Unsupervised learning is used when old data with class labels is not available, e.g. when introducing a new product.
- Group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster.
- Key requirement: a good measure of similarity between instances.
- Identify micro-markets and develop policies for each.
Clustering Example (figure)

Clustering Houses (figure): geographic distance based vs. size based.
Clustering Issues

- Outlier handling
- Dynamic data
- Interpreting results
- Evaluating results
- Number of clusters
- Data to be used
- Scalability
Impact of Outliers on Clustering (figure)
Clustering Problem

- Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the clustering problem is to define a mapping f: D -> {1, ..., k} where each ti is assigned to one cluster Kj, 1 <= j <= k.
- A cluster Kj contains precisely those tuples mapped to it.
- Unlike the classification problem, the clusters are not known a priori.
Types of Clustering

- Hierarchical: a nested set of clusters is created.
- Partitional: one set of clusters is created.
- Incremental: each element is handled one at a time.
- Simultaneous: all elements are handled together.
- Overlapping / non-overlapping.
Agglomerative Example

Distance matrix:

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0 | 1 | 2 | 2 | 3 |
| B | 1 | 0 | 2 | 4 | 3 |
| C | 2 | 2 | 0 | 1 | 5 |
| D | 2 | 4 | 1 | 0 | 3 |
| E | 3 | 3 | 5 | 3 | 0 |

Dendrogram (figure): at a threshold of 1, A merges with B and C merges with D; at a threshold of 2, {A, B} joins {C, D}; at a threshold of 3, E joins the rest.
Partitional Methods: K-means

Criterion: minimize the sum of squared distances

- between each point and the centroid of its cluster, or
- between each pair of points in the cluster.

Algorithm:

- Select an initial partition with K clusters: random, the first K points, or K well-separated points.
- Repeat until stabilization:
  - Assign each point to its closest cluster center.
  - Generate new cluster centers.
  - Adjust clusters by merging/splitting.
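The assign/recompute loop above can be sketched in one dimension. The points and K = 2 are illustrative, and for brevity the sketch does not handle empty clusters:

```python
# Minimal 1-D K-means: assign each point to the closest center,
# move each center to its cluster mean, repeat until stable.

points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centers = [points[0], points[1]]  # "first K" initialization

while True:
    # Assignment step: group points by their closest center.
    clusters = [[] for _ in centers]
    for p in points:
        i = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
        clusters[i].append(p)
    # Update step: each center moves to the mean of its cluster.
    new_centers = [sum(c) / len(c) for c in clusters]
    if new_centers == centers:    # stabilization
        break
    centers = new_centers

print(sorted(centers))            # [1.5, 9.0]
```

Despite the poor initialization (both starting centers sit in the left group), the loop converges to the two natural groups in a few iterations.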
Collaborative Filtering

- Given a database of user preferences, predict the preferences of a new user.
- Example: predict which new movies you will like based on your past preferences, others with similar past preferences, and their preferences for the new movies.
- Example: predict which books/CDs a person may want to buy (and suggest them, or give discounts to tempt the customer).
Mining Market

- Around 20 to 30 mining tool vendors.
- Major tool players: Clementine, IBM's Intelligent Miner, SGI's MineSet, SAS's Enterprise Miner.
- All offer pretty much the same set of tools.
- Many embedded products: fraud detection, electronic commerce applications, health care, customer relationship management (Epiphany).
Large-scale Endeavors

Products (table/figure): vendors such as SAS, SPSS, Oracle (Darwin), IBM, and DBMiner (Simon Fraser) offer tools across the clustering, classification, association, sequence, and deviation tasks (e.g., decision trees, ANNs, time series).
Vertical Integration: Mining on the Web

- Web log analysis for site design: which pages are popular, which links are hard to find.
- Electronic store sales enhancements, recommendations, and advertisement:
  - Collaborative filtering: Net Perception, WiseWire.
  - Inventory control: what was a shopper looking for and could not find?
Some Success Stories

- Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data. Won over a (manual) knowledge-engineering approach. http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good detailed description of the entire process.
- Major US bank: customer attrition prediction. Customers were first segmented based on financial behavior (three segments were found), then attrition models were built for each of the three segments; 40-50% of attritions were predicted, a factor-of-18 increase.
- Targeted credit marketing: major US banks find customer segments based on 13 months of credit balances, then build another response model based on surveys; response increased 4 times, to 2%.
Relationship with Other Fields

Data mining overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with:

- more stress on scalability in the number of features and instances;
- stress on algorithms and architectures, whereas the foundations of the methods and formulations are provided by statistics and machine learning;
- automation for handling large, heterogeneous data.
The Future

- RDBMSs and SQL are the two milestones in the evolution of database systems.
- Currently, data mining is little more than a set of tools that can be used to uncover previously hidden information in databases.
- Many tools are available, but there is no all-encompassing model or approach.
- The future is to create all-encompassing tools that are better integrated and require less human interaction and human interpretation.
- A major development could be the creation of a sophisticated query language that includes normal SQL and complicated OLAP functions. DMQL is a step in that direction.
Thank You