Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.
-
Upload
derrick-andrews -
Category
Documents
-
view
227 -
download
0
Transcript of Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.
![Page 1: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/1.jpg)
Data Mining: Introduction
Based on slides from:Tan, Steinbach, Kumar
![Page 2: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/2.jpg)
Lots of data is being collected and warehoused Web data, e-commerce purchases at department/
grocery stores Bank/Credit Card
transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong Provide better, customized services for an edge
(e.g. in Customer Relationship Management)
Why Mine Data? Commercial Viewpoint
![Page 3: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/3.jpg)
Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene expression data
scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data Data mining may help scientists
in classifying and segmenting data in Hypothesis Formation
![Page 4: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/4.jpg)
Mining Large Data Sets - Motivation
There is often information “hidden” in the data that is not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
0
500,000
1,000,000
1,500,000
2,000,000
2,500,000
3,000,000
3,500,000
4,000,000
1995 1996 1997 1998 1999
The Data Gap
Total new disk (TB) since 1995
Number of analysts
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
![Page 5: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/5.jpg)
What is Data Mining?Many Definitions
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
![Page 6: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/6.jpg)
Ramakrishnan and Gehrke. Database Management Systems, 3rd Edition.
Preprocessing and Mining
Original Data
TargetData
PreprocessedData
PatternsKnowledge
DataIntegrationand Selection
Preprocessing
ModelConstruction
Interpretation
![Page 7: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/7.jpg)
What is (not) Data Mining?
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,)
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
![Page 8: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/8.jpg)
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional Techniquesmay be unsuitable due to Enormity of data High dimensionality
of data Heterogeneous,
distributed nature of data
Origins of Data Mining
Machine Learning/Pattern
Recognition
Statistics/AI
Data Mining
Database systems
![Page 9: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/9.jpg)
Data Mining Tasks
Prediction Methods Use some variables to predict unknown
or future values of other variables.
Description Methods Find human-interpretable patterns that
describe the data.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
![Page 10: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/10.jpg)
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
![Page 11: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/11.jpg)
Classification: Definition Given a collection of records (training set )
Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
![Page 12: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/12.jpg)
Classification Example
Tid Refund MaritalStatus
TaxableIncome Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes10
cate
gorica
l
cate
gorica
l
contin
uous
class
Refund MaritalStatus
TaxableIncome Cheat
No Single 75K ?
Yes Married 50K ?
No Married 150K ?
Yes Divorced 90K ?
No Single 40K ?
No Married 80K ?10
TestSet
Training Set
ModelLearn
Classifier
![Page 13: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/13.jpg)
Classification: Application 1
Direct Marketing Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product. Approach:
• Use the data for a similar product introduced before. • We know which customers decided to buy and which
decided otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997
![Page 14: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/14.jpg)
Classification: Application 2
Fraud Detection Goal: Predict fraudulent cases in credit card
transactions. Approach:
• Use credit card transactions and the information on its account-holder as attributes.
– When does a customer buy, what does he buy, how often he pays on time, etc
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.• Use this model to detect fraud by observing credit
card transactions on an account.
![Page 15: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/15.jpg)
Classification: Application 3
Customer Attrition/Churn: Goal: To predict whether a customer is
likely to be lost to a competitor. Approach:
• Use detailed record of transactions with each of the past and present customers, to find attributes.
– How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.• Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
![Page 16: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/16.jpg)
Classification: Application 4
Sky Survey Cataloging Goal: To predict class (star or galaxy) of sky
objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
Approach:• Segment the image. • Measure image attributes (features) - 40 of them per
object.• Model the class based on these features.• Success Story: Could find 16 new high red-shift
quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
![Page 17: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/17.jpg)
Classifying GalaxiesEarly
Intermediate
Late
Data Size: • 72 million stars, 20 million galaxies• Object Catalog: 9 GB• Image Database: 150 GB
Class: • Stages of Formation
Attributes:• Image features, • Characteristics of light
waves received, etc.
Courtesy: http://aps.umn.edu
![Page 18: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/18.jpg)
Clustering Definition Given a set of data points, each
having a set of attributes, and a similarity measure among them, find clusters such that Data points in one cluster are more
similar to one another. Data points in separate clusters are less
similar to one another. Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
![Page 19: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/19.jpg)
Illustrating ClusteringEuclidean Distance Based Clustering in 3-D space.
Intracluster distancesare minimized
Intracluster distancesare minimized
Intercluster distancesare maximized
Intercluster distancesare maximized
![Page 20: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/20.jpg)
Clustering: Application 1 Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach: • Collect different attributes of customers based on
their geographical and lifestyle related information.• Find clusters of similar customers.• Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from different clusters.
![Page 21: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/21.jpg)
Clustering: Application 2
Document Clustering: Goal: To find groups of documents that
are similar to each other based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
![Page 22: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/22.jpg)
Illustrating Document Clustering
Clustering Points: 3204 Articles of Los Angeles Times. Similarity Measure: How many words are common in
these documents (after some word filtering).
Category TotalArticles
CorrectlyPlaced
Financial 555 364
Foreign 341 260
National 273 36
Metro 943 746
Sports 738 573
Entertainment 354 278
![Page 23: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/23.jpg)
Clustering: S&P 500 Stock Data
Discovered Clusters Industry Group
1Applied-Matl-DOW N,Bay-Network-Down,3-COM-DOWN,
Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,DSC-Comm-DOW N,INTEL-DOWN,LSI-Logic-DOWN,
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOW N,
Sun-DOW N
Technology1-DOWN
2Apple-Comp-DOW N,Autodesk-DOWN,DEC-DOWN,
ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,Computer-Assoc-DOWN,Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,Motorola-DOW N,Microsoft-DOWN,Scientific-Atl-DOWN
Technology2-DOWN
3Fannie-Mae-DOWN,Fed-Home-Loan-DOW N,MBNA-Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN
4Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,Schlumberger-UP
Oil-UP
Observe Stock Movements every day. Clustering points: Stock-{UP/DOWN} Similarity Measure: Two points are more similar if the
events described by them frequently happen together on the same day.
We used association rules to quantify a similarity measure.
![Page 24: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/24.jpg)
Association Rule Discovery: Definition
Given a set of records each of which contain some number of items from a given collection; Produce dependency rules which will predict
occurrence of an item based on occurrences of other items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}
Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}
![Page 25: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/25.jpg)
Association Rule Discovery: Application 1
Marketing and Sales Promotion: Let the rule discovered be {Bagels, … } --> {Potato Chips} Potato Chips as consequent => Can be used to
determine what should be done to boost its sales. Bagels in the antecedent => Can be used to see
which products would be affected if the store discontinues selling bagels.
Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!
![Page 26: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/26.jpg)
Association Rule Discovery: Application 2
Supermarket shelf management. Goal: To identify items that are bought
together by sufficiently many customers. Approach: Process the point-of-sale data
collected with barcode scanners to find dependencies among items.
A classic rule --• If a customer buys diaper and milk, then he is
very likely to buy beer.• So, don’t be surprised if you find six-packs
stacked next to diapers!
![Page 27: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/27.jpg)
Association Rule Discovery: Application 3
Inventory Management: Goal: A consumer appliance repair company
wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
![Page 28: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/28.jpg)
Sequential Pattern Discovery: Definition
Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
Rules are formed by first disovering patterns. Event occurrences in the patterns are governed by timing constraints.
(A B) (C) (D E)
<= ms
<= xg >ng <= ws
(A B) (C) (D E)
![Page 29: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/29.jpg)
Sequential Pattern Discovery: Examples
In telecommunications alarm logs, (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
In point-of-sale transaction sequences, Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk)
Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
![Page 30: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/30.jpg)
Regression Predict a value of a given continuous valued
variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Greatly studied in statistics, neural network fields. Examples:
Predicting sales amounts of new product based on advetising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
![Page 31: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/31.jpg)
Deviation/Anomaly Detection Detect significant deviations from
normal behavior Applications:
Credit Card Fraud Detection
Network Intrusion Detection
Typical network traffic at University level may reach over 100 million connections per day
![Page 32: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/32.jpg)
Challenges of Data Mining
Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation Streaming Data
![Page 33: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/33.jpg)
Data Mining Classification: Basic Concepts,
Decision Trees, and Model Evaluation
Tan, Steinbach, Kumar
© Tan,Steinbach, Kumar Introduction to Data Mining
4/18/2004 33
![Page 34: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/34.jpg)
Classification: Definition Given a collection of records (training set )
Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
![Page 35: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/35.jpg)
Illustrating Classification Task
Apply
Model
Induction
Deduction
Learn
Model
Model
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes 10
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ? 10
Test Set
Learningalgorithm
Training Set
![Page 36: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/36.jpg)
Examples of Classification Task
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc
![Page 37: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/37.jpg)
Classification Techniques
Decision Tree based Methods Rule-based Methods Memory based reasoning Neural Networks Naïve Bayes and Bayesian Belief
Networks Support Vector Machines
![Page 38: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/38.jpg)
Example of a Decision Tree
Tid Refund MaritalStatus
TaxableIncome Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes10
cate
gorica
l
cate
gorica
l
contin
uous
class
Refund
MarSt
TaxInc
YESNO
NO
NO
Yes No
Married Single, Divorced
< 80K > 80K
Splitting Attributes
Training Data Model: Decision Tree
![Page 39: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/39.jpg)
Another Example of Decision Tree
Tid Refund MaritalStatus
TaxableIncome Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes10
cate
gorica
l
cate
gorica
l
contin
uous
class
MarSt
Refund
TaxInc
YESNO
NO
NO
Yes No
Married Single,
Divorced
< 80K > 80K
There could be more than one tree that fits the same data!
![Page 40: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/40.jpg)
Decision Tree Classification Task
Apply
Model
Induction
Deduction
Learn
Model
Model
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes 10
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ? 10
Test Set
TreeInductionalgorithm
Training Set
Decision Tree
![Page 41: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/41.jpg)
Apply Model to Test Data
Refund
MarSt
TaxInc
YESNO
NO
NO
Yes No
Married Single, Divorced
< 80K > 80K
Refund Marital Status
Taxable Income Cheat
No Married 80K ? 10
Test DataStart from the root of
tree.
![Page 42: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/42.jpg)
Apply Model to Test Data
Refund
MarSt
TaxInc
YESNO
NO
NO
Yes No
Married Single, Divorced
< 80K > 80K
Refund Marital Status
Taxable Income Cheat
No Married 80K ? 10
Test Data
![Page 43: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/43.jpg)
Apply Model to Test Data
Refund
MarSt
TaxInc
YESNO
NO
NO
Yes No
Married Single, Divorced
< 80K > 80K
Refund Marital Status
Taxable Income Cheat
No Married 80K ? 10
Test Data
![Page 44: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/44.jpg)
Apply Model to Test Data
Refund
MarSt
TaxInc
YESNO
NO
NO
Yes No
Married Single, Divorced
< 80K > 80K
Refund Marital Status
Taxable Income Cheat
No Married 80K ? 10
Test Data
![Page 45: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/45.jpg)
Apply Model to Test Data
Refund
MarSt
TaxInc
YESNO
NO
NO
Yes No
Married Single, Divorced
< 80K > 80K
Refund Marital Status
Taxable Income Cheat
No Married 80K ? 10
Test Data
![Page 46: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/46.jpg)
Apply Model to Test Data
Refund
MarSt
TaxInc
YESNO
NO
NO
Yes No
Married Single, Divorced
< 80K > 80K
Refund Marital Status
Taxable Income Cheat
No Married 80K ? 10
Test Data
Assign Cheat to “No”
![Page 47: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/47.jpg)
Decision Tree Classification Task
Apply
Model
Induction
Deduction
Learn
Model
Model
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes 10
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ? 10
Test Set
TreeInductionalgorithm
Training Set
Decision Tree
![Page 48: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/48.jpg)
Decision Tree Induction
Many Algorithms: Hunt’s Algorithm (one of the earliest) CART ID3, C4.5 SLIQ,SPRINT
![Page 49: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/49.jpg)
General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t
General Procedure: If Dt contains records that
belong the same class yt, then t is a leaf node labeled as yt
If Dt is an empty set, then t is a leaf node labeled by the default class, yd
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
Tid Refund Marital Status
Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes 10
Dt
?
![Page 50: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/50.jpg)
Hunt’s Algorithm
Don’t Cheat
Refund
Don’t Cheat
Don’t Cheat
Yes No
Refund
Don’t Cheat
Yes No
MaritalStatus
Don’t Cheat
Cheat
Single,Divorced
Married
TaxableIncome
Don’t Cheat
< 80K >= 80K
Refund
Don’t Cheat
Yes No
MaritalStatus
Don’t Cheat
Cheat
Single,Divorced
Married
![Page 51: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/51.jpg)
Tree Induction
Greedy strategy. Split the records based on an attribute test
that optimizes certain criterion.
Issues Determine how to split the records
• How to specify the attribute test condition?• How to determine the best split?
Determine when to stop splitting
![Page 52: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/52.jpg)
Tree Induction
Greedy strategy. Split the records based on an attribute test
that optimizes certain criterion.
Issues Determine how to split the records
• How to specify the attribute test condition?• How to determine the best split?
Determine when to stop splitting
![Page 53: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/53.jpg)
How to Specify Test Condition?
Depends on attribute types Nominal Ordinal Continuous
Depends on number of ways to split 2-way split Multi-way split
![Page 54: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/54.jpg)
Splitting Based on Nominal Attributes
Multi-way split: Use as many partitions as distinct values.
Binary split: Divides values into two subsets. Need to find optimal
partitioning.
CarTypeFamily
Sports
Luxury
CarType{Family, Luxury} {Sports}
CarType{Sports,
Luxury}
{Family} OR
![Page 55: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/55.jpg)
Splitting Based on Continuous Attributes
Different ways of handling Discretization to form an ordinal
categorical attribute• Static – discretize once at the beginning• Dynamic – ranges can be found by equal
interval bucketing, equal frequency bucketing
(percentiles), or clustering.
Binary Decision: (A < v) or (A v)• consider all possible splits and finds the best
cut• can be more compute intensive
![Page 56: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/56.jpg)
Splitting Based on Continuous Attributes
TaxableIncome> 80K?
Yes No
TaxableIncome?
(i) Binary split (ii) Multi-way split
< 10K
[10K,25K) [25K,50K) [50K,80K)
> 80K
![Page 57: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/57.jpg)
Tree Induction
Greedy strategy. Split the records based on an attribute test
that optimizes certain criterion.
Issues Determine how to split the records
• How to specify the attribute test condition?• How to determine the best split?
Determine when to stop splitting
![Page 58: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/58.jpg)
How to determine the Best Split
OwnCar?
C0: 6C1: 4
C0: 4C1: 6
C0: 1C1: 3
C0: 8C1: 0
C0: 1C1: 7
CarType?
C0: 1C1: 0
C0: 1C1: 0
C0: 0C1: 1
StudentID?
...
Yes No Family
Sports
Luxury c1c10
c20
C0: 0C1: 1
...
c11
Before Splitting: 10 records of class 0,10 records of class 1
Which test condition is the best?
![Page 59: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/59.jpg)
How to determine the Best Split
Greedy approach: Nodes with homogeneous class
distribution are preferred Need a measure of node impurity:
C0: 5C1: 5
C0: 9C1: 1
Non-homogeneous,
High degree of impurity
Homogeneous,
Low degree of impurity
![Page 60: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/60.jpg)
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
![Page 61: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/61.jpg)
How to Find the Best Split
B?
Yes No
Node N3 Node N4
A?
Yes No
Node N1 Node N2
Before Splitting:
C0 N10 C1 N11
C0 N20 C1 N21
C0 N30 C1 N31
C0 N40 C1 N41
C0 N00 C1 N01
M0
M1 M2 M3 M4
M12
M34Gain = M0 – M12 vs M0 –
M34
![Page 62: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/62.jpg)
Measure of Impurity: GINI Gini Index for a given node t :
(NOTE: p( j | t) is the relative frequency of class j at node t).
Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information
Minimum (0.0) when all records belong to one class, implying most interesting information
j
tjptGINI 2)]|([1)(
C1 0C2 6
Gini=0.000
C1 2C2 4
Gini=0.444
C1 3C2 3
Gini=0.500
C1 1C2 5
Gini=0.278
![Page 63: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/63.jpg)
Examples for computing GINI
C1 0 C2 6
C1 2 C2 4
C1 1 C2 5
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0
j
tjptGINI 2)]|([1)(
P(C1) = 1/6 P(C2) = 5/6
Gini = 1 – (1/6)2 – (5/6)2 = 0.278
P(C1) = 2/6 P(C2) = 4/6
Gini = 1 – (2/6)2 – (4/6)2 = 0.444
![Page 64: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/64.jpg)
Splitting Based on GINI Used in CART, SLIQ, SPRINT. When a node p is split into k partitions (children),
the quality of split is computed as,
where, ni = number of records at child i,
n = number of records at node p.
k
i
isplit iGINI
n
nGINI
1
)(
![Page 65: Data Mining: Introduction Based on slides from: Tan, Steinbach, Kumar.](https://reader035.fdocuments.net/reader035/viewer/2022062323/56649f075503460f94c1cc6d/html5/thumbnails/65.jpg)
Binary Attributes: Computing GINI Index
Splits into two partitions Effect of Weighing partitions:
– Larger and Purer Partitions are sought for.
B?
Yes No
Node N1 Node N2
Parent
C1 6
C2 6
Gini = 0.500
N1 N2 C1 5 1
C2 2 4
Gini=0.333
Gini(N1) = 1 – (5/6)2 – (2/6)2 = 0.194
Gini(N2) = 1 – (1/6)2 – (4/6)2 = 0.528
Gini(Children) = 7/12 * 0.194 + 5/12 * 0.528= 0.333