URI Dept of Computer Science and Statistics - An Overview and Example of Data Mining - Larose
An Overview and Example of Data Mining
Daniel T. Larose, Ph.D.
Professor of Statistics
Director, Data Mining @CCSU
Editor, Wiley Series on Methods and Applications in Data Mining
[email protected]  www.math.ccsu.edu/larose
University of Rhode Island
Department of Computer Science and Statistics
March 30, 2007

Overview
• Part One:
  – A Brief Overview of Data Mining
• Part Two:
  – An Example of Data Mining:
  – Modeling Response to Direct Mail Marketing
• But first, a shameless plug …

Master of Science in DM at CCSU: Faculty
• Dr. Roger Bilisoly (from Ohio State Univ., Statistics)
  – Text Mining, Intro to Data Mining
• Dr. Darius Dziuda (from Warsaw Polytechnic Univ., CS)
  – Data Mining for Genomics and Proteomics, Biomarker Discovery
• Dr. Zdravko Markov (from Sofia Univ., CS)
  – Data Mining (CS perspective), Machine Learning
• Dr. Daniel Miller (from UConn, Statistics)
  – Applied Multivariate Analysis, Mathematical Statistics II, Intro to Data Mining
• Dr. Krishna Saha (from Univ. of Windsor, Statistics)
  – Intro to Data Mining using R
• Dr. Daniel Larose (Program Director) (from UConn, Statistics)
  – Intro to Data Mining, Data Mining Methods, Applied Data Mining, Web Mining

Master of Science in DM at CCSU: Program (36 credits)
• Core Courses (27 credits). All available online.
  – Stat 521 Introduction to Data Mining (4 cr)
  – Stat 522 Data Mining Methods (4 cr)
  – Stat 523 Applied Data Mining (4 cr)
  – Stat 525 Web Mining
  – Stat 526 Data Mining for Genomics and Proteomics
  – Stat 527 Text Mining
  – Stat 416 Mathematical Statistics II
  – Stat 570 Applied Multivariate Analysis
• Electives (6 credits. Choose two)
  – CS 570 Topics in Artificial Intelligence: Machine Learning
  – CS 580 Topics in Advanced Database: Data Mining
  – Stat 455 Experimental Design
  – Stat 551 Applied Stochastic Processes
  – Stat 567 Linear Models
  – Stat 575 Mathematical Statistics III
  – Stat 529 Current Issues in Data Mining
• Capstone Requirement: Stat 599 Thesis (3 credits)

Master of Science in DM at CCSU
• Only MS in DM that is entirely online.
• Some courses available on campus.
• Student must come to CCSU to present thesis.
• We reach students in about 30 US states and a dozen foreign countries.
• Half of our students already have master’s degrees.
• About 15% already have Ph.D.s.
• Typical student is a mid-career professional.
• Backgrounds are diverse: Computer Science, Engineering, Finance, Chemistry, Database Admin, Statistics, etc.
• www.ccsu.edu/datamining

Graduate Certificate in Data Mining
• 18 Credits:
• Required Courses (12 credits)
  – Stat 521 Introduction to Data Mining
  – Stat 522 Data Mining Methods and Models
  – Stat 523 Applied Data Mining
• Elective Courses (6 credits. Choose two):
  – Stat 525 Web Mining
  – Stat 526 Data Mining for Genomics and Proteomics
  – Stat 527 Text Mining
  – Stat 529 Current Issues in Data Mining
  – Some other graduate-level data mining or statistics course, with approval of advisor.
• No Mathematical Statistics requirement.

Material for Part I Drawn From:
Discovering Knowledge in Data: An Introduction to Data Mining (Wiley, 2005)
• Chapter 1. An Introduction to Data Mining
• Chapter 2. Data Preprocessing
• Chapter 3. Exploratory Data Analysis
• Chapter 4. Statistical Approaches to Estimation and Prediction
• Chapter 5. K-Nearest Neighbor
• Chapter 6. Decision Trees
• Chapter 7. Neural Networks
• Chapter 8. Hierarchical and K-Means Clustering
• Chapter 9. Kohonen Networks
• Chapter 10. Association Rules
• Chapter 11. Model Evaluation Techniques

Material for Part II Drawn From:
Data Mining Methods and Models (Wiley, 2006)
• Chapter 1. Dimension Reduction Methods
• Chapter 2. Regression Modeling
• Chapter 3. Multiple Regression and Model Building
• Chapter 4. Logistic Regression
• Chapter 5. Naïve Bayes Classification and Bayesian Networks
• Chapter 6. Genetic Algorithms
• Chapter 7. Case Study: Modeling Response to Direct-Mail Marketing

No Material Drawn From:
Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage (Wiley, April 2007)
• Part One: Web Structure Mining
  – Information Retrieval and Web Search
  – Hyperlink-Based Ranking
• Part Two: Web Content Mining
  – Clustering
  – Evaluating Clustering
  – Classification
• Part Three: Web Usage Mining
  – Data Preprocessing,
  – Exploratory Data Analysis,
  – Association Rules, Clustering, and Classification for Web Usage Mining
• With Dr. Zdravko Markov, Computer Science, CCSU

Call for Book Proposals: Wiley Series on Methods and Applications in Data Mining
• Suggested topics:
  – Data Mining in Bioinformatics
  – Emerging Techniques in Data Mining (e.g., SVM)
  – Data Mining with Evolutionary Algorithms
  – Drug Discovery Using Data Mining
  – Mining Data Streams
  – Visual Analysis in Data Mining
• Books in press:
  – Data Mining for Genomics and Proteomics, by Darius Dziuda
  – Practical Text Mining Using Perl, by Roger Bilisoly
• Contact Series Editor at [email protected]

What is Data Mining?
• “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”
  – David Hand, Heikki Mannila & Padhraic Smyth, Principles of Data Mining, MIT Press, 2001

Why Data Mining?
• “We are drowning in information but starved for knowledge.”
  – John Naisbitt, Megatrends, 1984.
• “The problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge, and thence up the taxonomy tree into wisdom.”
  – Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, 2005.

Need for Human Direction
• Automation is no substitute for human supervision and input.
  – Humans need to be actively involved at every phase of the data mining process.
• “Rather than asking where humans fit into data mining, we should instead inquire about how we may design data mining into the very human process of problem solving.”
  – Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, 2005.

“Data Mining is Easy to Do Badly”
• Black box software
  – Powerful, “easy-to-use” data mining algorithms
  – Makes their misuse dangerous.
  – Too easy to point and click your way to disaster.
• What is needed:
  – An understanding of the underlying algorithmic and statistical model structures.
  – An understanding of which algorithms are most appropriate in which situations and for which types of data.

CRISP-DM: Cross-Industry Standard Process for Data Mining

CRISP: DM as a Process
1. Business / Research Understanding Phase
   Enunciate your objectives
2. Data Understanding Phase: EDA
3. Data Preparation Phase: Preprocessing
4. Modeling Phase: Fun and interesting!
5. Evaluation Phase
   Confluence of results? Objectives met?
6. Deployment Phase: Use results to solve problem.
   If desired: Use lessons learned to reformulate business / research objective.

What About Data Dredging?
“A sufficiently exhaustive search will certainly throw up patterns of some kind. Many of these patterns will simply be a product of random fluctuations, and will not represent any underlying structure.”
  – David J. Hand, Data Mining: Statistics and More? The American Statistician, May 1998.
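A small simulation can make Hand's point concrete. This sketch is not from the talk: it generates a target and 200 predictors that are all pure noise, then searches for the strongest correlation. The record count, predictor count, seed, and the cutoff of 0.279 (roughly the two-sided 5% critical value of |r| for 50 records) are all illustrative assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def dredge(n_records=50, n_predictors=200, r_cutoff=0.279, seed=0):
    """Search pure-noise predictors for 'relationships' with a pure-noise target.

    Returns the largest |r| found and how many predictors exceed r_cutoff.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_records)]
    abs_rs = []
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_records)]
        abs_rs.append(abs(pearson_r(predictor, target)))
    n_hits = sum(r > r_cutoff for r in abs_rs)
    return max(abs_rs), n_hits

best_r, n_hits = dredge()
print(f"largest |r| among noise predictors: {best_r:.3f}")
print(f"noise predictors passing the cutoff: {n_hits} of 200")
```

None of these predictors carries any structure, yet an exhaustive search will typically report a handful that look "significant".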

Guarding Against Data Dredging: Cross-Validation is the Key
• Partition the data into training set and test set.
• If the pattern shows up in both data sets, decreases the probability that it represents noise.
• More generally, may use n-fold cross-validation.
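The partitioning idea can be sketched as follows. This is a minimal illustration, not the procedure used in the case study; the function name `n_fold_splits`, the seed, and the toy sizes are assumptions. Each record lands in exactly one test fold, and a pattern is trusted more when it reappears on the held-out records.

```python
import random

def n_fold_splits(n_records, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; each record is held out exactly once."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into n_folds roughly equal folds
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

splits = list(n_fold_splits(n_records=10, n_folds=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), "held out;", len(train_idx), "records used for training")
```

With `n_folds=2` this reduces to a single training/test partition used in both directions.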

Inference and Huge Data Sets
• Hypothesis testing becomes sensitive at the huge sample sizes prevalent in data mining applications.
  – Even very tiny effects will be found significant.
  – So, data mining tends to de-emphasize inference
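To see why tiny effects become significant, consider a one-sample z-test with a fixed, practically negligible effect. The numbers below (an observed mean of 0.01 standard deviations, and the three sample sizes) are illustrative assumptions, not figures from the talk.

```python
import math

def two_sided_p(effect, sd, n):
    """Two-sided p-value of a one-sample z-test of mean == 0,
    given an observed sample mean `effect` and known standard deviation `sd`."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny effect, increasingly large samples
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  p = {two_sided_p(effect=0.01, sd=1.0, n=n):.3g}")
```

The effect size never changes; only the sample size does, which is why a p-value alone says little about practical importance at data mining scale.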

Need for Transparency and Interpretability
• Data mining models should be transparent
  – Results should be interpretable by humans
• Decision Trees are transparent
• Neural Networks tend to be opaque
• If a customer complains about why he/she was turned down for credit, we should be able to explain why, without saying “Our neural net said so.”

Part Two: Modeling Response to Direct Mail Marketing
Business Understanding Phase:
– Clothing Store Purchase Data
  • Results of a direct mail marketing campaign
• Task: Construct a classification model
  – For classifying customers as either responders or non-responders to the marketing campaign,
  – To reduce costs and increase return-on-investment

Data Understanding: The Clothing Store dataset
List of fields in the dataset (28,799 customers, 51 fields)
• Customer ID: Unique, encrypted customer identification
• Number of days the customer has been on file
• Product uniformity (Low score = diverse spending patterns)
• Zip Code
• Number of days between purchases
• Lifetime average time between visits
• Number of purchase visits
• Markdown percentage on customer purchases
• Microvision® Lifestyle Cluster Type
• Total net sales
• Number of different product classes purchased
• Percent of Returns
• Average amount spent per visit
• Number of coupons used by the customer
• Flag: Credit card user
• Amount spent at each of four different franchises (four variables)
• Total number of individual items purchased by the customer
• Flag: Valid phone number on file
• Amount spent in the past month, the past three months, and the past six months
• Number of stores the customer shopped at
• Flag: Web shopper
• Amount spent the same period last year
• Number of promotions mailed in the past year
• 15 variables providing the percentages spent by the customer on specific classes of clothing, including sweaters, knit tops, knit dresses, blouses, jackets, career pants, casual pants, shirts, dresses, suits, outerwear, jewelry, fashion, legwear, and the collectibles line. Also a variable showing the brand of choice (encrypted).
• Gross margin percentage
• Number of promotions responded to in the past year
• Number of marketing promotions on file
• Promotion response rate for the past year
• Target variable: Response to promotion

Data Preparation and EDA Phase
• Not covered in this presentation.

Modeling Strategy
• Apply principal components analysis to address multicollinearity.
• Apply cluster analysis. Briefly profile clusters.
• Balance the training data set.
• Establish baseline model performance
  – In terms of expected profit per customer contacted.
• Apply classification algorithms to training data set:
  – CART
  – C5.0 (C4.5)
  – Neural networks
  – Logistic regression.

Modeling Strategy continued
• Evaluate each model using test data set.
• Apply misclassification costs in line with cost benefit table.
• Apply overbalancing as a surrogate for misclassification costs.
  – Find best overbalancing proportion.
• Combine predictions from four models
  – Using model voting.
  – Using mean response probabilities.
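The two ways of combining models named above can be sketched as follows. This is a generic illustration under assumptions: the function names, the vote count required, the 0.5 probability threshold, and the example outputs for one customer are all hypothetical and not taken from the case study.

```python
def vote(predictions, min_votes):
    """Model voting: classify as responder (1) when at least
    `min_votes` of the models predict response."""
    return 1 if sum(predictions) >= min_votes else 0

def mean_probability(probabilities, threshold=0.5):
    """Mean response probability: average the models' predicted
    probabilities and classify as responder (1) above `threshold`."""
    mean_p = sum(probabilities) / len(probabilities)
    return (1 if mean_p > threshold else 0), mean_p

# Hypothetical outputs of four models for a single customer
preds = [1, 0, 1, 1]              # 1 = predicted responder
probs = [0.62, 0.41, 0.55, 0.70]  # predicted response probabilities

print("majority (3 of 4):", vote(preds, min_votes=3))
print("unanimous (4 of 4):", vote(preds, min_votes=4))
print("mean probability:", mean_probability(probs))
```

Lowering `min_votes` makes the combined model contact more customers, which trades extra false positives for fewer false negatives.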

Principal Components Analysis (PCA)
• Multicollinearity does not degrade prediction accuracy.
  – But muddles individual predictor coefficients.
• Interested in predictor characteristics, customer profiling, etc.?
  – Then PCA is required.
• But, if interested solely in classification (prediction, estimation),
  – PCA not strictly required.

Report Two Model Sets:
• Model Set A:
  – Includes principal components
  – All purpose model set
• Model Set B:
  – Includes correlated predictors, not principal components
  – Use restricted to classification

Principal Components Analysis (PCA)
• Seven correlated variables.
  – Two components extracted
  – Account for 87% of variability

Principal Components Analysis (PCA)
• Principal Component 1:
  – Purchasing Habits
  – Customer general purchasing habits
  – Expect component to be strongly indicative of response
• Principal Component 2:
  – Promotion Contacts
  – Unclear whether component will be associated with response
• Components validated by test data set

BIRCH Clustering Algorithm
• Requires only one pass through data set
  – Scalable for large data sets
• Benefit: Analyst need not pre-specify number of clusters
• Drawback: Sensitive to initial records encountered
  – Leads to widely variable cluster solutions
• Requires “outer loop” to find consistent cluster solution
• Zhang, Ramakrishnan and Livny, BIRCH: A New Data Clustering Algorithm and Its Applications, Data Mining and Knowledge Discovery 1, 1997.

BIRCH Clusters
• Cluster 3 shows:
  – Higher response for flag predictors
  – Higher averages for numeric predictors

Variable                                   Cluster 1  Cluster 2  Cluster 3
z ln Purchase Visits                       –0.575     –0.570      1.011
z ln Total Net Sales                       –0.177     –0.804      0.971
z sqrt Spending Last One Month             –0.279     –0.314      0.523
z ln Lifetime Average Time Between Visits   0.455      0.484     –0.835
z ln Product Uniformity                     0.493      0.447     –0.834
z sqrt # Promotion Responses in Past Year  –0.480     –0.573      0.950
z sqrt Spending on Sweaters                –0.486      0.261      0.116

BIRCH Clusters
• Cluster 3 has highest response rate (red).
  – Cluster 1: 7.6%
  – Cluster 2: 7.1%
  – Cluster 3: 33.0%

Balancing the Data
• For “rare” classes, provides more equitable distribution.
• Drawback: Loss of data:
  – Here, 40% of non-responders randomly omitted
  – All responders retained
  – Proportion of responders increases from 16.58% to 24.76%
• Test data set should never be balanced
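Balancing by random omission can be sketched as below. The record counts are invented to match the 16.58% starting rate quoted above; the function names and seed are assumptions, and the real training set is not reproduced here. Because the omission is random, the resulting share lands near a quarter rather than exactly on the slide's figure.

```python
import random

def balance(records, is_responder, drop_fraction, seed=0):
    """Keep every responder; randomly drop about `drop_fraction` of non-responders."""
    rng = random.Random(seed)
    return [r for r in records
            if is_responder(r) or rng.random() >= drop_fraction]

def responder_share(records, is_responder):
    """Proportion of records that are responders."""
    return sum(1 for r in records if is_responder(r)) / len(records)

def is_resp(r):
    return r == 1

# Hypothetical training set: 1 = responder, 0 = non-responder (16.58% responders)
train = [1] * 1658 + [0] * 8342
balanced = balance(train, is_resp, drop_fraction=0.40)

print(f"before: {responder_share(train, is_resp):.2%}")
print(f"after:  {responder_share(balanced, is_resp):.2%}")
```

Only the training set is passed through `balance`; the test set keeps its natural class proportions so that evaluation reflects what deployment will see.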

False Positive vs. False Negative: Which is Worse?
• For direct mail marketing, a false negative error is probably worse than a false positive.
• Generate misclassification costs based on the observed data.
  – Construct cost-benefit table

Decision Cost / Benefit Analysis

Outcome         Classified  Actual  Cost     Rationale
True Negative   No          No      $0       No contact made; no revenue lost
True Positive   Yes         Yes     -$26.40  (Anticipated revenue) – (Cost of contact)
False Negative  No          Yes     $28.40   Loss of anticipated revenue
False Positive  Yes         No      $2.00    Cost of contact

Establish Baseline Model Performance
• Benchmarks
  – “Don’t Send a Marketing Promotion to Anyone” Model
  – “Send a Marketing Promotion to Everyone” Model
• Will compare candidate models against this baseline error rate.

Model                TN (cost $0)  TP (cost -$26.40)  FN (cost $28.40)  FP (cost $2.00)  Overall Error Rate  Overall Cost
“Don’t Send Anyone”  5908          0                  1151              0                16.3%               $32,688.40 ($4.63 per customer)
“Send to Everyone”   0             1151               0                 5908             83.7%               -$18,570.40 (-$2.63 per customer)
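The baseline figures follow directly from the cost-benefit table and the confusion counts. The sketch below recomputes them; the counts and unit costs are those shown above, while the names `COSTS` and `evaluate` are illustrative.

```python
# Unit costs from the cost-benefit table (negative cost = profit)
COSTS = {"TN": 0.00, "TP": -26.40, "FN": 28.40, "FP": 2.00}

def evaluate(counts):
    """Return (overall cost, cost per customer, overall error rate)
    for a dict of confusion-matrix counts keyed by TN/TP/FN/FP."""
    n = sum(counts.values())
    total_cost = sum(COSTS[k] * counts[k] for k in COSTS)
    error_rate = (counts["FN"] + counts["FP"]) / n
    return total_cost, total_cost / n, error_rate

dont_send = {"TN": 5908, "TP": 0, "FN": 1151, "FP": 0}
send_all = {"TN": 0, "TP": 1151, "FN": 0, "FP": 5908}

for name, counts in (("Don't Send Anyone", dont_send), ("Send to Everyone", send_all)):
    total, per_customer, err = evaluate(counts)
    print(f"{name}: total ${total:,.2f}, ${per_customer:.2f} per customer, error {err:.1%}")
```

The same function applies unchanged to any candidate model's confusion counts, which makes cost per customer, rather than error rate, the natural yardstick: "Send to Everyone" has by far the worse error rate but the better cost.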
URI Dept of Computer Science and Statistics - An Overview and Example of Data Mining - LaroseURI Dept of Computer Science and Statistics - An Overview and Example of Data Mining - Larose 3737
Model Set A (With 50% Balancing)
• No model beats benchmark of $2.63 profit per customer
• Misclassification costs had not been applied
Model | TN (Cost $0) | TP (Cost -$26.40) | FN (Cost $28.40) | FP (Cost $2.00) | Overall Error Rate | Overall Cost per Customer
Neural Network | 4694 | 672 | 479 (9.3%) | 1214 (64.4%) | 24.0% | -$0.24
CART | 4348 | 829 | 322 (6.9%) | 1560 (65.3%) | 26.7% | -$1.36
C5.0 | 4465 | 782 | 369 (7.6%) | 1443 (64.9%) | 25.7% | -$1.03
Logistic Regression | 4293 | 872 | 279 (6.1%) | 1615 (64.9%) | 26.8% | -$1.68
• Now define FN cost = $28.40, FP cost = $2.00
– Outperformed baseline “Send to everyone” model
Model | TN (Cost $0) | TP (Cost -$26.40) | FN (Cost $28.40) | FP (Cost $2.00) | Overall Error Rate | Overall Cost per Customer
CART | 754 | 1147 | 4 (0.5%) | 5154 (81.8%) | 73.1% | -$2.81
C5.0 | 858 | 1143 | 8 (0.9%) | 5050 (81.5%) | 71.7% | -$2.81
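The percentages next to the FN and FP counts appear to be computed relative to the predictions made, not the actual classes. That reading is inferred from the numbers in the tables rather than stated on the slides. A short Python sketch under that assumption, with illustrative names:

```python
def _ratio(numerator, denominator):
    """Safe division: a model that never predicts a class has no rate for it."""
    return numerator / denominator if denominator else 0.0


def error_rates(tn, tp, fn, fp):
    """Return (FN rate, FP rate, overall error rate).

    Assumed definitions, inferred from the tables:
      FN rate = FN / (FN + TN), the share of "no" predictions that were wrong
      FP rate = FP / (FP + TP), the share of "yes" predictions that were wrong
    """
    total = tn + tp + fn + fp
    return _ratio(fn, fn + tn), _ratio(fp, fp + tp), _ratio(fn + fp, total)


# Neural network row of the 50% balancing table
fn_rate, fp_rate, overall = error_rates(tn=4694, tp=672, fn=479, fp=1214)
print(f"FN {fn_rate:.1%}, FP {fp_rate:.1%}, overall {overall:.1%}")
```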
Model Set A: Effect of Misclassification Costs
• For the 447 highlighted records:
– Only 20.8% responded.
– But model predicts positive response.
– Due to high false negative misclassification cost.
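One way to see why a 20.8% response rate still justifies a positive prediction is to compare the expected cost of mailing with the expected cost of not mailing. The Python sketch below is an illustrative expected-cost calculation using the case-study costs. It is an assumption about the reasoning, not a description of how the modeling software applies misclassification costs.

```python
# Costs from the decision cost / benefit table; function names are illustrative.
TP_COST, FP_COST, FN_COST, TN_COST = -26.40, 2.00, 28.40, 0.00


def expected_cost_mail(p_response):
    """Expected cost per record if the promotion is mailed."""
    return p_response * TP_COST + (1 - p_response) * FP_COST


def expected_cost_skip(p_response):
    """Expected cost per record if the promotion is not mailed."""
    return p_response * FN_COST + (1 - p_response) * TN_COST


p = 0.208  # response rate of the highlighted records
print("mail:", round(expected_cost_mail(p), 4))
print("skip:", round(expected_cost_skip(p), 4))

# Response rate at which both decisions have the same expected cost
break_even = (FP_COST - TN_COST) / ((FN_COST - TN_COST) - (TP_COST - FP_COST))
print("break-even response rate:", round(break_even, 4))
```

Under these costs mailing has the lower expected cost for any group whose response rate exceeds the break-even rate, which is far below 20.8%.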
Model Set A: PCA Component 1 is Best Predictor
• First principal component ($F-PCA-1), Purchasing Habits, represents both the root node split and the secondary split
– Most important factor for predicting response
Over-Balancing as a Surrogate for Misclassification Costs
• Software limitation:
• Neural network and logistic regression models in Clementine:
– Lack methods for applying misclassification costs
• Over-balancing is an alternate method which can achieve similar results
• Starves the classifier of instances of non-response
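The slides do not say how the balancing was implemented, so the following Python sketch shows only one plausible approach: keep every responder and randomly discard non-responders until responders make up the requested share of the training set. The record layout, function name and parameters are all assumptions for illustration.

```python
import random


def over_balance(records, responder_share, seed=0):
    """Downsample non-responders so responders form about `responder_share`.

    `records` is a list of (features, responded) pairs; this layout is an
    assumption. All responders are kept; non-responders are sampled
    without replacement.
    """
    if not 0 < responder_share <= 1:
        raise ValueError("responder_share must be in (0, 1]")
    rng = random.Random(seed)
    responders = [r for r in records if r[1]]
    non_responders = [r for r in records if not r[1]]
    # Number of non-responders that gives the requested class mix.
    keep = round(len(responders) * (1 - responder_share) / responder_share)
    keep = min(keep, len(non_responders))
    sample = responders + rng.sample(non_responders, keep)
    rng.shuffle(sample)
    return sample


# Synthetic training data: 100 responders and 900 non-responders
training = [((i,), i < 100) for i in range(1000)]
balanced = over_balance(training, responder_share=0.80)
share = sum(1 for _, responded in balanced if responded) / len(balanced)
print(len(balanced), round(share, 2))
```

Only the training data would be resampled this way; evaluation would still use the original class proportions.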
Over-Balancing as a Surrogate for Misclassification Costs
• Neural network model results
– Three over-balanced models outperform baseline
• Properly applied, over-balancing can be used as a surrogate for misclassification costs
Model | TN (Cost $0) | TP (Cost -$26.40) | FN (Cost $28.40) | FP (Cost $2.00) | Overall Error Rate | Overall Cost per Customer
No Balancing (16.3% - 83.7%) | 5865 | 124 | 1027 (14.9%) | 43 (25.7%) | 15.2% | +$3.68
50% - 50% Balancing | 4694 | 672 | 479 (9.3%) | 1214 (64.4%) | 24.0% | -$0.24
65% - 35% Over-Balancing | 1918 | 1092 | 59 (3.0%) | 3990 (78.5%) | 57.4% | -$2.72
80% - 20% Over-Balancing | 1032 | 1129 | 22 (2.1%) | 4876 (81.2%) | 69.4% | -$2.75
90% - 10% Over-Balancing | 592 | 1141 | 10 (1.7%) | 5316 (82.3%) | 75.4% | -$2.72
Over-Balancing as a Surrogate for Misclassification Costs
• Apply 80% - 20% over-balancing to the other models.
Model | TN (Cost $0) | TP (Cost -$26.40) | FN (Cost $28.40) | FP (Cost $2.00) | Overall Error Rate | Overall Cost per Customer
Neural Network | 885 | 1132 | 19 (2.1%) | 5023 (81.6%) | 71.4% | -$2.73
CART | 1724 | 1111 | 40 (2.3%) | 4184 (79.0%) | 59.8% | -$2.81
C5.0 | 1467 | 1116 | 35 (2.3%) | 4441 (79.9%) | 63.4% | -$2.77
Logistic Regression | 2389 | 1106 | 45 (1.8%) | 3519 (76.1%) | 50.5% | -$2.96
Combination Models: Voting
• Smoothes out strengths and weaknesses of each model
– Each model supplies a prediction for each record
– Count the votes for each record
• Disadvantage of combination models:
– Lack of easy interpretability
• Four competing combination models…
Combination Models: Voting
Mail a Promotion only if:
• All four models predict response
– Protects against false positives
– All four classification algorithms must agree on a positive prediction
• At least three models predict response
• At least two models predict response
• Any model predicts response
– Protects against false negatives
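The four voting rules differ only in how many positive votes are required before a promotion is mailed. A minimal Python sketch, with a hypothetical function name and made-up predictions for a single record:

```python
def mail_promotion(predictions, min_votes):
    """Mail only if at least `min_votes` of the models predict response.

    `predictions` holds one boolean per model (True = predicts response).
    """
    return sum(predictions) >= min_votes


# Hypothetical predictions for one record, in the order
# neural network, CART, C5.0, logistic regression
votes = [True, False, True, True]

rules = {"all four": 4, "at least three": 3, "at least two": 2, "any": 1}
for rule, needed in rules.items():
    print(f"{rule}: mail = {mail_promotion(votes, needed)}")
```

Raising `min_votes` makes the rule stricter and so guards against false positives; lowering it guards against false negatives.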
Combination Models: Voting
• None beat the logistic regression model: $2.96 profit per customer
• Perhaps combination models will do better with Model Collection B…
Combination Model | TN (Cost $0) | TP (Cost -$26.40) | FN (Cost $28.40) | FP (Cost $2.00) | Overall Error Rate | Overall Cost per Customer
Mail a Promotion Only if All Four Models Predict Response | 2772 | 1067 | 84 (2.9%) | 3136 (74.6%) | 45.6% | -$2.76
Mail a Promotion Only if Three or Four Models Predict Response | 1936 | 1115 | 36 (1.8%) | 3972 (78.1%) | 56.8% | -$2.90
Mail a Promotion Only if At Least Two Models Predict Response | 1207 | 1135 | 16 (1.3%) | 4701 (80.6%) | 66.8% | -$2.85
Mail a Promotion if Any Model Predicts Response | 550 | 1148 | 3 (0.5%) | 5358 (82.4%) | 75.9% | -$2.76
Model Collection B: Non-PCA Models
• Models retain correlated variables
– Use restricted to prediction only
• Since the correlated variables are highly predictive
– Expect Collection B will outperform the PCA models
Model Collection B: CART and C5.0
• Using misclassification costs, and 50% balancing
• Both models outperform the best PCA model
Model | TN (Cost $0) | TP (Cost -$26.40) | FN (Cost $28.40) | FP (Cost $2.00) | Overall Error Rate | Overall Cost per Customer
CART | 1645 | 1140 | 11 (0.7%) | 4263 (78.9%) | 60.5% | -$3.01
C5.0 | 1562 | 1147 | 4 (0.3%) | 4346 (79.1%) | 61.6% | -$3.04
Model Collection B: Over-Balancing
• Apply over-balancing as a surrogate for misclassification costs for all models
• Best performance thus far.
Model | TN (Cost $0) | TP (Cost -$26.40) | FN (Cost $28.40) | FP (Cost $2.00) | Overall Error Rate | Overall Cost per Customer
Neural Network | 1301 | 1123 | 28 (2.1%) | 4607 (80.4%) | 65.7% | -$2.78
CART | 2780 | 1100 | 51 (1.8%) | 3128 (74.0%) | 45.0% | -$3.02
C5.0 | 2640 | 1121 | 30 (1.1%) | 3268 (74.5%) | 46.7% | -$3.15
Logistic Regression | 2853 | 1110 | 41 (1.4%) | 3055 (73.3%) | 43.9% | -$3.12
Combination Models: Voting
• Combine the four models via voting and 80%-20% over-balancing
• Synergy: Combination model outperforms any individual model.
Combination Model | TN (Cost $0) | TP (Cost -$26.40) | FN (Cost $28.40) | FP (Cost $2.00) | Overall Error Rate | Overall Cost per Customer
Mail a Promotion Only if All Four Models Predict Response | 3307 | 1065 | 86 (2.5%) | 2601 (70.9%) | 38.1% | -$2.90
Mail a Promotion Only if Three or Four Models Predict Response | 2835 | 1111 | 40 (1.4%) | 3073 (73.4%) | 44.1% | -$3.12
Mail a Promotion Only if At Least Two Models Predict Response | 2357 | 1133 | 18 (0.7%) | 3551 (75.8%) | 50.6% | -$3.16
Mail a Promotion if Any Model Predicts Response | 1075 | 1145 | 6 (0.6%) | 4833 (80.8%) | 68.6% | -$2.89
Combining Models Using Mean Response Probabilities
• Combine the confidences that each model reports for its decisions
– Allows finer tuning of the decision space
• Derive a new variable:
– Mean Response Probability (MRP):
• Average of response confidences of the four models.
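The slides do not spell out how each model's reported confidence is turned into a response probability. The Python sketch below therefore assumes a simple transformation: the confidence refers to the predicted class, so a "no response" prediction with confidence c contributes 1 - c. The case study may have used a different transformation. All names are illustrative.

```python
def response_probability(predicts_response, confidence):
    """Convert (prediction, confidence) into a probability of response.

    Assumption: `confidence` is the model's confidence in its own
    prediction, so it is flipped for a "no response" prediction.
    """
    return confidence if predicts_response else 1.0 - confidence


def mean_response_probability(model_outputs):
    """Average the response probabilities across all models for one record."""
    probabilities = [response_probability(p, c) for p, c in model_outputs]
    return sum(probabilities) / len(probabilities)


# Hypothetical outputs for one record from four models: (prediction, confidence)
outputs = [(True, 0.9), (True, 0.7), (False, 0.6), (False, 0.8)]
print(round(mean_response_probability(outputs), 2))
```

Unlike vote counting, the averaged value can be thresholded anywhere between 0 and 1, which is what allows the finer tuning of the decision space.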
Combining Models Using Mean Response Probabilities
• Multi-modality due to the discontinuity of the transformation used in derivation of MRP
Combining Models Using Mean Response Probabilities
• Where shall we define response vs. non-response?
– Recall that FN is 14.2 times worse than FP
– Set partitions on the low side => fewer FN decisions are made
Combining Models Using Mean Response Probabilities
• Optimal partition: near 50%.
• Mail a promotion to a prospective customer only if the mean response probability is at least 50%
• Best model in case study.
– MRP = 0.51: $3.1744 profit per customer
– “Send to everyone”: $2.63 profit per customer
– 20.7% profit enhancement (54.44 cents)
MRP Partition | TN (Cost $0) | TP (Cost -$26.40) | FN (Cost $28.40) | FP (Cost $2.00) | Overall Error Rate | Overall Cost per Customer
0.95 | 5648 | 353 | 798 (12.4%) | 260 (42.4%) | 15.0% | +$1.96
0.85 | 3810 | 994 | 157 (4.0%) | 2098 (67.8%) | 31.9% | -$2.49
0.65 | 2995 | 1104 | 47 (1.5%) | 2913 (72.5%) | 41.9% | -$3.11
0.54 | 2796 | 1113 | 38 (1.3%) | 3112 (73.7%) | 44.6% | -$3.13
0.52 | 2738 | 1121 | 30 (1.1%) | 3170 (73.9%) | 45.3% | -$3.1736
0.51 | 2686 | 1123 | 28 (1.0%) | 3222 (74.2%) | 46.0% | -$3.1744
0.50 | 2625 | 1125 | 26 (1.0%) | 3283 (74.5%) | 46.9% | -$3.1726
0.46 | 2493 | 1129 | 22 (0.9%) | 3415 (75.2%) | 48.7% | -$3.166
0.42 | 2369 | 1133 | 18 (0.8%) | 3539 (75.7%) | 50.4% | -$3.162
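The choice of partition can be checked by recomputing the cost per customer for each row and picking the minimum. The Python sketch below does this with the counts copied from the table; the variable and function names are illustrative assumptions.

```python
COSTS = {"TN": 0.00, "TP": -26.40, "FN": 28.40, "FP": 2.00}

# (MRP partition, TN, TP, FN, FP), copied from the table
PARTITIONS = [
    (0.95, 5648, 353, 798, 260),
    (0.85, 3810, 994, 157, 2098),
    (0.65, 2995, 1104, 47, 2913),
    (0.54, 2796, 1113, 38, 3112),
    (0.52, 2738, 1121, 30, 3170),
    (0.51, 2686, 1123, 28, 3222),
    (0.50, 2625, 1125, 26, 3283),
    (0.46, 2493, 1129, 22, 3415),
    (0.42, 2369, 1133, 18, 3539),
]


def cost_per_customer(tn, tp, fn, fp):
    """Average cost per customer; a negative value is a profit."""
    total = (tn * COSTS["TN"] + tp * COSTS["TP"]
             + fn * COSTS["FN"] + fp * COSTS["FP"])
    return total / (tn + tp + fn + fp)


best = min(PARTITIONS, key=lambda row: cost_per_customer(*row[1:]))
best_cost = cost_per_customer(*best[1:])
baseline = cost_per_customer(0, 1151, 0, 5908)  # "send to everyone"
gain = baseline - best_cost                     # extra profit per customer

print("best partition:", best[0])
print("cost per customer:", round(best_cost, 4))
print("gain over baseline (%):", round(100 * gain / -baseline, 1))
```

The cost curve is very flat between 0.42 and 0.54, so the exact optimum matters less than keeping the partition near 50%.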
Summary
• For more on this Case Study, see Data Mining Methods and Models (Wiley, 2006)
• So, the best part about all this is:
– Data mining is fun!
– If you love to play with data, and you love to construct and evaluate models, then data mining is for you.