San Diego Supercomputer Center National Partnership for ...

San Diego Supercomputer Center, National Partnership for Advanced Computational Infrastructure. SDSC Summer Institute 2004 Tutorial: Data Mining for Scientific Applications. Peter Shin, Hector Jasso, San Diego Supercomputer Center, UCSD


Page 1: San Diego Supercomputer Center National Partnership for ...

San Diego Supercomputer Center

National Partnership for Advanced Computational Infrastructure

SDSC Summer Institute 2004

TUTORIAL

Data Mining for Scientific Applications

Peter Shin Hector Jasso

San Diego Supercomputer Center UCSD

Page 2: San Diego Supercomputer Center National Partnership for ...

Overview

Introduction to data mining
• Definitions, concepts, applications
• Machine learning methods for KDD
  • Supervised learning – classification
  • Unsupervised learning – clustering

Cyberinfrastructure for data mining
• SDSC/NPACI resources – hardware and software

Survey of Applications at SKIDL

Break

Hands-on tutorial with IBM Intelligent Miner and SKIDLkit
• Customer targeting
• Microarray analysis (leukemia dataset)

Page 3: San Diego Supercomputer Center National Partnership for ...

Data Mining Definition

The search for interesting patterns and models, in large data collections, using statistical and machine learning methods, and high-performance computational infrastructure.

Key point: applications are data-driven and compute-intensive

Page 4: San Diego Supercomputer Center National Partnership for ...

Analysis Levels and Infrastructure

• Informal methods – graphs, plots, visualizations, exploratory data analysis (yes – Excel is a data mining tool)

• Advanced query processing and OLAP – e.g., National Virtual Observatory (NVO)

• Machine learning (compute-intensive statistical methods)
  • Supervised – classification, prediction
  • Unsupervised – clustering

• Computational infrastructure needed at all levels – collections management, information integration, high-performance database systems, web services, grid services, scientific workflows, the global IT grid

Page 5: San Diego Supercomputer Center National Partnership for ...

The Case for Data Mining: Data Reality

• Deluge from new sources
  • Remote sensing
  • Microarray processing
  • Wireless communication
  • Simulation models
  • Instrumentation – microscopes, telescopes
  • Digital publishing
  • Federation of collections

• “5 exabytes (5 million terabytes) of new information was created in 2002” (source: UC Berkeley researchers Peter Lyman and Hal Varian)

• This is the result of a recent paradigm shift: from hypothesis-driven data collection to data mining

• Data destination: Legacy archives and independent collection activities

Page 6: San Diego Supercomputer Center National Partnership for ...

Knowledge Discovery Process

Collection

Processing/Cleansing/Corrections

Analysis/Modeling

Presentation/Visualization

Application/Decision Support

Management/Federation/Warehousing

Data

Knowledge

“Data is not information; information is not knowledge; knowledge is not wisdom.” Gary Flake, Principal Scientist & Head of Yahoo! Research Labs, July 2004.

Page 7: San Diego Supercomputer Center National Partnership for ...

Characteristics of Data Mining Applications

• Data:

• Lots of data, numerous sources

• Noisy – missing values, outliers, interference

• Heterogeneous – mixed types, mixed media

• Complex – scale, resolution, temporal, spatial dimensions

• Relatively little domain theory, few quantitative causal models

• Lack of valid ground truth

• Advice: don’t choose problems that have all these characteristics …

Page 8: San Diego Supercomputer Center National Partnership for ...

Scientific vs. Commercial Data Mining

Goals:
• Science – Theories: need for insight and theory-based models, interpretable model structures, generate domain rules or causal structures, support for theory development
• Commercial – Profits: black boxes OK

Types of data:
• Science – Images, sensors, simulations
• Commercial – Transaction data
• Both – Spatial and temporal dimensions, heterogeneous

Trend – Common IT (information technology) tools fit both enterprises
• Database systems (Oracle, DB2, etc.), integration tools (Information Integrator), web services (Blue Titan, .NET)
• This is good!

Page 9: San Diego Supercomputer Center National Partnership for ...

Introduction to Machine Learning

Basic machine learning theory
• Concepts and feature vectors
• Supervised and unsupervised learning

Model development
• Training and testing methodology, model validation, overfitting, confusion matrices

Survey of algorithms
• Decision Trees classification
• k-means clustering
• Hierarchical clustering
• Bayesian networks and probabilistic inference
• Support vector machines

Page 10: San Diego Supercomputer Center National Partnership for ...

Basic Machine Learning Theory

Basic inductive learning hypothesis:
• Having a large number of observations, we can approximate the rule that describes how the data was generated, and thus generate a model (using some algorithm)

No Free Lunch Theorem:
• There is no ultimate algorithm: in the absence of prior information about the problem, there are no reasons to prefer one learning algorithm over another.

Conclusion:
• There is no problem-independent “best” learning system. Formal theory and algorithms are not enough.
• Machine learning is an empirical subject.

Page 11: San Diego Supercomputer Center National Partnership for ...

Concepts are described as feature vectors

Example: vehicles
• Has wheels
• Runs on gasoline
• Carries people
• Flies
• Weighs less than 500 pounds

Boolean feature vectors for vehicles
• car254 [ 1 1 1 0 0 ]
• motorcycle14 [ 1 1 1 0 1 ]
• airplane132 [ 1 1 1 1 0 ]

Page 12: San Diego Supercomputer Center National Partnership for ...

Easy to generalize to complex data types:

• Number of wheels
• Fuel type
• Carrying capacity
• Flies
• Weight

car254 [ 4, gas, 6, 0, 2000 ]
motorcycle14 [ 2, gas, 2, 0, 400 ]
airplane132 [ 10, jetfuel, 110, 1, 35000 ]

Most machine learning algorithms expect feature vectors, stored in text files or databases

Suggestions:
• Identify the target concept
• Organize your data to fit feature vector representation
• Design your database schemas to support generation of data in this format
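As an illustration of the suggestions above, the mixed-type vehicle vectors can be held as typed records and written out as delimited text, one feature vector per row. This is a minimal sketch in Python; the `Vehicle` type, its field names, and the CSV layout are assumptions made for the example, not part of the tutorial materials.

```python
import csv
import io
from typing import NamedTuple


class Vehicle(NamedTuple):
    """One feature vector; the field names are illustrative assumptions."""
    name: str
    wheels: int
    fuel: str
    capacity: int
    flies: int   # 1 = yes, 0 = no
    weight: int  # pounds


VEHICLES = [
    Vehicle("car254", 4, "gas", 6, 0, 2000),
    Vehicle("motorcycle14", 2, "gas", 2, 0, 400),
    Vehicle("airplane132", 10, "jetfuel", 110, 1, 35000),
]


def to_csv(rows):
    """Serialize the records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(Vehicle._fields)
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(VEHICLES))
```

Keeping one record type per target concept makes it straightforward to generate the same rows from a database query instead of a literal list.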

Page 13: San Diego Supercomputer Center National Partnership for ...

Supervised vs. Unsupervised Learning

Supervised – Each feature vector belongs to a class (label). Labels are given externally, and algorithms learn to predict the label of new samples/observations.

Unsupervised – Finds structure in the data, by clustering similar elements together. No previous knowledge of classes needed.

Page 14: San Diego Supercomputer Center National Partnership for ...

Model development

Model validation

• Hold-out validation (2/3, 1/3 splits)
• Cross validation, simple and n-fold (reuse)
• Bootstrap validation (sample with replacement)
• Jackknife validation (leave one out)

• When possible hide a subset of the data until train-test is complete.

Training and testing: Train, Test, Apply
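The hold-out and n-fold schemes listed on this slide can be sketched in a few lines of Python. The function names, the fixed random seed, and the shuffling strategy are assumptions for illustration; leave-one-out (jackknife) falls out as the special case where the number of folds equals the number of rows.

```python
import random


def holdout_split(rows, test_fraction=1 / 3, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def kfold_splits(rows, k, seed=0):
    """Yield (train, test) pairs so that every row is tested exactly once."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


data = list(range(12))
train, test = holdout_split(data)
print(len(train), len(test))
for fold_train, fold_test in kfold_splits(data, k=4):
    print(sorted(fold_test))
```

Setting aside a further subset before calling either function corresponds to the advice about hiding data until train-test is complete.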

Page 15: San Diego Supercomputer Center National Partnership for ...

Avoid overfitting

[Figure: Train and Test accuracy (0%–100%) plotted against algorithm steps (0–8), with markers labelled “Optimal Depth” and “Overfitting”.]

Page 17: San Diego Supercomputer Center National Partnership for ...

Confusion matrices

                   Predicted Negative   Predicted Positive
Actual Negative    124                  15
Actual Positive    8                    84

Accuracy = (124 + 84) / (124 + 15 + 8 + 84) “proportion of predictions correct”

True positive rate = 84 / (8 + 84) “proportion of positive cases correctly identified”

False positive rate = 15 / (124 + 15) “proportion of negative cases incorrectly classified as positive”

True negative rate = 124 / (124 + 15) “proportion of negative cases correctly identified”

False negative rate = 8 / (8 + 84) “proportion of positive cases incorrectly classified as negative”

Precision = 84 / (15 + 84) “proportion of predicted positive cases that were correct”
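The six ratios above all derive from the same four cell counts. A small Python sketch makes that explicit; the function name and dictionary keys are assumptions chosen for this example.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute the slide's six ratios from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "true_positive_rate": tp / (fn + tp),
        "false_positive_rate": fp / (tn + fp),
        "true_negative_rate": tn / (tn + fp),
        "false_negative_rate": fn / (fn + tp),
        "precision": tp / (fp + tp),
    }


# Cell counts taken from the matrix on this slide.
metrics = binary_metrics(tn=124, fp=15, fn=8, tp=84)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the true and false positive rates use different denominators (actual positives versus actual negatives), which is why they do not sum to one.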

Page 18: San Diego Supercomputer Center National Partnership for ...

Classification – Decision Tree

Ecosystem   Annual Precipitation
Desert      2
Forest      120
Forest      104
Desert      5
Forest      116
Prairie     63

Page 19: San Diego Supercomputer Center National Partnership for ...

Training data: Desert 2, Forest 120, Forest 104, Desert 5, Forest 116, Prairie 63

Precipitation > 63?
• YES: Forest 120, Forest 104, Forest 116
• NO: Desert 2, Desert 5, Prairie 63

Page 20: San Diego Supercomputer Center National Partnership for ...

Training data: Desert 2, Forest 120, Forest 104, Desert 5, Forest 116, Prairie 63

Precipitation > 63?
• YES: Forest 120, Forest 104, Forest 116
• NO: Desert 2, Desert 5, Prairie 63
  Precipitation > 5?
  • YES: Prairie 63
  • NO: Desert 2, Desert 5

Page 21: San Diego Supercomputer Center National Partnership for ...

Training data: Desert 2, Forest 120, Forest 104, Desert 5, Forest 116, Prairie 63

Learned Model

If (Precip > 63) then “Forest”

else If (Precip > 5) then “Prairie”

else “Desert”

Classification accuracy on training data is 100%

Confusion matrix (rows: actual, columns: predicted)

     D   F   P
D    2   0   0
F    0   3   0
P    0   0   1

Page 22: San Diego Supercomputer Center National Partnership for ...

Testing Set Results

Test Data

True      Precipitation   Predicted
Desert    8               Prairie
Forest    100             Forest
Prairie   55              Prairie
Desert    4               Desert
Forest    116             Forest
Prairie   72              Forest

Learned Model

IF (Precip > 63) then Forest

Else If (Precip > 5) then Prairie

Else Desert

Result: Accuracy 67%. Model shows overfitting, generalizes poorly

Confusion matrix (rows: actual, columns: predicted)

     D   F   P
D    1   0   1
F    0   2   0
P    0   1   1
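The training and test results can be reproduced by writing the learned rule as a function and scoring it on both data sets. This is a sketch in Python using the values from the slides; the function and variable names are assumptions for illustration.

```python
from collections import Counter

# (ecosystem, annual precipitation) pairs copied from the slides.
TRAIN = [("Desert", 2), ("Forest", 120), ("Forest", 104),
         ("Desert", 5), ("Forest", 116), ("Prairie", 63)]
TEST = [("Desert", 8), ("Forest", 100), ("Prairie", 55),
        ("Desert", 4), ("Forest", 116), ("Prairie", 72)]


def classify(precip):
    """The learned decision tree, written as nested threshold tests."""
    if precip > 63:
        return "Forest"
    if precip > 5:
        return "Prairie"
    return "Desert"


def evaluate(rows):
    """Return (accuracy, confusion) with confusion keyed by (actual, predicted)."""
    confusion = Counter((actual, classify(precip)) for actual, precip in rows)
    correct = sum(n for (actual, pred), n in confusion.items() if actual == pred)
    return correct / len(rows), confusion


for name, rows in (("train", TRAIN), ("test", TEST)):
    accuracy, confusion = evaluate(rows)
    print(name, f"{accuracy:.0%}", dict(confusion))
```

The thresholds 63 and 5 sit exactly on training values, which is why two nearby test points (Desert 8 and Prairie 72) fall on the wrong side.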

Page 23: San Diego Supercomputer Center National Partnership for ...

Pruning to improve generalization

Pruned Decision Tree

Training data: Desert 2, Forest 120, Forest 104, Desert 5, Forest 116, Prairie 63

Precipitation < 60?
• YES: Desert 2, Desert 5
• NO: Forest 120, Forest 104, Forest 116, Prairie 63

IF (Precip < 60) then Desert

Else, [P(Forest) = .75] & [P(Prairie) = .25]
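The leaf probabilities of the pruned tree are simply the class proportions of the training rows that reach each leaf. A minimal Python sketch, with hypothetical function and key names, recovers them from the training data:

```python
from collections import Counter

# (ecosystem, annual precipitation) pairs copied from the training slide.
TRAIN = [("Desert", 2), ("Forest", 120), ("Forest", 104),
         ("Desert", 5), ("Forest", 116), ("Prairie", 63)]


def leaf_distributions(rows, threshold=60):
    """Class proportions on each side of the test 'Precipitation < threshold'."""
    leaves = {"below": Counter(), "at_or_above": Counter()}
    for label, precip in rows:
        side = "below" if precip < threshold else "at_or_above"
        leaves[side][label] += 1
    return {
        side: {label: n / sum(counts.values()) for label, n in counts.items()}
        for side, counts in leaves.items()
    }


print(leaf_distributions(TRAIN))
```

A pruned leaf that reports probabilities rather than a single label lets the caller decide how to act on the remaining uncertainty.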

Page 24: San Diego Supercomputer Center National Partnership for ...

Decision Trees Summary

• Simple to understand
• Works with mixed data types
• Heuristic search sensitive to local minima
• Models non-linear functions
• Handles classification and regression
• Many successful applications
• Readily available tools

Page 25: San Diego Supercomputer Center National Partnership for ...

Overview of Clustering

• Definition:
  • Clustering is the discovery of classes
  • Unlabeled examples => unsupervised learning.
• Survey of Applications
  • Grouping of web-visit data, clustering of genes according to their expression values, grouping of customers into distinct profiles
• Survey of Methods
  • k-means clustering
  • Hierarchical clustering
  • Expectation Maximization (EM) algorithm
  • Gaussian mixture modeling
• Cluster analysis
  • Concept (class) discovery
  • Data compression/summarization
  • Bootstrapping knowledge

Page 26: San Diego Supercomputer Center National Partnership for ...

Clustering – k-Means

Precipitation Temperature

8 81

71 70

62 63

49 45

17 76

32 49

Page 27: San Diego Supercomputer Center National Partnership for ...

Clustering – k-Means

[Figure: scatter plot, Temperature (30–90) versus Precipitation (0–80).]

Page 33: San Diego Supercomputer Center National Partnership for ...

Clustering – k-Means

Cluster   Temperature   Precipitation
C1        70 – 85       0 – 25
C2        35 – 60       25 – 55
C3        50 – 80       50 – 80

[Figure: scatter plot, Temperature (30–90) versus Precipitation (0–80).]

Page 35: San Diego Supercomputer Center National Partnership for ...

Clustering – k-Means

Cluster   Temperature   Precipitation   Ecosystem
C1        70 – 85       0 – 25          Desert
C2        35 – 60       25 – 55         Prairie
C3        50 – 80       50 – 80         Forest

[Figure: scatter plot, Temperature (30–90) versus Precipitation (0–80).]

Page 36: San Diego Supercomputer Center National Partnership for ...

Using k-means

• Requires a priori knowledge of ‘k’
• The final outcome depends on the initial choice of the k means, so results can be inconsistent between runs
• Sensitive to outliers, which can skew the means of their clusters
• Favors spherical clusters – clusters may not match domain boundaries
• Requires real-valued features
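To make the dependence on ‘k’ and on the starting means concrete, here is a minimal Python sketch of the standard k-means iteration (Lloyd's algorithm) run on the six precipitation/temperature points from the earlier slide. Seeding with the first three points is an assumption made for this example; the slides do not say how the means were initialized.

```python
import math

# (precipitation, temperature) pairs from the k-means data slide.
POINTS = [(8, 81), (71, 70), (62, 63), (49, 45), (17, 76), (32, 49)]


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and mean-update steps until assignments settle."""
    centroids = [tuple(float(x) for x in c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an emptied cluster keeps its previous mean
                centroids[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignments


# Assumed initialization: the first k = 3 points.
centroids, assignments = kmeans(POINTS, POINTS[:3])
print(centroids)
print(assignments)
```

Passing a different `initial_centroids` list is an easy way to observe the inconsistency noted above, and a single extreme point added to `POINTS` shows how an outlier drags its cluster mean.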

Page 37: San Diego Supercomputer Center National Partnership for ...

Cyberinfrastructure for Data Mining

• Resources – hardware and software (analysis tools and middleware)

• Policies – allocating resources to the scientific community. Challenges to the traditional supercomputer model. Requirements for interactive and real-time analysis resources.

Page 38: San Diego Supercomputer Center National Partnership for ...

NSF TeraGrid Building Integrated National CyberInfrastructure

• Prototype for CyberInfrastructure
  • Ubiquitous computational resources
  • Plug-in compatibility
• National Reach:
  • SDSC, NCSA, CIT, ANL, PSC
• High Performance Network:
  • 40 Gb/s backbone, 30 Gb/s to each site
• Over 20 Teraflops compute power
• Over 1PB Online Storage
• 8.9PB Archival Storage

Page 39: San Diego Supercomputer Center National Partnership for ...

SDSC is Data-Intensive Center

Page 40: San Diego Supercomputer Center National Partnership for ...

SDSC is Data-Intensive Center

Page 41: San Diego Supercomputer Center National Partnership for ...

SDSC Machine Room Data Architecture

Philosophy: enable SDSC configuration to serve the grid as Data Center

• .5 PB disk
• 6 PB archive
• 1 GB/s disk-to-tape
• Optimized support for DB2/Oracle

[Diagram labels: Blue Horizon; HPSS; LAN (multiple GbE, TCP/IP); SAN (2 Gb/s, SCSI); Linux Cluster, 4TF; Sun F15K; WAN (30 Gb/s); SCSI/IP or FC/IP; FC Disk Cache (400 TB); FC GPFS Disk (100TB); 200 MB/s per controller; Silos and Tape, 6 PB, 1 GB/sec disk to tape; 32 tape drives; 30 MB/s per drive; Database Engine; Data Miner; Vis Engine; Local Disk (50TB); Power 4; Power 4 DB]

Blue Horizon: 1152 processor IBM SP, 1.7 Teraflops

HPSS: over 600 TB data stored

Page 42: San Diego Supercomputer Center National Partnership for ...

SDSC IBM Regatta - DataStar

• 100+ TB Disk
• Numerous fast CPUs
• 64 GB of RAM per node
• DB2 v8.x ESE
• IBM Intelligent Miner
• SAS Enterprise Miner
• Platform for high-performance database, data mining, comparative IT studies …

Page 43: San Diego Supercomputer Center National Partnership for ...

Data Mining Tools used at SDSC

• SAS Enterprise Miner (Protein crystallization - JCSG)

• IBM Intelligent Miner (Protein crystallization - JCSG, Corn Yield – Michigan State University, Security logs - SDSC)

• CART (Protein crystallization - JCSG)

• Matlab SVM package (TeraBridge health monitoring – UCSD Structural Engineering Department, North Temperate Lakes Monitoring - LTER)

• PyML (Text Mining – NSDL, Hyperspectral data - LTER)

• SKIDLkit by SDSC (Microarray analysis – UCSD Cancer Center, Hyperspectral data - LTER)

• SVMlight (Hyperspectral data, LTER)

• LSI by Telcordia (Text Mining – NSDL)

• CoClustering by Fair Isaac (Text Mining – NSDL)

• Matlab Bayes Net package

• WEKA

Page 44: San Diego Supercomputer Center National Partnership for ...

SKIDLkit

• Toolkit for feature selection and classification
  • Filter methods
  • Wrapper methods
  • Data normalization
  • Feature selection
  • Support Vector Machine & Naïve Bayesian Clustering
• http://daks.sdsc.edu/skidl

• Will use it in the hands-on demo…

Page 45: San Diego Supercomputer Center National Partnership for ...

Survey of Applications at SDSC

• Sensor networks for bridge monitoring (with Structural Engineering Dept., UCSD)

• Text mining the NSDL (National Science Digital Library) collection

• Hyperspectral remote sensing data for groundcover classification (with Long Term Ecological Research Network - LTER)

• Microarray analysis for tumor detection (with UCSD Cancer Center)

Page 46: San Diego Supercomputer Center National Partnership for ...

Sensor Networks for Bridge Monitoring

• Task: detection & classification
  • Identify damaged piers based on the data stream of acceleration measurements.
  • Determine which sensors are key to determining bridge health.
  • Multi-resolution analysis. Rational resource management.
• Testbed:
  • Humboldt Bay Bridge with 8 piers.
• Assumptions:
  • Damage only happens at the lower end of each pier (location of plastic hinge)
  • There is only one damaged pier each time.

Page 47: San Diego Supercomputer Center National Partnership for ...

Page 48: San Diego Supercomputer Center National Partnership for ...

Text Mining the NSDL

Processing pipeline:
• Variously formatted documents
• Strip formatting
• Pick out content words using “stop lists”
• Stemming
• Discard words that appear in every document or only one
• Word count, term weighting
• Generate term-document matrix
• Query: for a list of words, get docs with highest score
• Various retrieval schemes (LSI, classification, or clustering modules)
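The pipeline on this slide can be sketched end to end in a few lines of Python. Everything here is an illustrative assumption rather than the NSDL project's code: the stop list is a tiny placeholder, stemming is omitted, term weights are raw counts, and the four sample documents are invented for the example.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}  # placeholder stop list


def tokenize(text):
    """Lowercase, keep alphabetic words, drop stop words (no stemming)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def term_document_matrix(docs):
    """Return (vocabulary, rows); rows[i][term] is the raw count in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    # Discard words that appear in every document or in only one.
    vocab = sorted(t for t, df in doc_freq.items() if 1 < df < len(docs))
    return vocab, [{t: c[t] for t in vocab} for c in counts]


def query(words, rows):
    """Rank document indices by the summed counts of the query words."""
    scores = [sum(row.get(w, 0) for w in words) for row in rows]
    return sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)


# Invented sample documents, for illustration only.
DOCS = [
    "Bridge sensors record acceleration data.",
    "Lake sensors record temperature data.",
    "Gene expression data for tumor tissue.",
    "Bridge data and tumor data.",
]
vocab, rows = term_document_matrix(DOCS)
print(vocab)
print(query(["tumor"], rows))
```

A retrieval scheme such as LSI would operate on the same matrix, replacing the simple count-based score in `query`.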

Page 49: San Diego Supercomputer Center National Partnership for ...

Hyperspectral Image Classification

• Characteristics of the data
  • Over 200 bands
  • Small number of samples through labor-intensive collecting process
• Collaboration with the Long Term Ecological Research Network
• Tasks:
  • Classify the vegetation (e.g. Juniper tree, Sage, etc.)
  • Identify key bands
  • Detect spatio-temporal patterns

Page 50: San Diego Supercomputer Center National Partnership for ...

Page 51: San Diego Supercomputer Center National Partnership for ...

Microarray Analysis for Tumor Detection

Characteristics of the Data:
• 88 prostate tissue samples:
  • 37 labeled “no tumor”
  • 51 labeled “tumor”
• Each tissue with 10,600 gene expression measurements
• Collected by the UCSD Cancer Center, analyzed at SDSC

Tasks:
• Build model to classify new, unseen tissues as either “no tumor” or “tumor”
• Identify key genes to determine their biological significance in the process of cancer

Page 52: San Diego Supercomputer Center National Partnership for ...

Some genes are more useful than others for building classification models

Example: genes 36569_at and 36495_at are useful

No Tumor

Tumor
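One common filter-style way to quantify how useful a gene is for separating the two classes is a signal-to-noise score: the gap between class means divided by the summed class standard deviations. The Python sketch below illustrates that idea only; it is not claimed to be the scoring used in SKIDLkit, and the gene names and expression values are made up for the example.

```python
import math
from statistics import mean, pstdev


def signal_to_noise(values, labels):
    """|mean(tumor) - mean(no tumor)| / (std(tumor) + std(no tumor))."""
    tumor = [v for v, y in zip(values, labels) if y == 1]
    normal = [v for v, y in zip(values, labels) if y == 0]
    gap = abs(mean(tumor) - mean(normal))
    spread = pstdev(tumor) + pstdev(normal)
    if spread == 0:
        return math.inf if gap > 0 else 0.0
    return gap / spread


def rank_genes(expression, labels):
    """expression maps a gene id to one value per sample; best genes first."""
    scores = {gene: signal_to_noise(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up toy values: 0 = "no tumor", 1 = "tumor".
LABELS = [0, 0, 0, 1, 1, 1]
EXPRESSION = {
    "gene_a": [1.0, 1.2, 0.8, 3.0, 3.2, 2.8],  # separates the classes
    "gene_b": [2.0, 3.0, 1.0, 2.0, 1.0, 3.0],  # same mean in both classes
}
print(rank_genes(EXPRESSION, LABELS))
```

With roughly 10,600 measurements per tissue and only 88 samples, a ranking step of this kind is typically applied before training a classifier.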

Page 53: San Diego Supercomputer Center National Partnership for ...

Results using independent test set

Page 54: San Diego Supercomputer Center National Partnership for ...

Break

Page 55: San Diego Supercomputer Center National Partnership for ...

Hands-on Analysis

• Decision Tree classification with IBM Intelligent Miner

• Using classification models to make rational decisions

Page 56: San Diego Supercomputer Center National Partnership for ...

Data Mining Example: Targeting Customers

• Problem Characteristics:

1. We make $50 profit on a sale of $200 shoes.

2. A preliminary study shows that people who make over $50k will buy the shoes at a rate of 5% when they receive the brochure.

3. People who make less than $50k will buy the shoes at a rate of 1% when they receive the brochure.

4. It costs $1 to send a brochure to a potential customer.

5. In general, we do not know whether a person will make more than $50k or not.

Page 57: San Diego Supercomputer Center National Partnership for ...

Available Information

• Variable Description
  • Please refer to the hand-out.

Page 58: San Diego Supercomputer Center National Partnership for ...

Possible Marketing Plans

• We will send out 30,000 brochures.

• Plan A: Ignore data and randomly send brochures (a.k.a ran-dumb plan)

• Plan B: Use data mining to target a specific group with high probabilities of responding (a.k.a Intelligent Target (IT) plan)

Page 59: San Diego Supercomputer Center National Partnership for ...

Plan A (ran-dumb plan)

• Strategy:
  • Send brochures to anyone
• Cost of sending one brochure = $1
• Probability of Response
  • 1% of the population who make <= $50k (76%)
  • 5% of the population who make > $50k (24%)
  • Resulting in: (1% * 76% + 5% * 24%) = 1.96% final response rate
• Earnings
  • Expected profit from one brochure = (Probability of response * profit – Cost of a brochure): (1.96% * $50 - $1) = -$0.02
  • Expected Earning = Expected profit from one brochure * number of brochures sent: -$0.02 * 30000 = -$600

Page 60: San Diego Supercomputer Center National Partnership for ...

Plan B (Intelligent Target (IT) plan)

• Strategy:
  • Send out brochures only to: married, college or above, managerial/professional/sales/tech. support/protective service/armed forces, age >= 28.5, hours_per_week >= 31
• Cost of sending one brochure = $1
• Probability of Response
  • 1% of the population who make <= $50k (20.6%)
  • 5% of the population who make > $50k (79.4%)
  • Resulting in: (1% * 20.6% + 5% * 79.4%) = 4.176% final response rate
• Earnings
  • Expected profit from one brochure = (Probability of response * profit – Cost of a brochure): (4.176% * $50 - $1) = $1.088
  • Expected Earning = Expected profit from one brochure * number of brochures sent: $1.088 * 30000 = $32,640

Page 61: San Diego Supercomputer Center National Partnership for ...

Comparison of Two Plans

• Expected earning from ran-dumb plan
  • -$600
• Expected earning from IT plan
  • $32,640
• Net Difference
  • $32,640 – (-$600) = $33,240
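Both plans use the same expected-value formula and differ only in the share of recipients earning over $50k. The arithmetic can be checked with a short Python sketch; the function and parameter names are assumptions for this example, while the rates, profit, cost, and brochure count come from the slides.

```python
def response_rate(share_high, rate_low=0.01, rate_high=0.05):
    """Blend the two purchase rates by the share of recipients earning > $50k."""
    return rate_low * (1 - share_high) + rate_high * share_high


def expected_earnings(share_high, n_brochures=30_000, profit=50.0, cost=1.0):
    """(probability of response * profit - cost per brochure) * brochures sent."""
    per_brochure = response_rate(share_high) * profit - cost
    return per_brochure * n_brochures


plan_a = expected_earnings(share_high=0.24)   # random mailing
plan_b = expected_earnings(share_high=0.794)  # targeted mailing
print(round(plan_a), round(plan_b), round(plan_b - plan_a))
```

Solving `response_rate(s) * 50 - 1 = 0` for `s` gives a break-even share of 25%, just above the 24% in the general population, which is why the random plan narrowly loses money.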

Page 62: San Diego Supercomputer Center National Partnership for ...

Acknowledgements

• Original source Census Bureau (1994)

• Data processed and donated by Ron Kohavi and Barry Becker (Data Mining and Visualization, SGI)