Introduction to Data Mining

Massive quantities of data exist on computers

Data mining is a way to use these data to learn

McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved


Definition

• DATA MINING: exploration & analysis
  – by automatic means
  – of large quantities of data
  – to discover actionable patterns & rules

• Data mining is a way to use massive quantities of data that businesses generate

• GOAL - improve marketing, sales, customer support through better understanding of customers


Retail Outlets

• Bar coding & scanning generate masses of data
  – customer service
  – inventory control
  – MICROMARKETING
  – CUSTOMER PROFITABILITY ANALYSIS
  – MARKET-BASKET ANALYSIS


Political Data Mining

Grossman et al., 10/18/2004, Time, 38

• 2004 Election
  – Republicans: VoterVault
    • From mid-1990s
    • About 165 million voters
    • Massive get-out-the-vote drive for those expected to vote Republican
  – Democrats: Demzilla
    • Also about 165 million voters
    • Names typically have 200 to 400 information items


Medical Diagnosis

J. Morris, Health Management Technology Nov 2004, 20, 22-24

• Electronic Medical Records
  – Associated Cardiovascular Consultants
    • 31 physicians
    • 40,000 patients per year, southern New Jersey
  – Data mined to identify efficient medical practice
  – Enhance patient outcomes
  – Reduced medical liability insurance


Mayo Clinic

Swartz, Information Management Journal Nov/Dec 2004, 8

• IBM developed EMR program
  – Complete records on almost 4.4 million patients
  – Doctors can ask how the last 100 Mayo patients with the same gender, age, and medical history responded to particular treatments


Business Uses of Data Mining

1. Customer profiling
   Identify profitability of customers

2. Targeting
   Determine characteristics of most profitable customers

3. Market-Basket Analysis
   Determine correlation of purchases by profile

Part of Customer Relationship Management


Reasons why Data Mining is now effective

• Data are there

• Data are warehoused (computerized)
  – Walmart: 35 thousand queries per week

• Computing economically available

• Competitive pressure

• Commercial products available


Trends

• Every business is service
  – hotel chains record your preferences
  – car rental companies the same
  – service versus price
    • credit card companies
    • long distance providers
    • airlines
    • computer retailers


Trends

• Mass Customization
  – produce tailored products from standardized components
    • Levi-Strauss - custom fit jeans
    • The Custom Foot
    • Andersen Windows
    • Individual, Inc.
      – electronic clipping
      – customer profiles of interests
      – send custom newsletter


Trends

• Information as Product
  – Custom Clothing Technology Corporation
    • fit jeans, other clothing
  – Lands End
  – J. Crew
• INFORMATION BROKERING
  – IMS - collects prescription data from pharmacies, sells to drug firms
  – AC Nielsen - TV


Trends

• Commercial Software Available
  – using statistical, artificial intelligence tools that have been developed
    • Enterprise Miner (SAS)
    • Intelligent Miner (IBM)
    • Clementine (SPSS)
    • PolyAnalyst (Megaputer)
    • Specialty products


How Data Mining Is Being Used

• U.S. Government
  – track down Oklahoma City bombers, Unabomber, many others
  – Treasury department - international funds transfers, money laundering
  – Internal Revenue Service


How Data Mining Is Used

• Safeway
  – offer Safeway Savings Club card
    • users given discounts
    • users must give personal information
    • every use, collect data
  – identify aggregate patterns (what sells well together; what should be sold together)
    • sell names for 5.5 cents per name to suppliers


How Data Mining Is Used

• Firefly
  – asks members to rate music and movies
  – subscribers clustered
  – clusters get custom-designed recommendations


Cross-selling

• USAA – insurance
  – doubled number of products held by average customer due to data mining
  – detailed records on customers
  – predict products they might need
• Fidelity Investments
  – regression - what makes customer loyal


Warranty Claims Routing

• Diesel engine manufacturer
  – stream of warranty claims
  – examine each by expert
    • determine whether charges are reasonable & appropriate
    • think of expert system to automate claims processing


Retaining Good Customers

• Customer loss:
  – Banks - Attrition
  – Cellular Phone Companies - Churn
    • study who might leave, why
• Southern California Gas
  – customer usage, credit information
  – direct mail contact - most likely best billing plan
  – who is price sensitive

• Who should get incentives, whom to keep


Fairbank & Morris

• Credit card company’s most valuable asset:

– INFORMATION ABOUT CUSTOMERS

• Signet Banking Corporation

– obtained behavioral data from many sources

– built predictive models

– aggressively marketed balance transfer card

• First Union

– who will move soon - improve retention


Methodology

Analyzing data

Given management goals and that management can translate knowledge into action


Basic Styles

• Top-Down: HYPOTHESIS TESTING
  – SUPERVISED
  – have a theory, experiment to prove or disprove
  – SCIENCE

• Bottom-Up: KNOWLEDGE DISCOVERY
  – UNSUPERVISED
  – start with data, see new patterns
  – CREATIVITY


Hypothesis Testing

• Generate theory

• Determine data needed

• Get data

• Prepare data

• Build computer model

• Evaluate model results
  – confirm or reject hypotheses


Generate Theory

• Study

• Systematically tie different input sources together (MENTAL MODEL)
  – What causes sales volume?
    • sales rep performance
    • economy, seasonality
    • product quality, price, promotion, location


Generate Theory

• Brainstorm:
  – diverse representatives for broad coverage of perspectives (electronic)
  – keep under control (keep positive)
  – generate testable hypotheses


Define Data Needed

• Determine data needed to test hypothesis
  – Lucky - query existing database
  – More often - gather
    • pull together from diverse databases, survey, buy


Locate Data

• Usually scattered or unavailable

• Sources:
  – warranty claims
  – point-of-sale data (cash register records)
  – medical insurance claims
  – telephone call detail records
  – direct mail response records
  – demographic data, economic data

• PROFILE: counts, summary statistics, cross-tabs, cleanup
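
The profiling step above can be sketched with the Python standard library. The records and field names (`region`, `responded`, `amount`) are hypothetical, chosen only to show counts, summary statistics, a cross-tab, and a first cleanup pass.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical direct-mail response records; field names are assumptions.
records = [
    {"region": "east", "responded": True,  "amount": 120.0},
    {"region": "east", "responded": False, "amount": 80.0},
    {"region": "west", "responded": True,  "amount": 200.0},
    {"region": "west", "responded": True,  "amount": None},   # missing value
    {"region": "east", "responded": False, "amount": 100.0},
]

# Counts: how many records fall in each category of one field.
region_counts = Counter(r["region"] for r in records)

# Summary statistics: computed only over values that are present.
amounts = [r["amount"] for r in records if r["amount"] is not None]
summary = {"n": len(amounts), "mean": mean(amounts), "median": median(amounts)}

# Cross-tab: joint counts for a pair of fields.
crosstab = Counter((r["region"], r["responded"]) for r in records)

# Cleanup: flag records with a missing field for later treatment.
missing = [i for i, r in enumerate(records) if r["amount"] is None]
```

A profile like this is usually the first place that gaps and odd encodings show up, before any model is built.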


Prepare Data for Analysis

• Summarize:
  – too much - no discriminant information
  – too little - swamped with useless detail
• Process for computer: EBCDIC, ASCII
• Data encoding: how data are recorded can vary - may have been collected with specific purpose (CAL omitting LA)
• Textual data: avoid if possible (may need to code)
• Missing values: missing salary - use mean?
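
Two of these preparation steps are easy to sketch: converting EBCDIC bytes to text, and filling a missing salary with the mean. The code page (`cp037`) and the salary values are assumptions for illustration; the right code page depends on the source system, and mean imputation is only one possible treatment.

```python
from statistics import mean

# EBCDIC bytes as they might arrive from a mainframe extract.
# "cp037" is one common EBCDIC code page (an assumption here).
ebcdic_bytes = b"\xc8\xc5\xd3\xd3\xd6"
text = ebcdic_bytes.decode("cp037")

# Hypothetical salary column with missing entries recorded as None.
salaries = [42000.0, None, 58000.0, 50000.0, None]

def impute_mean(values):
    """Replace missing entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

filled = impute_mean(salaries)
```

Mean imputation keeps the column average unchanged but understates its spread, which is one reason the slide poses it as a question.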


Build Computer Model

• Convert mental model into quantitative
  – roamers less sensitive to price than others
    • threshold defining roamer
    • average price per call, or number of calls above price level
  – families with children in high school most likely to respond to home equity loan offer
    • identify families with, without high-school-age children
    • past data - responded or didn’t
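
As a sketch of turning the roamer hypothesis into something measurable, the code below picks a threshold that defines a roamer and computes both candidate measures per customer. The call records, the 50% roaming threshold, and the price level are all hypothetical.

```python
from statistics import mean

# Hypothetical call records: (customer_id, is_roaming, price_per_call).
calls = [
    ("a", True, 0.90), ("a", True, 1.10), ("a", False, 0.20),
    ("b", False, 0.15), ("b", False, 0.25), ("b", True, 0.80),
    ("c", False, 0.10), ("c", False, 0.30), ("c", False, 0.20),
]

ROAMER_THRESHOLD = 0.5   # assumed: at least half of calls made while roaming
PRICE_LEVEL = 0.50       # assumed price level separating expensive calls

def profile(customer_calls):
    """Return (is_roamer, average price per call, calls above price level)."""
    roaming_share = mean([1.0 if roaming else 0.0 for _, roaming, _ in customer_calls])
    prices = [price for _, _, price in customer_calls]
    return (
        roaming_share >= ROAMER_THRESHOLD,
        mean(prices),
        sum(price > PRICE_LEVEL for price in prices),
    )

# Group calls by customer, then profile each customer.
by_customer = {}
for call in calls:
    by_customer.setdefault(call[0], []).append(call)

profiles = {cid: profile(cs) for cid, cs in by_customer.items()}
```

Changing the threshold changes who counts as a roamer, so the hypothesis is only testable once that definition is fixed.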


Evaluate Model

• Determine if hypotheses supported
  – statistical practice
  – test rule-based systems for accuracy

• Requires both business and analytic knowledge
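
One common piece of statistical practice for a hypothesis such as the home equity example is a two-proportion z-test on past response data. This is a minimal sketch with made-up counts; the test choice and the 0.05 cutoff are assumptions, not something the slides prescribe.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """One-sided z-test that group A responds at a higher rate than group B."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical counts from a past mailing:
# group A = families with children in high school, group B = all others.
z, p_value = two_proportion_z(hits_a=60, n_a=1000, hits_b=30, n_b=1000)
supported = p_value < 0.05   # conventional cutoff, assumed here
```

A small p-value says the difference is unlikely to be chance; whether the difference is large enough to act on is the business judgment the slide refers to.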


SUPERVISED

Dorn, National Underwriter Oct 18, 2004, 34,39

• Health care fraud
  – Use statistics to identify indicators of fraud or abuse
  – Can rapidly sort through large databases
    • Identify patterns different from norm
  – Moderately successful
    • But only effective on schemes already detected
• To benefit firm, need to identify fraud before paying claim


Knowledge Discovery

• Machine learning?
  – Usually need intelligent analyst

• Directed: explain value of some variable

• Undirected: no dependent variable selected
  – identify patterns

• Use undirected to recognize relationships; use directed to explain once found


Directed

• Goal-oriented
• Examples:
  – If discount applies, impact on products
  – Who is likely to purchase credit insurance?
  – Predicted profitability of new customer
  – What to bundle with a particular package
• Identify sources of preclassified data
• Prepare data for analysis
• Build & train computer model
• Evaluate


Identify Data Sources

• Best - existing corporate data warehouse
  – data clean, verified, consistent, aggregated
• Usually need to generate
  – most data in form most efficient for designed purpose
  – historical sales data often purged for dormant customers (but you need that information)


Prepare Data

• Put in needed format for computer

• Make consistent in meaning

• Need to recognize what data are missing
  – change in balance = new – old
  – add missing but known-to-be-important data

• Divide data into training, test, evaluation

• Decide how to treat outliers
  – statistically biasing, but may be most important
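
A minimal sketch of two steps from this slide: adding the derived change-in-balance field and dividing the data three ways. The account records are invented, and the 60/20/20 proportions are an assumption since the slide gives none.

```python
import random

# Hypothetical account records with old and new balances.
accounts = [
    {"id": i, "old": 100.0 * i, "new": 100.0 * i + 10.0 * (i % 7)}
    for i in range(100)
]

# Derived field from the slide: change in balance = new - old.
for acct in accounts:
    acct["change"] = acct["new"] - acct["old"]

def three_way_split(rows, train=0.6, test=0.2, seed=0):
    """Shuffle a copy, then cut it into training, test, and evaluation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_test = int(len(rows) * test)
    return rows[:n_train], rows[n_train:n_train + n_test], rows[n_train + n_test:]

training, test_set, evaluation = three_way_split(accounts)
```

Shuffling before cutting keeps any ordering in the source file (by date or by region, for example) from leaking into one of the three sets.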


Build & Train Model

• Regression - human builds (selects IVs)

• Automatic systems train
  – give it data, let it hammer

• OVERFITTING:
  – fit the data
  – TEST SET a means to evaluate model against data not used in training
    • tune weights before using to evaluate


Evaluate Model

• ERROR RATE: proportion of classifications in evaluation set that were wrong

• too little training: poor fit on training data and poor error rate

• optimal training: good fit on both

• too much training: great fit on training data and poor error rate
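
The error rate and the effect of too much training can be illustrated on a toy problem. The data, the one noisy label, and both models below are hypothetical: a nearest-point memorizer stands in for an over-trained model, and a single cut point stands in for a simpler one.

```python
def error_rate(predicted, actual):
    """Proportion of classifications that were wrong."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

# Hypothetical 1-D data: the true rule is label = 1 when x >= 5,
# but one training label (at x = 2) is noise.
train_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
train_y = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
eval_x = [1.6, 2.4, 4.4, 5.6, 7.5]
eval_y = [0, 0, 0, 1, 1]

def memorizer(x):
    """Over-trained model: copy the label of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def threshold_rule(x):
    """Simpler model: a single cut point."""
    return 1 if x >= 5 else 0

# For each model: (error rate on training data, error rate on evaluation set).
results = {
    name: (
        error_rate([model(x) for x in train_x], train_y),
        error_rate([model(x) for x in eval_x], eval_y),
    )
    for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]
}
```

The memorizer fits the training data perfectly, noise included, and pays for it on the evaluation set; the cut point misses the noisy training label but gets every evaluation case right.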


Undirected Discovery

• What items sell together? Strawberries & cream
  – Directed: What items sell with tofu? tabasco
• Long distance caller market segmentation
  – Uniform usage - weekday & weekend, spikes on holidays
  – After segmentation:
    • high & uniform except for several months of nothing
    • high credit worthiness & profitability
    • college students
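
The difference between the undirected and directed questions can be sketched by counting item pairs in baskets. The baskets are invented for illustration, and raw co-occurrence counting is only the simplest form of market-basket analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets.
baskets = [
    {"strawberries", "cream", "sugar"},
    {"strawberries", "cream"},
    {"tofu", "tabasco"},
    {"tofu", "tabasco", "rice"},
    {"cream", "coffee"},
]

# Undirected: count every pair of items that appears in the same basket.
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)
top_pairs = pair_counts.most_common(2)

# Directed: fix one item and ask what sells with it.
def sells_with(item):
    return Counter(
        other
        for basket in baskets if item in basket
        for other in basket if other != item
    )

tofu_partners = sells_with("tofu")
```

The undirected pass has no target item and lets the frequent pairs surface on their own; the directed pass starts from tofu and only looks at its companions.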


UNSUPERVISED

Dorn, National Underwriter Oct 18, 2004, 34,39

• Health care fraud
  – Look at historical claim submissions
    • Build ad hoc model to compare with current claims
  – Assign similarity score to fraudulent claims
  – Predict fraud potential
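
One way to picture the similarity score is to compare each current claim with historical claims known to be fraudulent. This is a sketch under stated assumptions: the feature vectors are invented, and cosine similarity is a stand-in, since the source does not say which measure was used.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical features: (procedures billed, charge in $1000s, visits in month).
known_fraud = [(12.0, 9.0, 1.0), (10.0, 8.0, 2.0)]

def fraud_score(claim):
    """Highest similarity between a claim and any historical fraudulent claim."""
    return max(cosine(claim, past) for past in known_fraud)

current_claims = {"c1": (11.0, 8.5, 1.0), "c2": (1.0, 0.3, 4.0)}
scores = {cid: fraud_score(vec) for cid, vec in current_claims.items()}

# Rank claims so the most fraud-like are reviewed first.
review_order = sorted(scores, key=scores.get, reverse=True)
```

In practice the features would need scaling before comparison, because an unscaled measure is dominated by whichever field has the largest numbers.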


Undirected Process

• Identify data sources

• Prepare data

• Build & train computer model

• Evaluate model

• Apply model to new data

• Identify potential targets for undirected

• Generate new hypotheses to test


Identify potential targets

• Why

• Who

• When


Generate hypotheses

• Any commonalities in data?

• Are they useful?
  – Many adults watch children’s movies
    • chaperones are an important market segment
    • they probably make final decision

• When hypothesis is generated, that determines data needed


Bank Case Study

• Directed knowledge discovery to recognize likely prospects for home equity loan
  – training set - current loan holders
  – developed model for propensity to borrow
  – got continuous scores, ranked customers
  – sent top 11% material

• Undirected: segmented market into clusters
  – in one, 39% had both business & personal accounts
  – cluster had 27% of the top 11%

• Hypothesis: people use home equity to start business
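
The mechanics of the case study, ranking customers by a continuous score, keeping the top 11%, and then checking how that group spreads across clusters, can be sketched as follows. The customers, scores, and cluster labels are randomly generated stand-ins, not the bank's data.

```python
import random

rng = random.Random(1)

# Hypothetical customers: a propensity score from the directed model and a
# cluster label from the undirected segmentation.
customers = [
    {"id": i, "score": rng.random(), "cluster": rng.randrange(5)}
    for i in range(1000)
]

TOP_FRACTION = 0.11   # the case study mailed the top 11%

# Rank by continuous score and keep the top slice.
ranked = sorted(customers, key=lambda c: c["score"], reverse=True)
top = ranked[: round(len(ranked) * TOP_FRACTION)]

def share_of_top(cluster):
    """Fraction of the mailed group that belongs to a given cluster."""
    return sum(c["cluster"] == cluster for c in top) / len(top)

shares = {k: share_of_top(k) for k in range(5)}
```

In the bank's data one cluster held 27% of the top group; a share like that, set against the cluster's account mix, is what prompted the home-equity-to-start-a-business hypothesis.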