Introduction to Data Mining
Massive quantities of data exist on computers
Data mining is a way to use these data to learn
McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved
Definition
• DATA MINING: exploration & analysis
  – by automatic means
  – of large quantities of data
  – to discover actionable patterns & rules
• Data mining is a way to use massive quantities of data that businesses generate
• GOAL - improve marketing, sales, customer support through better understanding of customers
Retail Outlets
• Bar coding & scanning generate masses of data
  – customer service
  – inventory control
  – MICROMARKETING
  – CUSTOMER PROFITABILITY ANALYSIS
  – MARKET-BASKET ANALYSIS
Political Data Mining
Grossman et al., 10/18/2004, Time, 38
• 2004 Election
  – Republicans: VoterVault
    • From mid-1990s
    • About 165 million voters
    • Massive get-out-the-vote drive for those expected to vote Republican
  – Democrats: Demzilla
    • Also about 165 million voters
    • Names typically have 200 to 400 information items
Medical Diagnosis
J. Morris, Health Management Technology Nov 2004, 20, 22-24
• Electronic Medical Records
  – Associated Cardiovascular Consultants
    • 31 physicians
    • 40,000 patients per year, southern New Jersey
  – Data mined to identify efficient medical practice
  – Enhance patient outcomes
  – Reduced medical liability insurance
Mayo Clinic
Swartz, Information Management Journal Nov/Dec 2004, 8
• IBM developed EMR program
  – Complete records on almost 4.4 million patients
  – Doctors can ask how the last 100 Mayo patients with the same gender, age, and medical history responded to particular treatments
Business Uses of Data Mining
1. Customer profiling
   Identify profitability of customers
2. Targeting
   Determine characteristics of most profitable customers
3. Market-Basket Analysis
   Determine correlation of purchases by profile
Part of Customer Relationship Management
Reasons why Data Mining is now effective
• Data are there
• Data are warehoused (computerized)
  – Walmart: 35 thousand queries per week
• Computing economically available
• Competitive pressure
• Commercial products available
Trends
• Every business is service
  – hotel chains record your preferences
  – car rental companies the same
  – service versus price
    • credit card companies
    • long distance providers
    • airlines
    • computer retailers
Trends
• Mass Customization
  – produce tailored products from standardized components
    • Levi-Strauss - custom fit jeans
    • The Custom Foot
    • Andersen Windows
    • Individual, Inc.
      – electronic clipping
      – customer profiles of interests
      – send custom newsletter
Trends
• Information as Product
  – Custom Clothing Technology Corporation
    • fit jeans, other clothing
  – Lands End
  – J. Crew
• INFORMATION BROKERING
  – IMS - collects prescription data from pharmacies, sells to drug firms
  – AC Nielsen - TV
Trends
• Commercial Software Available
  – using statistical, artificial intelligence tools that have been developed
    • Enterprise Miner (SAS)
    • Intelligent Miner (IBM)
    • Clementine (SPSS)
    • PolyAnalyst (Megaputer)
    • Specialty products
How Data Mining Is Being Used
• U.S. Government
  – track down Oklahoma City bombers, Unabomber, many others
  – Treasury department - international funds transfers, money laundering
  – Internal Revenue Service
How Data Mining Is Used
• Safeway
  – offer Safeway Savings Club card
    • users given discounts
    • users must give personal information
    • every use, collect data
  – identify aggregate patterns (what sells well together; what should be sold together)
    • sell names for 5.5 cents per name to suppliers
How Data Mining Is Used
• Firefly
  – asks members to rate music and movies
  – subscribers clustered
  – clusters get custom-designed recommendations
Cross-selling
• USAA – insurance
  – doubled number of products held by average customer due to data mining
  – detailed records on customers
  – predict products they might need
• Fidelity Investments
  – regression - what makes customer loyal
Warranty Claims Routing
• Diesel engine manufacturer
  – stream of warranty claims
  – examine each by expert
    • determine whether charges are reasonable & appropriate
    • think of expert system to automate claims processing
Retaining Good Customers
• Customer loss:
  – Banks - Attrition
  – Cellular Phone Companies - Churn
    • study who might leave, why
• Southern California Gas
  – customer usage, credit information
  – direct mail contact - most likely best billing plan
  – who is price sensitive
• Who should get incentives, whom to keep
Fairbank & Morris
• Credit card company’s most valuable asset:
  – INFORMATION ABOUT CUSTOMERS
• Signet Banking Corporation
  – obtained behavioral data from many sources
  – built predictive models
  – aggressively marketed balance transfer card
• First Union
  – who will move soon - improve retention
Methodology
Analyzing data
Given management goals and that management can translate knowledge into action
Basic Styles
• Top-Down: HYPOTHESIS TESTING
  – SUPERVISED
  – have a theory, experiment to prove or disprove
  – SCIENCE
• Bottom-Up: KNOWLEDGE DISCOVERY
  – UNSUPERVISED
  – start with data, see new patterns
  – CREATIVITY
Hypothesis Testing
• Generate theory
• Determine data needed
• Get data
• Prepare data
• Build computer model
• Evaluate model results
  – confirm or reject hypotheses
Generate Theory
• Study
• Systematically tie different input sources together (MENTAL MODEL)
  – What causes sales volume?
    • sales rep performance
    • economy, seasonality
    • product quality, price, promotion, location
Generate Theory
• Brainstorm:
  – diverse representatives for broad coverage of perspectives (electronic)
  – keep under control (keep positive)
  – generate testable hypotheses
Define Data Needed
• Determine data needed to test hypothesis
  – Lucky - query existing database
  – More often - gather
    • pull together from diverse databases, survey, buy
Locate Data
• Usually scattered or unavailable
• Sources:
  – warranty claims
  – point-of-sale data (cash register records)
  – medical insurance claims
  – telephone call detail records
  – direct mail response records
  – demographic data, economic data
• PROFILE: counts, summary statistics, cross-tabs, cleanup
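A minimal sketch of the PROFILE step, assuming the located data has already been loaded as a list of records. The field names (`region`, `channel`, `amount`) and the values are hypothetical, chosen only to show counts, summary statistics, and a cross-tab in one pass.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical point-of-sale records; field names are illustrative only.
records = [
    {"region": "North", "channel": "store", "amount": 20.0},
    {"region": "North", "channel": "mail", "amount": 35.0},
    {"region": "South", "channel": "store", "amount": 15.0},
    {"region": "South", "channel": "store", "amount": None},  # missing value
    {"region": "North", "channel": "store", "amount": 50.0},
]

def profile(rows, numeric_field, row_field, col_field):
    """Return counts, summary statistics, and a cross-tab for a list of dicts."""
    values = [r[numeric_field] for r in rows if r[numeric_field] is not None]
    summary = {
        "rows": len(rows),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }
    # Cross-tab: number of rows in each (row_field, col_field) cell.
    crosstab = Counter((r[row_field], r[col_field]) for r in rows)
    return summary, crosstab

summary, crosstab = profile(records, "amount", "region", "channel")
print(summary)
print(dict(crosstab))
```

The missing-value count is reported rather than silently dropped, since cleanup decisions depend on it.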
Prepare Data for Analysis
• Summarize:
  – too much - no discriminant information
  – too little - swamped with useless detail
• Process for computer: EBCDIC, ASCII
• Data encoding: how data are recorded can vary - may have been collected with specific purpose (CAL omitting LA)
• Textual data: avoid if possible (may need to code)
• Missing values: missing salary - use mean?
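The "use mean?" question can be made concrete with a small sketch. The salary figures are hypothetical. The point is that mean imputation keeps the average unchanged but shrinks the spread, which is one reason the slide poses it as a question rather than a rule.

```python
from statistics import mean, pstdev

salaries = [30000, 42000, None, 58000, None, 50000]  # hypothetical values

def impute_mean(values):
    """Replace None with the mean of the observed (non-missing) values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

observed = [v for v in salaries if v is not None]
filled = impute_mean(salaries)

# The mean is preserved, but the standard deviation gets smaller.
print(mean(observed), mean(filled))
print(pstdev(observed), pstdev(filled))
```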
Build Computer Model
• Convert mental model into quantitative
  – roamers less sensitive to price than others
    • threshold defining roamer
    • average price per call, or number of calls above price level
  – families with children in high school most likely to respond to home equity loan offer
    • identify families with, without high school age children
    • past data - responded or didn’t
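As an illustration of turning the roamer statement into something measurable, the sketch below picks a threshold and compares average price per call between the two groups. The 25% cutoff, the field names, and the figures are all assumptions made for the example, not values from the slides.

```python
from statistics import mean

# Hypothetical call-detail summary per customer (all names are illustrative).
customers = [
    {"id": 1, "roaming_calls": 40, "total_calls": 100, "total_charge": 90.0},
    {"id": 2, "roaming_calls": 2, "total_calls": 80, "total_charge": 24.0},
    {"id": 3, "roaming_calls": 30, "total_calls": 60, "total_charge": 66.0},
    {"id": 4, "roaming_calls": 0, "total_calls": 50, "total_charge": 10.0},
]

ROAMER_THRESHOLD = 0.25  # assumed cutoff: share of calls made while roaming

def is_roamer(c):
    """Threshold defining a roamer."""
    return c["roaming_calls"] / c["total_calls"] >= ROAMER_THRESHOLD

def avg_price_per_call(c):
    """One possible price measure: total charge divided by number of calls."""
    return c["total_charge"] / c["total_calls"]

roamers = [avg_price_per_call(c) for c in customers if is_roamer(c)]
others = [avg_price_per_call(c) for c in customers if not is_roamer(c)]
print(mean(roamers), mean(others))
```

A higher average price among roamers is only consistent with the hypothesis; it does not by itself confirm lower price sensitivity.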
Evaluate Model
• Determine if hypotheses supported
  – statistical practice
  – test rule-based systems for accuracy
• Requires both business and analytic knowledge
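One common piece of statistical practice for a hypothesis such as "group A responds more often than group B" is a two-proportion z-test. The sketch below is one possible choice, not necessarily the method the slides intend, and the response counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic and one-sided p-value for H1: rate_a > rate_b."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical counts: loan-offer responders among families with (a)
# and without (b) children in high school.
z, p = two_proportion_z(60, 1000, 30, 1000)
print(round(z, 2), p)
```

A small p-value supports the hypothesis statistically; whether the difference is worth acting on is the business judgment the slide refers to.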
SUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
• Health care fraud
  – Use statistics to identify indicators of fraud or abuse
  – Can rapidly sort through large databases
    • Identify patterns different from norm
  – Moderately successful
    • But only effective on schemes already detected
    • To benefit firm, need to identify fraud before paying claim
Knowledge Discovery
• Machine learning?
  – Usually need intelligent analyst
• Directed: explain value of some variable
• Undirected: no dependent variable selected
  – identify patterns
• Use undirected to recognize relationships; use directed to explain once found
Directed
• Goal-oriented
• Examples:
  – If discount applies, impact on products
  – Who is likely to purchase credit insurance?
  – Predicted profitability of new customer
  – What to bundle with a particular package
• Identify sources of preclassified data
• Prepare data for analysis
• Build & train computer model
• Evaluate
Identify Data Sources
• Best - existing corporate data warehouse
  – data clean, verified, consistent, aggregated
• Usually need to generate
  – most data in form most efficient for designed purpose
  – historical sales data often purged for dormant customers (but you need that information)
Prepare Data
• Put in needed format for computer
• Make consistent in meaning
• Need to recognize what data are missing
  – change in balance = new – old
  – add missing but known-to-be-important data
• Divide data into training, test, evaluation
• Decide how to treat outliers
  – statistically biasing, but may be most important
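Two of the preparation steps above, adding a derived field and dividing the data, can be sketched as follows. The 60/20/20 proportions, the fixed seed, and the account records are assumptions for illustration; the slides do not specify split sizes.

```python
import random

def split_data(rows, train=0.6, test=0.2, seed=0):
    """Shuffle a copy of rows and cut it into training, test, and evaluation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # whatever remains is the evaluation set
    )

# Hypothetical account records with the derived field: change in balance = new - old.
accounts = [{"id": i, "old": 100 * i, "new": 100 * i + 10 * (i % 5)} for i in range(50)]
for a in accounts:
    a["change"] = a["new"] - a["old"]

training, test_set, evaluation = split_data(accounts)
print(len(training), len(test_set), len(evaluation))
```

Shuffling before cutting keeps any ordering in the source file (by date or by customer number, say) from leaking into one of the three sets.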
Build & Train Model
• Regression - human builds (selects IVs)
• Automatic systems train
  – give it data, let it hammer
• OVERFITTING:
  – fit the data
  – TEST SET a means to evaluate model against data not used in training
    • tune weights before using to evaluate
Evaluate Model
• ERROR RATE: proportion of classifications in evaluation set that were wrong
• too little training: poor fit on training data and poor error rate
• optimal training: good fit on both
• too much training: great fit on training data and poor error rate
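The error rate and the "too much training" case can be illustrated with a toy comparison. The data and both models are hypothetical: a 1-nearest-neighbour "memorizer" stands in for an overtrained model, and a fixed threshold rule stands in for a simpler one.

```python
def error_rate(model, labelled):
    """Proportion of cases in `labelled` that `model` classifies wrongly."""
    wrong = sum(1 for x, y in labelled if model(x) != y)
    return wrong / len(labelled)

def make_memorizer(train):
    # 1-nearest-neighbour: copies the label of the closest training case,
    # so it reproduces the training set perfectly, noise included.
    def predict(x):
        nearest = min(train, key=lambda case: abs(case[0] - x))
        return nearest[1]
    return predict

def threshold_rule(x):
    # Simple rule: class 1 when x is at least 5.
    return 1 if x >= 5 else 0

# Hypothetical data; the cases (3, 1) and (8, 0) in training are noise.
training = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
evaluation = [(1.1, 0), (2.1, 0), (3.1, 0), (4.1, 0),
              (6.1, 1), (7.1, 1), (8.1, 1), (9.1, 1)]

memorizer = make_memorizer(training)
for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    print(name, error_rate(model, training), error_rate(model, evaluation))
```

The memorizer fits the training data perfectly but carries its noise into the evaluation set, which is the pattern the last bullet describes.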
Undirected Discovery
• What items sell together? Strawberries & cream
  – Directed: What items sell with tofu? tabasco
• Long distance caller market segmentation
  – Uniform usage - weekday & weekend, spikes on holidays
  – After segmentation:
    • high & uniform except for several months of nothing
    • high credit worthiness & profitability
    • college students
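The difference between the undirected question (what sells together?) and the directed one (what sells with tofu?) can be sketched with simple co-occurrence counts. The baskets are hypothetical, and raw pair counting is a simplification of full market-basket analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, one set of items per basket.
baskets = [
    {"strawberries", "cream", "sugar"},
    {"strawberries", "cream"},
    {"tofu", "tabasco", "rice"},
    {"tofu", "tabasco"},
    {"cream", "coffee"},
]

def pair_counts(baskets):
    """Undirected: count every pair of items that appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def sells_with(item, baskets):
    """Directed: count what else is bought in baskets containing `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})
    return counts

pairs = pair_counts(baskets)
with_tofu = sells_with("tofu", baskets)
print(pairs.most_common(2))
print(with_tofu.most_common(1))
```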
UNSUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
• Health care fraud
  – Look at historical claim submissions
    • Build ad hoc model to compare with current claims
  – Assign similarity score to fraudulent claims
  – Predict fraud potential
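One way to read "assign similarity score" is to score each current claim by how close it is to known fraudulent claims. The sketch below assumes claims are numeric feature vectors and uses a distance-based similarity; the features, the scoring formula, and the data are assumptions, not the method described in the cited article.

```python
from math import dist

# Hypothetical historical claims judged fraudulent:
# (amount in $1000s, procedures billed, visits per month).
known_fraud = [(9.0, 12, 8), (7.5, 10, 9)]

def similarity(a, b):
    """Map Euclidean distance into (0, 1]; 1.0 means identical feature vectors."""
    return 1.0 / (1.0 + dist(a, b))

def fraud_potential(claim, fraud_cases=known_fraud):
    """Score a claim by its similarity to the closest known fraudulent case."""
    return max(similarity(claim, f) for f in fraud_cases)

# Hypothetical current claims. Real features would need scaling to comparable units.
current_claims = {"A": (9.0, 12, 8), "B": (1.0, 2, 1), "C": (8.0, 11, 8)}
scores = {cid: fraud_potential(c) for cid, c in current_claims.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Ranking claims by score lets an investigator review the most suspicious ones first, ideally before the claim is paid.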
Undirected Process
• Identify data sources
• Prepare data
• Build & train computer model
• Evaluate model
• Apply model to new data
• Identify potential targets for undirected
• Generate new hypotheses to test
Identify potential targets
• Why
• Who
• When
Generate hypotheses
• Any commonalities in data?
• Are they useful?
  – Many adults watch children’s movies
    • chaperones are an important market segment
    • they probably make final decision
• When hypothesis is generated, that determines data needed
Bank Case Study
• Directed knowledge discovery to recognize likely prospects for home equity loan
  – training set - current loan holders
  – developed model for propensity to borrow
  – got continuous scores, ranked customers
  – sent top 11% material
• Undirected: segmented market into clusters
  – in one, 39% had both business & personal accounts
  – cluster had 27% of the top 11%
• Hypothesis: people use home equity to start business
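The two halves of the case study meet in one calculation: rank customers by score, take the top 11%, then see how that group is distributed across clusters. The sketch below uses randomly generated scores and cluster labels as stand-ins, so its output will not reproduce the 27% figure from the case.

```python
import random
from collections import Counter

rng = random.Random(1)
# Hypothetical scored customers: (customer id, propensity score, cluster label).
customers = [(i, rng.random(), rng.choice("ABC")) for i in range(200)]

def top_fraction(scored, fraction):
    """Rank by score, highest first, and keep the top `fraction` of customers."""
    ranked = sorted(scored, key=lambda c: c[1], reverse=True)
    return ranked[:round(len(ranked) * fraction)]

def cluster_share(selected):
    """Share of the selected customers that falls in each cluster."""
    counts = Counter(c[2] for c in selected)
    return {label: n / len(selected) for label, n in counts.items()}

mailing = top_fraction(customers, 0.11)
shares = cluster_share(mailing)
print(len(mailing), shares)
```

A cluster holding a much larger share of the top group than of the customer base overall is the kind of pattern that prompts a new hypothesis, such as the one about business owners.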