Data Mining

© 2006, HEC Montréal. www.hec.ca/sap/ERPsim

The majority of reports are based on known facts

BUT

We don’t know what we don’t know

Definition

Data mining is the process of discovering meaningful new correlations, patterns and trends by "mining" large amounts of stored data using pattern recognition technologies, as well as statistical and mathematical techniques.

(Ashby, Simms (1998))

Data Mining Examples

• Market Basket Analysis and Up-Selling/Cross-Selling
• Pharmaceutical Industry: Drug Effectiveness by Patient Type
• Defect Analysis in Manufacturing
• University and Employee Recruitment
• Employee Turnover Predictions
• Credit Risk Determination
• Credit Card Fraud
• Customer Grouping and Behaviour Prediction

CRISP-DM: Phases and Tasks

CRISP-DM: CRoss Industry Standard Process for Data Mining. Initiative launched Sept. 1996

Business Understanding
• Determine Business Objectives: Background, Business Objectives, Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals, Data Mining Success Criteria
• Produce Project Plan: Project Plan, Initial Assessment of Tools and Techniques

Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report

Data Preparation
• Data Set: Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes, Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data

Modeling
• Select Modeling Technique: Modeling Technique, Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings, Models, Model Description
• Assess Model: Model Assessment, Revised Parameter Settings

Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria, Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions, Decision

Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report, Final Presentation
• Review Project: Experience Documentation

SAP BI Analysis Process Designer (APD)

Data Mining Methods: Predictive vs Informative

Association Analysis

Association Analysis Data Mining

[Figure: Customers, Products (A to E), Cross-Selling Rules]

What products / services are typically bought together?

Export rules to Web Shop

Use in merchandising

Informative: Association Analysis - Example

Small Example

Rule: Diapers -> Beer

Support: 60% (3/5)
• 60% of all purchases have diapers and beer

Confidence: 75% (3/4)
• If diapers are purchased, 75% chance of buying beer

Lift: 1.25 (75/60, confidence divided by the 60% share of all purchases that contain beer)
• If diapers purchased, person is 1.25 times more likely to purchase beer

url: http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
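The three measures can be checked with a few lines of Python. This is only a sketch: the slide's transaction table is not in the text, so the five baskets below are hypothetical, chosen to reproduce the 3/5 and 3/4 counts, and the function names are illustrative.

```python
from fractions import Fraction

# Hypothetical baskets, chosen so the counts match the slide (3/5 and 3/4).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(items, baskets):
    """Share of all baskets that contain every item in `items`."""
    items = set(items)
    return Fraction(sum(items <= basket for basket in baskets), len(baskets))

def confidence(lhs, rhs, baskets):
    """Share of baskets containing `lhs` that also contain `rhs`."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

def lift(lhs, rhs, baskets):
    """Confidence of the rule relative to how often `rhs` occurs overall."""
    return confidence(lhs, rhs, baskets) / support(rhs, baskets)

rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
rule_lift = lift({"diapers"}, {"beer"}, baskets)
print(rule_support, rule_confidence, float(rule_lift))
```

A lift above 1 means the two products appear together more often than they would if purchases were independent.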

Rule Identification

Brute force: Examine all combinations to see which have high support, confidence & lift

What is the problem with this approach?

Algorithms developed to reduce # of rules considered: first find frequent itemsets (support), then generate high-confidence rules

url: http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
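The two-step idea (frequent itemsets first, then rules) might look like the following level-wise search, in the spirit of the Apriori algorithm. It is a sketch under assumptions: the baskets are hypothetical, the thresholds are arbitrary, and the function names are not from the slides.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical baskets (not from the slides).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def frequent_itemsets(baskets, min_support):
    """Level-wise search: a k-itemset is counted only if all its (k-1)-subsets are frequent."""
    n = len(baskets)
    current = [frozenset([item]) for item in sorted(set().union(*baskets))]
    frequent = {}
    k = 1
    while current:
        level = {}
        for candidate in current:
            share = Fraction(sum(candidate <= b for b in baskets), n)
            if share >= min_support:
                level[candidate] = share
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

def rules(frequent, min_confidence):
    """Split each frequent itemset into lhs -> rhs and keep high-confidence rules."""
    found = []
    for itemset, share in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = share / frequent[lhs]
                if conf >= min_confidence:
                    found.append((lhs, itemset - lhs, conf))
    return found

frequent = frequent_itemsets(baskets, min_support=Fraction(3, 5))
found_rules = rules(frequent, min_confidence=Fraction(3, 4))
```

The prune step is what avoids brute force: once an itemset falls below the support threshold, no larger itemset containing it is ever counted.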

Clustering

Clustering is a data mining technique that creates groups of records that are:

• Similar to each other within a particular group
• Very different across different groups

The degree of association between members is measured by all the characteristics specified in the analysis

Clustering helps the user explore vast amounts of data and organize it in a systematic way
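As an illustration of grouping records by similarity, here is a minimal k-means sketch. The slides do not name a clustering algorithm, so k-means is an assumption, as are the customer records (age, income in $1000s) and the starting centroids.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid unchanged
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers as (age, income in $1000s).
customers = [(22, 30), (25, 35), (27, 32), (55, 90), (60, 95), (58, 85)]
centroids, clusters = kmeans(customers, centroids=[customers[0], customers[3]])
```

In practice the attributes would usually be scaled first, so that one characteristic (here income) does not dominate the distance.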

[Figure: clusters plotted by Income and Age, each axis running Low to High]

Clustering Process

ABC Analysis

Informative: ABC Classification

Use ABC to classify objects (such as customers, employees, vendors or products) based on a particular measure (such as revenue or profit).

Examples:
• Customers with revenue >$100M = Class “A”, etc
• Customers who generate top 20% of our revenue = Class “A”, etc
• Rank customers by their revenue:
  • The top 20% on the list = Class “A”, etc
  OR
  • The first 50 customers = Class “A”, etc

Practical applications:
• Classify customers into Platinum, Gold, Silver
• Rank vendors based on product quality (returned goods)
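The rank-based variant ("the top 20% on the list = Class A") can be sketched as follows. The 30% cut-off for Class "B", the customer names, and the revenue figures are all assumptions made for illustration.

```python
def abc_by_rank(revenue_by_customer, a_share=0.2, b_share=0.3):
    """Rank customers by revenue; the top a_share are 'A', the next b_share 'B', the rest 'C'."""
    ranked = sorted(revenue_by_customer, key=revenue_by_customer.get, reverse=True)
    a_end = round(len(ranked) * a_share)
    b_end = a_end + round(len(ranked) * b_share)
    return {
        name: "A" if position < a_end else "B" if position < b_end else "C"
        for position, name in enumerate(ranked)
    }

# Hypothetical revenue per customer.
revenue = {
    "C01": 900, "C02": 750, "C03": 400, "C04": 320, "C05": 300,
    "C06": 150, "C07": 120, "C08": 90, "C09": 60, "C10": 30,
}
classes = abc_by_rank(revenue)
```

The threshold-based variants on the slide (revenue above a fixed amount, or share of total revenue) would replace the rank test with a comparison on the measure itself.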

Informative: ABC Analysis - Example

Classification/Decision Trees

Selected Customers - Historical Data (query)

Customer        Income      Age   Credit Rating   Etc.   Buying Behavior
Mick Jagger     $ 10 000    48    Excellent       …      Yes
Elton John      $ 3000      22    Fair            …      No
Tina Turner     $ 8000      36    Excellent       …      Yes
Etc.            …           …     …               …      …

How will other Customers behave?

New Data (query)

Customer        Income      Age   Credit Rating   Etc.   Buying Behavior
Willie Nelson   $ 6500      34    Fair            …      ?
Carol King      $ 2000      63    Excellent       …      ?
Etc.            …           …     …               …      ?

• Identify the factors driving customer behavior and predict future behavior

Predictive: Decision Tree

Model process:

A record in the query starts at the root node

A test (in the model) determines which node the record should go to next

All records end up in a leaf node

Interpreting the Results

Read the tree from top to bottom

Rule: If Age is less than 35 and Income is greater than $5000 and Credit standing is Fair, then the customer has a 35% chance of buying the product

Age, then Income and credit rating, are the most influential attributes determining buying behavior.
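The model process (start at the root, apply a test at each node, end in a leaf) can be sketched as a short traversal. Only the Age < 35, Income > $5000, Credit Rating = Fair path and its 35% / 65% leaf come from the slide; where the other leaves sit in the tree is an assumption, and the tuple layout and the `classify` name are illustrative.

```python
# A decision node is (attribute, test, branch_if_true, branch_if_false); a leaf is a dict.
tree = (
    "Age", lambda value: value < 35,
    (
        "Income", lambda value: value > 5000,
        (
            "Credit Rating", lambda value: value == "Fair",
            {"Buy": 0.35, "Won't Buy": 0.65},  # leaf taken from the slide's rule
            {"Buy": 1.0, "Won't Buy": 0.0},    # assumed placement
        ),
        {"Buy": 0.0, "Won't Buy": 1.0},        # assumed placement
    ),
    {"Buy": 1.0, "Won't Buy": 0.0},            # assumed placement
)

def classify(record, node):
    """Walk from the root to a leaf, recording which tests the record passed."""
    path = []
    while not isinstance(node, dict):
        attribute, test, if_true, if_false = node
        passed = test(record[attribute])
        path.append((attribute, passed))
        node = if_true if passed else if_false
    return node, path

# Willie Nelson's record from the New Data query.
willie = {"Age": 34, "Income": 6500, "Credit Rating": "Fair"}
leaf, path = classify(willie, tree)
```

Reading `path` top to bottom gives the rule that applies to the record, in the same way the tree is read on the slide.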

[Figure: decision tree. Root Node: Age (Test: <35 / >=35). Decision Nodes: Income (>$5000 / <=$5000) and Credit Rating (Fair / Excellent). Leaf Nodes: Buy 100%; Won’t Buy 100%; Buy 35% / Won’t Buy 65%.]

Predictive: Decision Tree

Play Golf Dataset

Case  Outlook   Temp  Humidity  Windy  Play
a     sunny     hot   high      FALSE  no
b     sunny     hot   high      TRUE   no
c     overcast  hot   high      FALSE  yes
d     rainy     mild  high      FALSE  yes
e     rainy     cool  normal    FALSE  yes
f     rainy     cool  normal    TRUE   no
g     overcast  cool  normal    TRUE   yes
h     sunny     mild  high      FALSE  no
i     sunny     cool  normal    FALSE  yes
j     rainy     mild  normal    FALSE  yes
k     sunny     mild  normal    TRUE   yes
l     overcast  mild  high      TRUE   yes
m     overcast  hot   normal    FALSE  yes
n     rainy     mild  high      TRUE   no

Decision Tree of Golf Data

All cases: Play 9, Don’t Play 5

Outlook?
• Sunny: Play 2, Don’t Play 3
  Humidity?
  • < 70%: Play 2, Don’t Play 0
  • > 70%: Play 0, Don’t Play 3
• Overcast: Play 4, Don’t Play 0
• Rain: Play 3, Don’t Play 2
  Windy?
  • True: Play 0, Don’t Play 2
  • False: Play 3, Don’t Play 0

Conclusion

The best way to explain the attribute “play” is with the attribute Outlook
• First conclusion, people always play when it’s overcast
• On days it rains, the attribute Windy explains whether people play or not
• On days when it’s sunny, the attribute humidity explains when people play
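One way to check that Outlook explains "play" best is to compute the information gain of each attribute on the Play Golf dataset. The slides do not say which measure the tool uses, so entropy-based information gain (as in ID3) is an assumption here, and the function names are illustrative.

```python
import math
from collections import Counter

COLUMNS = ("Outlook", "Temp", "Humidity", "Windy", "Play")
ROWS = [  # cases a to n of the Play Golf dataset
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, column):
    """Entropy of Play minus the weighted entropy after splitting on `column`."""
    index = COLUMNS.index(column)
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder

gains = {column: information_gain(ROWS, column) for column in COLUMNS[:-1]}
best = max(gains, key=gains.get)
print(best, {column: round(gain, 3) for column, gain in gains.items()})
```

Repeating the same calculation on only the sunny rows, or only the rainy rows, is how the next split below Outlook would be chosen.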

Confidence and Support

Confidence refers to the relative frequency that an event occurs
• If golfers play 8 out of the 10 days it’s overcast then we have 8/10 confidence that golfers will play on overcast days

Support refers to number of times an event occurs out of all instances
• If it’s only overcast 1 day in 100 then there is only 1/100 support for the rule given above
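Applied to the Play Golf dataset rather than the 8/10 and 1/100 figures above, the same two definitions can be sketched like this. The `rule_stats` name is illustrative, and the code follows the slide's wording, where support is the share of all days on which the condition (the outlook) occurs.

```python
from fractions import Fraction

# (outlook, played) for cases a to n of the Play Golf dataset.
days = [
    ("sunny", False), ("sunny", False), ("overcast", True), ("rainy", True),
    ("rainy", True), ("rainy", False), ("overcast", True), ("sunny", False),
    ("sunny", True), ("rainy", True), ("sunny", True), ("overcast", True),
    ("overcast", True), ("rainy", False),
]

def rule_stats(days, outlook):
    """Support and confidence of the rule 'if the outlook is X, golfers play'."""
    matching = [played for day_outlook, played in days if day_outlook == outlook]
    support = Fraction(len(matching), len(days))         # how often the condition occurs
    confidence = Fraction(sum(matching), len(matching))  # how often golfers play when it does
    return support, confidence

support, confidence = rule_stats(days, "overcast")
print(support, confidence)
```

A rule with high confidence but very low support rests on few observations, which is why both figures are reported together.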

Decision Tree: Practical Applications

How can we reduce customer fraud?
Analyze customer characteristics:
• Fraudulent behavior (Y or N), age, education, occupation, frequency of purchase, dollar value of purchase, etc.

Who is likely to “churn” (stop buying from us)?
Analyze customer characteristics; who is:
• (1) still with us, and
• (2) no longer “on board”,
• Plus other demographic or transactional attributes...

Who is likely to be a credit risk?
Analyze customer characteristics: who has:
• (1) not been a credit risk in the past, and
• (2) who has been a credit risk in the past
• Include relevant customer characteristics
