Cases: Banking credit record
1
Business System Analysis & Decision Making – Data Mining and Web Mining
Zhangxi Lin
ISQS 5340
Summer II 2006
2
Outline
- Introduction to data mining & text mining
- Constructing a decision tree using SAS Enterprise Miner
- Web mining
3
Data Mining and Text Mining
4
Review - Decision Tree (1)
Root: Total: 10, Accept: 4, Reject: 6; Accuracy: 40%, Coverage: 100%
Split on Gender
  Female: Total: 5, Accept: 3, Reject: 2; Accuracy: 60%, Coverage: 75%
  Male: Total: 5, Accept: 1, Reject: 4; Accuracy: 20%, Coverage: 25%
Split on Credit Card Insurance (within the Female branch)
  Yes: Total: 2, Accept: 2, Reject: 0; Accuracy: 100%, Coverage: 50%
  No: Total: 3, Accept: 1, Reject: 2; Accuracy: 33.3%, Coverage: 25%
5
Review - Decision Tree (2)
Root: Total: 10, Accept: 4, Reject: 6; Accuracy: 40%, Coverage: 100%
Split on Gender
  Female: Total: 4, Accept: 3, Reject: 1; Accuracy: 75%, Coverage: 75%
  Male: Total: 6, Accept: 1, Reject: 5; Accuracy: 16.7%, Coverage: 25%
Split on Credit Card Insurance (within the Female branch)
  Yes: Total: 2, Accept: 2, Reject: 0; Accuracy: 100%, Coverage: 50%
  No: Total: 2, Accept: 1, Reject: 1; Accuracy: 50%, Coverage: 25%
How does this decision tree differ from the previous one?
6
Confusion Matrix (Rule: “Gender=Female”)
                   Actual Accept   Actual Reject   Total
Computed Accept    3               2               5
Computed Reject    1               4               5
Accuracy = 3 / (2 + 3) = 0.6
Coverage = 3 / (3 + 1) = 0.75
7
Confusion Matrix (Rule: “Credit Promotion = Yes”)
                   Actual Accept   Actual Reject   Total
Computed Accept    3               1               4
Computed Reject    1               5               6
Accuracy = 3 / (1 + 3) = 0.75
Coverage = 3 / (3 + 1) = 0.75
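The accuracy and coverage figures in the two confusion matrices can be checked with a few lines of Python. This is only a minimal sketch: the function name `rule_metrics` and its argument names are illustrative and do not come from the slides. Accuracy here is the share of computed accepts that are actual accepts, and coverage is the share of actual accepts that the rule captures.

```python
def rule_metrics(accept_accept, accept_reject, reject_accept):
    """Return (accuracy, coverage) for a rule that predicts "accept".

    accept_accept: computed accept, actual accept
    accept_reject: computed accept, actual reject
    reject_accept: computed reject, actual accept
    (Argument names are hypothetical, chosen for this sketch.)
    """
    accuracy = accept_accept / (accept_accept + accept_reject)
    coverage = accept_accept / (accept_accept + reject_accept)
    return accuracy, coverage


# Rule "Gender = Female": 3 actual accepts and 2 actual rejects computed as accept.
gender_rule = rule_metrics(3, 2, 1)
# Rule "Credit Promotion = Yes": 3 actual accepts and 1 actual reject computed as accept.
credit_rule = rule_metrics(3, 1, 1)

print("Gender = Female:", gender_rule)
print("Credit Promotion = Yes:", credit_rule)
```

Both rules capture the same share of actual accepts, so the second rule is preferable only because fewer of its computed accepts are wrong.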
8
Generalizing data analysis ideas
Question: How can we find useful rules in the large amount of data generated in business operations?
Answer: Applying data mining techniques/tools
9
What is Data Mining? (See Wikipedia data mining)
Many Definitions
- Non-trivial extraction of implicit, previously unknown and potentially useful information from data
- Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
10
Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  Enormity of data
  High dimensionality of data
  Heterogeneous, distributed nature of data
[Figure: Data Mining shown where Machine Learning/Pattern Recognition, Statistics/AI, and Database systems meet]
11
Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused
  Web data, e-commerce
  Purchases at department/grocery stores
  Bank/Credit Card transactions
- Computers have become cheaper and more powerful
- Competitive Pressure is Strong
  Provide better, customized services for an edge (e.g. in Customer Relationship Management)
12
Why Mine Data? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
  remote sensors on a satellite
  telescopes scanning the skies
  microarrays generating gene expression data
  scientific simulations generating terabytes of data
- Traditional techniques infeasible for raw data
- Data mining may help scientists
  in classifying and segmenting data
  in Hypothesis Formation
13
Data Mining Tasks
- Prediction Methods
  Use some variables to predict unknown or future values of other variables.
- Description Methods
  Find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
14
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
15
What Text Mining Is (See Wikipedia text mining)
Text mining is a process that employs a set of algorithms for converting unstructured text into structured data objects and the quantitative methods used to analyze these data objects.
“SAS defines text mining as the process of investigating a large collection of free-form documents in order to discover and use the knowledge that exists in the collection as a whole.” (SAS Text Miner: Distilling Textual Data for Competitive Business Advantage)
16
A simple text mining example
A tiny case - 9 documents
1. deposit the cash and check in the bank - Fin
2. the river boat is on the bank - Riv
3. borrow based on credit - Fin
4. river boat floats up the river - Riv
5. boat is by the dock near the bank - Riv
6. with credit, I can borrow cash from the bank - Fin
7. boat floats by dock near the river bank - Riv
8. check the parade route to see the floats - Par
9. along the parade route - Par
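To make the tiny case concrete, the sketch below counts, for each category label (Fin, Riv, Par), how many documents contain a given term. It is an illustration only and not the SAS Text Miner procedure. The whitespace tokenizer and the names `tokenize` and `term_by_label` are assumptions of this sketch.

```python
from collections import Counter

# The nine documents and their labels (Fin, Riv, Par) from the slide.
documents = [
    ("deposit the cash and check in the bank", "Fin"),
    ("the river boat is on the bank", "Riv"),
    ("borrow based on credit", "Fin"),
    ("river boat floats up the river", "Riv"),
    ("boat is by the dock near the bank", "Riv"),
    ("with credit, I can borrow cash from the bank", "Fin"),
    ("boat floats by dock near the river bank", "Riv"),
    ("check the parade route to see the floats", "Par"),
    ("along the parade route", "Par"),
]


def tokenize(text):
    """Lowercase the text, drop commas, and split on whitespace."""
    return text.lower().replace(",", "").split()


# For each label, count the number of documents in which each term occurs.
term_by_label = {}
for text, label in documents:
    counts = term_by_label.setdefault(label, Counter())
    counts.update(set(tokenize(text)))  # each term counted once per document

for term in ("bank", "river", "credit", "parade", "floats"):
    print(term, {label: counts[term] for label, counts in term_by_label.items()})
```

In this corpus "bank" appears in both Fin and Riv documents, so it separates the categories poorly, while "credit", "river", and "parade" each occur in a single category.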
17
Text Mining Strengths
- Clustering documents in a corpus
- Investigating word (token) distribution across documents within a corpus
- Identifying words with the highest discriminatory power
- Classifying documents into predefined categories
- Integrating text data with structured data to enrich predictive modeling endeavors
18
Text Mining Deficiencies
- Text mining algorithms perform poorly in distinguishing negations, for example:
  Herman was involved in a motor vehicle accident.
  Herman was NOT involved in a motor vehicle accident.
- Text mining cannot generally make value judgments, for example, classifying an article as positive or negative with respect to any tokens it contains.
- Text mining algorithms do not work well with large documents.
  Performance is slow.
  Increased term occurrence across documents decreases separation of documents.
19
Using Data Mining Tools
Statistical Analysis System (http://www.sas.com) “SAS®9 is the most recent release of SAS. It delivers analytical, data manipulation and reporting capabilities within a completely new framework.”
SPSS (http://www.spss.com) “SPSS customers include telecommunications, banking, finance, insurance, healthcare, manufacturing, retail, consumer packaged goods, higher education, government, and market research.”
Weka, an open source software product (http://www.cs.waikato.ac.nz/ml/weka/ )
Microsoft SQL Server comes with major data mining utilities
There are more…
20
Using SAS Enterprise Miner to Construct a Decision Tree
21
SAS Enterprise Miner 4.3
Basic
- How to use the application main menu
- Using the pop-up menus
- Enterprise Miner documentation
- Project – Diagram
The SEMMA methodology
- Sample
- Explore
- Modify
- Model
- Assess
22
Exercise 5.0
Explore SAS and SAS Enterprise Miner
23
Decision Tree Example
Life Insurance Promotion Dataset (CreditProm)
24
Life Insurance Promotion Data
Income Range  Magazine Promo  Watch Promo  Life Ins Promo  Credit Card Ins.  Sex     Age
40-50,000     Yes             No           No              No                Male    45
30-40,000     Yes             Yes          Yes             No                Female  40
40-50,000     No              No           No              No                Male    42
30-40,000     Yes             Yes          Yes             Yes               Male    43
50-60,000     Yes             No           Yes             No                Female  38
20-30,000     No              No           No              No                Female  55
30-40,000     Yes             No           Yes             Yes               Male    35
20-30,000     No              Yes          No              No                Male    27
30-40,000     Yes             No           No              No                Male    43
30-40,000     Yes             Yes          Yes             No                Female  41
40-50,000     No              Yes          Yes             No                Female  43
20-30,000     No              Yes          Yes             No                Male    29
50-60,000     Yes             Yes          Yes             No                Female  39
40-50,000     No              Yes          No              No                Male    55
20-30,000     No              No           Yes             Yes               Female  19
25
Tree Algorithm: Find Best Split for Input
Input: x1 (Credit Prom)
[Figure: training data for input x1; labels: 0.7, Missing in left branch, Missing in right branch, Best Split x1]
Consider that the consumers in the life insurance promotion dataset have two attributes: credit card promotion and gender.
26
Tree Algorithm: Repeat for Other Inputs
Input: x2 (Gender)
[Figure: training data for input x2; labels: 0.7, Missing in left branch, Missing in right branch, Logworth, Kass Adjusted]
27
Tree Algorithm: Compare Best Splits
[Figure: training data; labels: 0.7, Missing in left branch, Missing in right branch, Best Split x1, Best Split x2, x1, x2]
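The comparison of best splits can be sketched in Python using the 15 records of the Life Insurance Promotion table. The sketch rests on several assumptions that the slides do not state: Life Ins Promo is taken as the target, x1 is taken to be the Credit Card Ins. column, and logworth is computed in the usual way as -log10 of the chi-square p-value. The Kass adjustment named in the figure labels is left out, and the function name `logworth` is illustrative.

```python
import math

# (Life Ins Promo, Credit Card Ins., Sex) for the 15 customers in the table.
# Treating Life Ins Promo as the target is an assumption of this sketch.
records = [
    ("No", "No", "Male"), ("Yes", "No", "Female"), ("No", "No", "Male"),
    ("Yes", "Yes", "Male"), ("Yes", "No", "Female"), ("No", "No", "Female"),
    ("Yes", "Yes", "Male"), ("No", "No", "Male"), ("No", "No", "Male"),
    ("Yes", "No", "Female"), ("Yes", "No", "Female"), ("Yes", "No", "Male"),
    ("Yes", "No", "Female"), ("No", "No", "Male"), ("Yes", "Yes", "Female"),
]


def logworth(rows, input_index, target_index=0):
    """Return (chi-square, logworth) for a binary input against a binary target."""
    inputs = sorted({row[input_index] for row in rows})
    targets = sorted({row[target_index] for row in rows})
    n = len(rows)
    chi2 = 0.0
    for a in inputs:
        for b in targets:
            observed = sum(1 for r in rows
                           if r[input_index] == a and r[target_index] == b)
            row_total = sum(1 for r in rows if r[input_index] == a)
            col_total = sum(1 for r in rows if r[target_index] == b)
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    # With one degree of freedom, the chi-square p-value is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, -math.log10(p_value)


chi2_x1, logworth_x1 = logworth(records, input_index=1)  # x1: Credit Card Ins.
chi2_x2, logworth_x2 = logworth(records, input_index=2)  # x2: Sex

best = "x2 (Sex)" if logworth_x2 > logworth_x1 else "x1 (Credit Card Ins.)"
print(f"x1 logworth = {logworth_x1:.3f}, x2 logworth = {logworth_x2:.3f}, best = {best}")
```

Under these assumptions the Sex split has the larger logworth, so it would be chosen as the first partition.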
28
Tree Algorithm: Partition with Best Split
[Figure: training data; labels: Best Split, x1, x2]
29
Tree Algorithm: Repeat within Partitions
[Figure: training data; labels: x1, x2]
30
Tree Algorithm: Partition with Best Split
[Figure: training data; labels: x1, x2]
31
Tree Algorithm: Construct Maximal Tree
[Figure: training data; labels: x1, x2]
32
Overfitting
Overfitting: The tree is split too much, and the classification error rate on data outside the training dataset gets higher.
We use a training dataset to find the decision rules. These must be applicable to other datasets.
In order to test the validity of the rules, a test dataset is used.
By comparing the outcomes between these two datasets, we can identify any inconsistency and create a good decision tree.
33
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region
- Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
34
How to Address Overfitting
Pre-Pruning (Early Stopping Rule)
- Stop the algorithm before it becomes a fully-grown tree
- We typically use two datasets:
  Training dataset for growing the decision tree and obtaining rules
  Test dataset for testing whether the rules are good enough, judged by the error rate when the rules from the training dataset are applied to the test dataset
- If there is no test dataset, the original dataset is partitioned into two subsets for the above purpose.
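Partitioning one dataset into training and test subsets can be sketched as follows. The 30% test fraction, the fixed seed, and the function name `split_dataset` are assumptions chosen for illustration. The slides do not prescribe a ratio, and SAS Enterprise Miner provides its own partitioning tools.

```python
import random


def split_dataset(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and return (training, test) lists.

    test_fraction and seed are illustrative defaults, not values from the slides.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for a 15-record dataset: record numbers 1..15.
record_ids = list(range(1, 16))
training, test = split_dataset(record_ids)

print(len(training), "training records;", len(test), "test records")
```

The tree is grown on `training` only, and the error rate of its rules is then measured on `test`, whose records the tree never saw while it was being grown.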
35
Exercise 5
- Download the Life Insurance Promotion dataset (CreditProm)
- Import the data to SAS
- Try out SAS Decision Tree modeling
36
SAS Data Mining Example
- A German Bank’s Credit Data
- Online SAS materials (View PDF (2.24MB))
  p. 70: dataset description
  p. 71: decision matrix
37
Web Mining
38
Case study: CarPort.com
CarPort.com is a fictitious Web site that is used to illustrate
- components of Web site design and Web log analysis
- a services Web site.
39
CarPort.com
Visitor profile could be any of the following:
1. buyer looking for a car
2. seller looking to sell a used car
3. curious information seeker
4. competitor
5. robot or spider
6. lost Web surfer
7. SAS course developer.
40
CarPort.com
Services:
- car locator (want ads)
- car ownership information
Sources of revenue:
- banner ads
- used car ads
- partnership agreements (fee for referral)
41
How Did You Get Here?
- Followed a link from another site
- Clicked on a banner ad
- Did a Google search
- Saw an advertisement on television, or heard one on radio
- Received a direct mail solicitation
- Received a phone solicitation
- Heard the site mentioned or recommended on a news or specialty program, or read about it in the printed media
42
[Figure: annotated Web page; labels: Title, URL, Links, Images, Banner Ad = Link + Image]
43
[Figure: annotated Web page; labels: "Click on this link to find out more or e-mail the seller." and "Link to dealer’s Web site."]
44
Web Mining for Profitability
- Increase viewing, navigation, and transaction efficiency.
- Improve the customer experience.
- Add services and features that promote cross-selling and up-selling opportunities.
- Identify problem areas.
- Improve security.
- Attract more high quality customers.
45
Michael Berry’s Internet Business Taxonomy
Classification is based on an Internet company’s business model, which may include:
- selling things that get delivered in a truck
- selling things that get delivered through the ether
- selling eyes to advertisers
- connecting sellers and buyers
- empowering communities and collecting donations.
46
Some Business Questions
- Who is visiting my Web site?
- Who is buying my product(s)?
- Who are my repeat buyers?
- Which customers are churning?
- Which Web design produces the most purchases?
- What campaign strategies are most effective in increasing Web site visits?
47
More Questions
- What factors influence product purchases?
  • Time-of-day effects
  • Gender, Age, Income, and so forth
  • Latent factors: e-shopper, Web expert, and so forth
- Which sales channels produce the most profitable customers?
- Do any site-visit patterns correlate with outcomes that can be exploited for business advantage?
48
Web Log Fields
- User’s IP address, also called
  Remote host name
  Client IP address
- User name, also called
  Remote user log name (may be different)
  Authenticated user name
- Date and time of request, with or without a UTC offset
- Request type, also called “method”
  HTTP request with (CLF) or without (IIS) argument
- Status: HTTP three digit status code
- Number of bytes sent to client
continued...
49
Web Log Fields
- The URL path requested, if request type has no argument
- The port to which the request was served
- The name of the server
- The IP address of the server
- The time taken to serve the request
- Number of bytes in the request received from the client
- User agent, which is usually a text string with the name and version number of Web browser used by the client and the operating system of the client machine
- The domain name or IP address of the referring URL
- Query information in a text string
- Cookie information in a text string
50
Common Log Format
Value                  Example
Remote Host Name       111.22.333.44
Remote User Log Name   -
Username               IRVINE/terry
Date                   15/Apr/2000
Time and UTC Offset    11:28:14 -0700
Request Type           GET /index.html HTTP/1.1
Service Status Code    200
Bytes Sent             2792
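A Common Log Format record can be split into these fields with a regular expression. The sketch below assembles the slide's example values into one log line. The exact line layout (bracketed timestamp, quoted request) follows the usual Common Log Format convention and is an assumption here, as are the pattern and group names.

```python
import re

# One named group per field: host, remote log name, user name,
# [date:time offset], "request", status code, bytes sent.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^:]+):(?P<time>\S+) (?P<offset>[+-]\d{4})\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

# The slide's example values written as a single log line (assumed layout).
line = ('111.22.333.44 - IRVINE/terry [15/Apr/2000:11:28:14 -0700] '
        '"GET /index.html HTTP/1.1" 200 2792')

match = CLF_PATTERN.match(line)
fields = match.groupdict()

# The request itself has three parts: method, path, and protocol.
method, path, protocol = fields["request"].split()

for name, value in fields.items():
    print(f"{name:8} {value}")
```

Once each line is parsed into named fields, the records can be written out as a table (for example CSV) for import into an analysis tool.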
51
The User Session
[Figure: exchange between the Browser and the Web Server]
- User requests index.htm.
- Server sends copy of index.htm.
- Browser parses index.htm, finds references to image files, and requests image files.
- ...
52
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
53
Definition: Association Rule
Association Rule
- An implication expression of the form X → Y, where X and Y are itemsets
- Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
- Support (s): Fraction of transactions that contain both X and Y
- Confidence (c): Measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}, where $\sigma(\cdot)$ is the number of transactions containing the itemset and $|T|$ is the total number of transactions:
$s = \frac{\sigma(\{\text{Milk, Diaper, Beer}\})}{|T|} = \frac{2}{5} = 0.4$
$c = \frac{\sigma(\{\text{Milk, Diaper, Beer}\})}{\sigma(\{\text{Milk, Diaper}\})} = \frac{2}{3} = 0.67$
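Support and confidence for the rule {Milk, Diaper} → {Beer} can be computed directly from the five market-basket transactions. This is a minimal sketch. The function names `support_count`, `support`, and `confidence` are illustrative, and the code is not the SAS Association node.

```python
# The five market-basket transactions from the slide, as sets of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(x, y, transactions):
    """Fraction of transactions that contain both X and Y."""
    return support_count(x | y, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support_count(x | y, transactions) / support_count(x, transactions)


x, y = {"Milk", "Diaper"}, {"Beer"}
s = support(x, y, transactions)
c = confidence(x, y, transactions)
print(f"support = {s:.2f}, confidence = {c:.2f}")  # support = 0.40, confidence = 0.67
```

Transactions 3 and 4 contain all three items, and transactions 3, 4, and 5 contain Milk and Diaper, which gives 2/5 and 2/3.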
54
Obtaining a Dataset from Web Log for SAS Data Analysis
- Example: IMW’s Web Log Data (raw data, SAS dataset)
- Data Processing Skills
  Converting the dataset into an Excel file
  Importing the data into SAS
55
SAS Association Model
56
Association Rules from IMW’s Dataset
57
Exercise 6
- Download IMW’s Web Log raw data (raw data)
- Data conversion within Excel
- Import the dataset to SAS
- Try out SAS Association Analysis model