Cases: Banking credit record

1

Business System Analysis & Decision Making – Data Mining and Web Mining

Zhangxi Lin, ISQS 5340

Summer II 2006

2

Outline

- Introduction to data mining & text mining
- Constructing a decision tree using SAS Enterprise Miner
- Web mining

3

Data Mining and Text Mining

4

Review - Decision Tree (1)

[Figure: decision tree]
Root: Total: 10, Accept: 4, Reject: 6 (Accuracy: 40%, Coverage: 100%)
Split: Gender (Female / Male)
Split: Credit Card Insurance (Yes / No)
Node label: Accuracy: 33.3%, Coverage: 25%

5

Review - Decision Tree (2)



[Figure: decision tree]
Split: Gender (Female / Male)
Node label: Accuracy: 16.7%, Coverage: 25%
Split: Credit Card Insurance (Yes / No)

How does this decision tree differ from the previous one?

6

Confusion Matrix (Rule: “Gender=Female”)

                   Actual Accept   Actual Reject   Total
Computed Accept          3               2           5
Computed Reject          1               4           5

Accuracy = 3 / (2 + 3) = 0.6
Coverage = 3 / (3 + 1) = 0.75

7

Confusion Matrix (Rule: “Credit Promotion = Yes”)

                   Actual Accept   Actual Reject   Total
Computed Accept          3               1           4
Computed Reject          1               5           6

Accuracy = 3 / (1 + 3) = 0.75
Coverage = 3 / (3 + 1) = 0.75
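The accuracy and coverage figures on the two confusion-matrix slides can be reproduced with a few lines of Python. This is only a sketch: the class and field names are made up for illustration, and the counts in the "Computed Reject" row are inferred from the slide totals (10 records, 4 accepts). Elsewhere these two measures are often called precision and recall.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for one accept/reject rule (hypothetical helper, not from SAS)."""
    accept_and_actual_accept: int   # computed accept, actual accept
    accept_and_actual_reject: int   # computed accept, actual reject
    reject_and_actual_accept: int   # computed reject, actual accept
    reject_and_actual_reject: int   # computed reject, actual reject

    def accuracy(self) -> float:
        # Slide definition: correct accepts / all computed accepts.
        computed = self.accept_and_actual_accept + self.accept_and_actual_reject
        return self.accept_and_actual_accept / computed

    def coverage(self) -> float:
        # Slide definition: correct accepts / all actual accepts.
        actual = self.accept_and_actual_accept + self.reject_and_actual_accept
        return self.accept_and_actual_accept / actual


gender_female = ConfusionMatrix(3, 2, 1, 4)
credit_promotion_yes = ConfusionMatrix(3, 1, 1, 5)

for name, cm in [("Gender=Female", gender_female),
                 ("Credit Promotion=Yes", credit_promotion_yes)]:
    print(f"{name}: accuracy={cm.accuracy():.2f}, coverage={cm.coverage():.2f}")
```

Both rules cover the same share of actual accepts, so the comparison between them comes down to accuracy.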

8

Generalizing data analysis ideas

Question: How can we extract useful rules from the large amounts of data generated in business operations?

Answer: Apply data mining techniques and tools.

9

What is Data Mining? (See Wikipedia data mining)

Many definitions:
- Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
- Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

10

Origins of Data Mining

- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data

[Figure: Data Mining shown together with Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

11

Why Mine Data? Commercial Viewpoint

- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge (e.g., in Customer Relationship Management)

12

Why Mine Data? Scientific Viewpoint

- Data collected and stored at enormous speeds (GB/hour)
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
- Traditional techniques infeasible for raw data
- Data mining may help scientists
  - in classifying and segmenting data
  - in hypothesis formation

13

Data Mining Tasks

- Prediction Methods
  - Use some variables to predict unknown or future values of other variables.
- Description Methods
  - Find human-interpretable patterns that describe the data.

From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996

14

Data Mining Tasks...

Classification [Predictive]

Clustering [Descriptive]

Association Rule Discovery [Descriptive]

Sequential Pattern Discovery [Descriptive]

Regression [Predictive]

Deviation Detection [Predictive]

15

What Text Mining Is (See Wikipedia text mining)

Text mining is a process that employs a set of algorithms for converting unstructured text into structured data objects and the quantitative methods used to analyze these data objects.

“SAS defines text mining as the process of investigating a large collection of free-form documents in order to discover and use the knowledge that exists in the collection as a whole.” (SAS Text Miner: Distilling Textual Data for Competitive Business Advantage)

16

A simple text mining example

A tiny case: 9 documents
1. deposit the cash and check in the bank - Fin
2. the river boat is on the bank - Riv
3. borrow based on credit - Fin
4. river boat floats up the river - Riv
5. boat is by the dock near the bank - Riv
6. with credit, I can borrow cash from the bank - Fin
7. boat floats by dock near the river bank - Riv
8. check the parade route to see the floats - Par
9. along the parade route - Par
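To see how unstructured text becomes structured data, the nine documents can be turned into term counts per category. The following is a minimal sketch using only the Python standard library; the tokenizer and function names are illustrative assumptions and do not reflect how SAS Text Miner works internally.

```python
import re
from collections import Counter

# The nine documents and their labels (Fin, Riv, Par) from the slide.
DOCUMENTS = [
    ("deposit the cash and check in the bank", "Fin"),
    ("the river boat is on the bank", "Riv"),
    ("borrow based on credit", "Fin"),
    ("river boat floats up the river", "Riv"),
    ("boat is by the dock near the bank", "Riv"),
    ("with credit, I can borrow cash from the bank", "Fin"),
    ("boat floats by dock near the river bank", "Riv"),
    ("check the parade route to see the floats", "Par"),
    ("along the parade route", "Par"),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts_by_label(documents):
    """Accumulate one Counter of term frequencies per category label."""
    counts = {}
    for text, label in documents:
        counts.setdefault(label, Counter()).update(tokenize(text))
    return counts


counts = term_counts_by_label(DOCUMENTS)
for term in ("bank", "river", "floats", "parade"):
    print(term, {label: c[term] for label, c in counts.items()})
```

Terms such as "bank" and "floats" appear in more than one category, which is what makes even this tiny corpus a useful illustration of ambiguity.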

17

Text Mining Strengths

- Clustering documents in a corpus
- Investigating word (token) distribution across documents within a corpus
- Identifying words with the highest discriminatory power
- Classifying documents into predefined categories
- Integrating text data with structured data to enrich predictive modeling endeavors

18

Text Mining Deficiencies

- Text mining algorithms perform poorly in distinguishing negations, for example:
  - Herman was involved in a motor vehicle accident.
  - Herman was NOT involved in a motor vehicle accident.
- Text mining cannot generally make value judgments, for example, classifying an article as positive or negative with respect to any tokens it contains.
- Text mining algorithms do not work well with large documents.
  - Performance is slow.
  - Increased term occurrence across documents decreases separation of documents.
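The negation problem is easy to demonstrate with a plain bag-of-words representation. The sketch below is an assumption about the kind of representation involved, not a description of any particular product: it compares the two Herman sentences by cosine similarity of their word counts.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Word-count vector for a sentence (naive lowercase tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


positive = bag_of_words("Herman was involved in a motor vehicle accident.")
negated = bag_of_words("Herman was NOT involved in a motor vehicle accident")
similarity = cosine_similarity(positive, negated)
print(f"cosine similarity = {similarity:.3f}")
```

The two sentences differ by a single token, so their vectors are nearly parallel even though the meanings are opposite.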

19

Using Data Mining Tools

Statistical Analysis System (SAS) (http://www.sas.org) “SAS®9 is the most recent release of SAS. It delivers analytical, data manipulation and reporting capabilities within a completely new framework.”

SPSS (http://www.spss.com) “SPSS customers include telecommunications, banking, finance, insurance, healthcare, manufacturing, retail, consumer packaged goods, higher education, government, and market research.”

Weka, an open source software product (http://www.cs.waikato.ac.nz/ml/weka/)

Microsoft SQL Server comes with major data mining utilities

There are more…

20

Using SAS Enterprise Miner to Construct a Decision Tree

21

SAS Enterprise Miner 4.3

Basics
- How to use the application main menu
- Using the pop-up menus
- Enterprise Miner documentation
- Project – Diagram

The SEMMA methodology
- Sample
- Explore
- Modify
- Model
- Assess

22

Exercise 5.0

Explore SAS and SAS Enterprise Miner

23

Decision Tree Example

Life Insurance Promotion Dataset (CreditProm)

24

Life Insurance Promotion Data

Income Range   Magazine Promo   Watch Promo   Life Ins Promo   Credit Card Ins.   Sex      Age
40-50,000      Yes              No            No               No                 Male     45
30-40,000      Yes              Yes           Yes              No                 Female   40
40-50,000      No               No            No               No                 Male     42
30-40,000      Yes              Yes           Yes              Yes                Male     43
50-60,000      Yes              No            Yes              No                 Female   38
20-30,000      No               No            No               No                 Female   55
30-40,000      Yes              No            Yes              Yes                Male     35
20-30,000      No               Yes           No               No                 Male     27
30-40,000      Yes              No            No               No                 Male     43
40-50,000      No               Yes           Yes              No                 Female   43
20-30,000      No               Yes           Yes              No                 Male     29
40-50,000      No               Yes           No               No                 Male     55
20-30,000      No               No            Yes              Yes                Female   19

25

Tree Algorithm: Find Best Split for Input

X1 (Credit Prom)

[Figure: Training Data along x1, with the labels "Missing in left branch", "Missing in right branch", and "Best Split x1", and the value 0.7]

Consider that the consumers in the life insurance promotion dataset have two attributes: credit card promotion and gender.

26

Tree Algorithm: Repeat for Other Inputs

X2 (Gender)

[Figure: Training Data along x2, with the labels "Logworth" and "Kass Adjusted", and the value 0.7]

27

Tree Algorithm: Compare Best Splits

[Figure: Training Data on x1 and x2, with the labels "Best Split x1" and "Best Split x2", and the value 0.7]
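The split comparison can be sketched numerically. The example below assumes that logworth means -log10 of the p-value of a Pearson chi-square test of the split against the target, which is the usual definition, and it leaves out the Kass adjustment shown on the slide. It uses the 13 rows of the Life Insurance Promotion Data table, with Life Ins Promo as the target; treating the Credit Card Ins. column as the "Credit Prom" input is an assumption made for illustration, and all function names are hypothetical.

```python
import math

# (Life Ins Promo, Credit Card Ins., Sex) for the 13 rows of the data table.
ROWS = [
    ("No", "No", "Male"), ("Yes", "No", "Female"), ("No", "No", "Male"),
    ("Yes", "Yes", "Male"), ("Yes", "No", "Female"), ("No", "No", "Female"),
    ("Yes", "Yes", "Male"), ("No", "No", "Male"), ("No", "No", "Male"),
    ("Yes", "No", "Female"), ("Yes", "No", "Male"), ("No", "No", "Male"),
    ("Yes", "Yes", "Female"),
]


def chi_square_2x2(rows, input_index, left_value, target_index=0, target_value="Yes"):
    """Pearson chi-square for a binary split against a binary target.

    Assumes both branches and both target classes are non-empty.
    """
    a = b = c = d = 0
    for row in rows:
        in_left = row[input_index] == left_value
        is_target = row[target_index] == target_value
        if in_left and is_target:
            a += 1
        elif in_left:
            b += 1
        elif is_target:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def logworth(chi_square):
    """-log10(p-value); with 1 degree of freedom, p = erfc(sqrt(chi2 / 2))."""
    return -math.log10(math.erfc(math.sqrt(chi_square / 2)))


candidates = {"Credit Card Ins. = Yes": (1, "Yes"), "Sex = Female": (2, "Female")}
scores = {name: logworth(chi_square_2x2(ROWS, index, value))
          for name, (index, value) in candidates.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: logworth = {score:.2f}")
print("best split:", best)
```

With so few rows the chi-square approximation is rough, so the scores are useful for ranking candidate splits rather than as reliable significance levels.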

28

Tree Algorithm: Partition with Best Split

[Figure: Training Data on x1 and x2, with the label "Best Split"]

29

Tree Algorithm: Repeat within Partitions

[Figure: Training Data on x1 and x2]

30

Tree Algorithm: Partition with Best Split

[Figure: Training Data on x1 and x2]

31

Tree Algorithm: Construct Maximal Tree

[Figure: Training Data on x1 and x2]

32

Overfitting

Overfitting: the tree is split too many times, so the classification error rate on new data rises.

We use a training dataset to find the decision rules. These rules must also be applicable to other datasets.

In order to test the validity of the rules, a test dataset is used.

By comparing the outcomes on these two datasets, we can identify any inconsistency and create a good decision tree.

33

Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region

- Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

34

How to Address Overfitting

- Pre-Pruning (Early Stopping Rule)
  - Stop the algorithm before it becomes a fully-grown tree
- We typically use two datasets:
  - Training dataset for growing the decision tree and obtaining rules
  - Test dataset for testing whether the rules are good enough, judged by the error rate when the rules from the training dataset are applied to the test dataset
- If there is no test dataset, the original dataset is partitioned into two subsets for this purpose.
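Partitioning a single dataset into training and test subsets might look like the sketch below. The 30% test fraction, the seed, and the helper names are arbitrary assumptions; the records are the (Sex, Life Ins Promo) pairs from the Life Insurance Promotion Data table, and the rule "Female implies Yes" is just an example rule to evaluate.

```python
import random

# (Sex, Life Ins Promo) pairs taken from the Life Insurance Promotion Data table.
RECORDS = [
    ("Male", "No"), ("Female", "Yes"), ("Male", "No"), ("Male", "Yes"),
    ("Female", "Yes"), ("Female", "No"), ("Male", "Yes"), ("Male", "No"),
    ("Male", "No"), ("Female", "Yes"), ("Male", "Yes"), ("Male", "No"),
    ("Female", "Yes"),
]


def partition(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def error_rate(rule, records):
    """Fraction of (input, label) records that the rule misclassifies."""
    wrong = sum(1 for value, label in records if rule(value) != label)
    return wrong / len(records)


def female_rule(sex):
    """Example rule: predict acceptance for female customers only."""
    return "Yes" if sex == "Female" else "No"


training, test = partition(RECORDS)
print(f"training error = {error_rate(female_rule, training):.2f}")
print(f"test error     = {error_rate(female_rule, test):.2f}")
```

A large gap between the two error rates would suggest that the rule does not generalize, although with only 13 records either estimate is very noisy.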

35

Exercise 5

Download the Life Insurance Promotion dataset (CreditProm)

Import the data into SAS

Try out SAS Decision Tree modeling

36

SAS Data Mining Example

- A German bank’s credit data
- Online SAS materials (PDF, 2.24MB)
  - P70, dataset description
  - P71, decision matrix

37

Web Mining

38

Case study: CarPort.com

CarPort.com is a fictitious Web site that is used to illustrate
- components of Web site design and Web log analysis
- a services Web site.

39

CarPort.com

Visitor profile could be any of the following:
1. buyer looking for a car
2. seller looking to sell a used car
3. curious information seeker
4. competitor
5. robot or spider
6. lost Web surfer
7. SAS course developer.

40

CarPort.com

Services:
- car locator (want ads)
- car ownership information

Sources of revenue:
- banner ads
- used car ads
- partnership agreements (fee for referral)

41

How Did You Get Here?

- Followed a link from another site
- Clicked on a banner ad
- Did a Google search
- Saw an advertisement on television, or heard one on radio
- Received a direct mail solicitation
- Received a phone solicitation
- Heard the site mentioned or recommended on a news or specialty program, or read about it in the printed media

42

[Figure: annotated Web page with the labels Links, Banner Ad = Link + Image, Title, URL, and Images]

43

[Figure: Web page with callouts]
- Click on this link to find out more or e-mail the seller.
- Link to dealer’s Web site.

44

Web Mining for Profitability

- Increase viewing, navigation, and transaction efficiency.
- Improve the customer experience.
- Add services and features that promote cross-selling and up-selling opportunities.
- Identify problem areas.
- Improve security.
- Attract more high quality customers.

45

Michael Berry’s Internet Business Taxonomy

Classification is based on an Internet company’s business model, which may include:
- selling things that get delivered in a truck
- selling things that get delivered through the ether
- selling eyes to advertisers
- connecting sellers and buyers
- empowering communities and collecting donations.

46

Some Business Questions

- Who is visiting my Web site?
- Who is buying my product(s)?
- Who are my repeat buyers?
- Which customers are churning?
- Which Web design produces the most purchases?
- What campaign strategies are most effective in increasing Web site visits?

47

More Questions

- What factors influence product purchases?
  - Time-of-day effects
  - Gender, Age, Income, and so forth
  - Latent factors: e-shopper, Web expert, and so forth
- Which sales channels produce the most profitable customers?
- Do any site-visit patterns correlate with outcomes that can be exploited for business advantage?

48

Web Log Fields

- User’s IP address, also called
  - Remote host name
  - Client IP address
- User name, also called
  - Remote user log name (may be different)
  - Authenticated user name
- Date and time of request, with or without a UTC offset
- Request type, also called “method”
  - HTTP request with (CLF) or without (IIS) argument
- Status: HTTP three-digit status code
- Number of bytes sent to client

continued...

49

Web Log Fields

- The URL path requested, if request type has no argument
- The port to which the request was served
- The name of the server
- The IP address of the server
- The time taken to serve the request
- Number of bytes in the request received from the client
- User agent, which is usually a text string with the name and version number of Web browser used by the client and the operating system of the client machine
- The domain name or IP address of the referring URL
- Query information in a text string
- Cookie information in a text string

50

Common Log Format

Value                   Example
Remote Host Name        111.22.333.44
Remote User Log Name    -
Username                IRVINE/terry
Date                    15/Apr/2000
Time and UTC Offset     11:28:14 -0700
Request Type            GET /index.html HTTP/1.1
Service Status Code     200
Bytes Sent              2792
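A Common Log Format line can be parsed with one regular expression. The sketch below assembles the slide's example values into the standard CLF layout (timestamp in square brackets, request in double quotes); the pattern and function name are illustrative assumptions rather than part of any SAS tool.

```python
import re
from datetime import datetime

# host, remote log name, user, [timestamp], "request", status, bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf(line):
    """Parse one Common Log Format line into a dict of typed fields."""
    match = CLF_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a Common Log Format line: {line!r}")
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # A "-" in the bytes field means no content was returned.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


line = ('111.22.333.44 - IRVINE/terry [15/Apr/2000:11:28:14 -0700] '
        '"GET /index.html HTTP/1.1" 200 2792')
entry = parse_clf(line)
print(entry["host"], entry["user"], entry["status"], entry["bytes"])
```

The host is matched as any non-space token, so the slide's placeholder address is accepted even though it is not a valid IP address.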

51

The User Session

[Figure: exchange between Browser and Web Server]

1. User requests index.htm.
2. Server sends copy of index.htm.
3. Browser parses index.htm, finds references to image files, and requests image files.

...

52

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

53

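Definition: Association Rule

Association Rule
- An implication expression of the form X → Y, where X and Y are itemsets
- Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
- Support (s): fraction of transactions that contain both X and Y
- Confidence (c): measures how often items in Y appear in transactions that contain X

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example: $\{\text{Milk}, \text{Diaper}\} \rightarrow \{\text{Beer}\}$

$s = \frac{\sigma(\text{Milk}, \text{Diaper}, \text{Beer})}{|T|} = \frac{2}{5} = 0.4$

$c = \frac{\sigma(\text{Milk}, \text{Diaper}, \text{Beer})}{\sigma(\text{Milk}, \text{Diaper})} = \frac{2}{3} = 0.67$

Here $\sigma(\cdot)$ is the number of transactions that contain all the listed items, and $|T|$ is the total number of transactions.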
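Support and confidence for the rule {Milk, Diaper} → {Beer} can be checked directly against the five market-basket transactions. This is a minimal sketch with hypothetical function names, representing each transaction as a Python set.

```python
# The five market-basket transactions from the slide.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for transaction in transactions if itemset <= transaction)


def support(x, y, transactions):
    """Fraction of transactions that contain both X and Y."""
    return support_count(x | y, transactions) / len(transactions)


def confidence(x, y, transactions):
    """How often Y appears in the transactions that contain X."""
    return support_count(x | y, transactions) / support_count(x, transactions)


x, y = {"Milk", "Diaper"}, {"Beer"}
s = support(x, y, TRANSACTIONS)
c = confidence(x, y, TRANSACTIONS)
print(f"support = {s:.2f}, confidence = {c:.2f}")
```

The same two functions can be applied to the other example rules, such as {Diaper} → {Beer}, to compare how strongly each one is supported by the data.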

54

Obtaining a Dataset from Web Log for SAS Data Analysis

- Example: IMW’s Web Log Data (raw data, SAS dataset)
- Data Processing Skills
  - Converting the dataset into an Excel file
  - Importing the data into SAS

55

SAS Association Model

56

Association Rules from IMW’s Dataset

57

Exercise 6

- Download IMW’s Web Log raw data
- Data conversion within Excel
- Import the dataset into SAS
- Try out SAS Association Analysis model
