Methodology
Qiang Yang, MTM521 Material
A High-level Process View for Data Mining
1. Develop an understanding of the application, set goals, and lay down all questions a user might pose as queries
2. Create a dataset for study (from a data warehouse, Web site, or surveys)
3. Data cleaning and preprocessing
4. Data reduction and projection
5. Choose the data mining task: black-box or white-box? Classification or clustering?
6. Choose the data mining algorithms
7. Use the algorithms to perform the task
8. Interpret, evaluate, cross-validate, and iterate through steps 1-7 if necessary
9. Deploy: integrate into operational systems, gather feedback, revise goals, and redo steps 1-9
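The iterate-until-satisfied loop in the steps above can be sketched in code. Everything below is a toy illustration: the step functions are invented placeholders, not part of any real data mining library.

```python
# Toy sketch of the iterative process above; every step function here is
# an invented placeholder, not a real data mining API.

def create_dataset(source):               # step 2: gather raw records
    return list(source)

def clean(data):                          # step 3: drop missing values
    return [x for x in data if x is not None]

def project(data):                        # step 4: reduction/projection (no-op here)
    return data

def learn(data):                          # steps 5-7: toy "model" = mean threshold
    threshold = sum(data) / len(data)
    return lambda x: x >= threshold

def evaluate(model, data):                # step 8: fraction predicted positive
    return sum(model(x) for x in data) / len(data)

def run_process(source, target=0.5, max_iters=3):
    """Iterate steps 2-8 until the evaluation meets the goal, then hand off (step 9)."""
    for _ in range(max_iters):
        data = project(clean(create_dataset(source)))
        model = learn(data)
        score = evaluate(model, data)
        if score >= target:
            break                         # otherwise iterate through the steps again
    return model, score

model, score = run_process([3, None, 5, 7, 1])
```

The point of the sketch is the control flow: evaluation (step 8) feeds back into earlier steps until the goal is met.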
Case Study: German Bank Credit
Application: bank credit assessment
Decision: whether or not to approve a loan
Usage: automatic online screening, or as a human assistant
Objective: accurate prediction of values; giving the reasons behind a decision is important
Potential Queries
Who is likely to be approved for a loan?
What are the most important characteristics of an applicant to look at?
What are the most indicative features for yes/no answers?
What subset of customers should be marketed to, and what is the associated profit?
Added: what advice can be given to an applicant to improve their chances in the future?
Create Data Set for Study
Access the bank's data warehouse or conduct a customer survey.
Must the cost of obtaining the data be factored in?
How likely is it that quality data can be obtained in a limited amount of time?
Questions to be Asked
Attribute 1 (qualitative): Status of existing checking account
A11: ... < 0 DM
A12: 0 <= ... < 200 DM
A13: ... >= 200 DM / salary assignments for at least 1 year
A14: no checking account

Attribute 2 (numerical): Duration in months

Attribute 3 (qualitative): Credit history
A30: no credits taken / all credits paid back duly
A31: all credits at this bank paid back duly
A32: existing credits paid back duly till now
A33: delay in paying off in the past
A34: critical account / other credits existing (not at this bank)

Attribute 4 (qualitative): Purpose
A40: car (new)
A41: car (used)
A42: furniture/equipment
A43: radio/television
A44: domestic appliances
A45: repairs
A46: education
A47: (vacation - does not exist?)
A48: retraining
A49: business
A410: others

Attribute 5 (numerical): Credit amount

Attribute 6 (qualitative): Savings account/bonds
A61: ... < 100 DM
A62: 100 <= ... < 500 DM
A63: 500 <= ... < 1000 DM
A64: ... >= 1000 DM
A65: unknown / no savings account
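As a worked example of preparing these attributes for an algorithm, qualitative codes such as A11-A65 are commonly one-hot encoded, while numerical attributes pass through unchanged. The sketch below is illustrative only; the category lists are copied from the attribute definitions above.

```python
# Minimal sketch: one-hot encode the qualitative codes defined above so
# distance- and model-based methods can consume them.

CHECKING = ["A11", "A12", "A13", "A14"]         # Attribute 1: checking account status
SAVINGS = ["A61", "A62", "A63", "A64", "A65"]   # Attribute 6: savings account/bonds

def one_hot(code, categories):
    """Map a qualitative code to a 0/1 indicator vector."""
    return [1 if code == c else 0 for c in categories]

# Example applicant: no checking account (A14), savings < 100 DM (A61),
# 24-month duration, 3000 DM credit amount (numerical values pass through).
row = one_hot("A14", CHECKING) + one_hot("A61", SAVINGS) + [24, 3000]
# row == [0, 0, 0, 1, 1, 0, 0, 0, 0, 24, 3000]
```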
Data Cleaning and Preprocessing:
What should be done with missing values?
How do we fill in missing values, and identify and correct incorrect values?
Do we know the cost of classification mistakes?
Do we know the cost of obtaining each feature?
How do we reduce noise? What are the sources of noise for each attribute?
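Two standard answers to the missing-value question are a mean fill for numerical attributes and a mode fill for qualitative ones. The sketch below is a minimal illustration; the sample values are invented.

```python
# Minimal sketch: impute missing (None) values with the mean for
# numerical attributes and the most frequent code for qualitative ones.
from statistics import mean, mode

def impute_numeric(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_qualitative(values):
    """Replace None with the most frequent observed code."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]

durations = impute_numeric([24, None, 12, 36])               # mean of 24, 12, 36 is 24
histories = impute_qualitative(["A32", None, "A32", "A34"])  # most frequent code is "A32"
```

Which fill is appropriate depends on the cost questions above: if mistakes on an attribute are expensive, a richer imputation (or dropping the record) may be warranted.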
Rudimentary Analysis
What is the data distribution?
How can you view the data from different angles?
What does the rudimentary data analysis tell you? Are you satisfied with the analysis?
Are there more queries that you cannot answer through this analysis?
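A first pass at these questions is frequency counts for qualitative attributes and summary statistics for numerical ones. The sample values below are made up purely for illustration.

```python
# Minimal sketch of rudimentary analysis: a frequency table for one
# qualitative attribute and summary statistics for one numerical attribute.
from collections import Counter
from statistics import mean, median

purposes = ["A40", "A43", "A40", "A46", "A43", "A40"]  # Attribute 4 (made-up sample)
amounts = [1000, 2500, 750, 4000, 1200, 900]           # Attribute 5 (made-up sample)

distribution = Counter(purposes)  # how common is each purpose?
summary = {
    "min": min(amounts),
    "median": median(amounts),
    "mean": mean(amounts),
    "max": max(amounts),
}
```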
Data reduction
How many data features do we want in the end?
Is it a data reduction problem or data transformation problem?
Is it supervised data reduction or unsupervised data reduction problem?
Is it linear data reduction or nonlinear data reduction problem?
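One concrete answer to the linear, unsupervised case above is a low-variance filter: drop feature columns that barely vary, since they carry little information. This is a minimal sketch; the threshold and data are invented for illustration.

```python
# Minimal sketch of unsupervised, linear data reduction: drop feature
# columns whose variance falls below a (made-up) threshold.
from statistics import pvariance

def low_variance_filter(rows, threshold=1e-6):
    """Keep only the feature columns whose variance exceeds threshold."""
    cols = list(zip(*rows))
    keep = [i for i, col in enumerate(cols) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows], keep

rows = [
    [1.0, 24, 0.5],
    [1.0, 12, 0.7],
    [1.0, 36, 0.6],
]
reduced, kept = low_variance_filter(rows)
# Column 0 is constant (variance 0) and is dropped; columns 1 and 2 survive.
```

A supervised variant would instead rank features by their relationship to the yes/no label; a nonlinear one (e.g. kernel methods) would transform the features first.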
Choose data mining task
Do we apply rule-based methods for better understanding?
Do we apply k-nearest-neighbor methods for dense data sets?
Do we apply SVM methods for accuracy, at the cost of a black-box model?
Is the final (yes/no) result important, or the action (what a customer can do to reduce the likelihood of being rejected)?
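Because the case study says giving reasons behind a decision is important, the white-box option can be as simple as a hand-readable rule set that returns a decision together with its reason. The rules and thresholds below are invented for illustration, not derived from the actual data.

```python
# Minimal sketch of a white-box classifier: each decision comes with a
# human-readable reason. Rules and thresholds are made up for illustration.

def rule_classifier(applicant):
    """applicant: dict with a 'checking' code (Attribute 1) and 'amount' in DM."""
    if applicant["checking"] == "A14":      # no checking account on record
        return "reject", "no checking account on record"
    if applicant["amount"] > 10000:
        return "reject", "requested amount exceeds 10000 DM"
    return "approve", "passed all rules"

decision, reason = rule_classifier({"checking": "A12", "amount": 3000})
# -> ("approve", "passed all rules")
```

A black-box model (e.g. an SVM) might score higher on accuracy, but could not return the reason string that an applicant or regulator may demand.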
Use Algorithm to Perform Task
Which hardware platform should be used? Which software platform?
Are speed and scale more important than visual effects?
Are data porting issues important?
Is the API important, or only the final answer?
How much does each package cost?
Evaluation
Do we have separate training and testing data?
Is data scarce? What kind of cross-validation do we use?
N-fold, with N = ? Bootstrapping or not?
Is ranking important (lift, ROC), or is the confusion matrix important?
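The N-fold question can be made concrete with a plain fold splitter and a confusion matrix; both below are minimal sketches, not a real evaluation library, and the label sequences are invented.

```python
# Minimal sketch: split indices into N folds for cross-validation, and
# tabulate a confusion matrix over yes/no loan decisions.

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds (lists of test indices)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def confusion_matrix(y_true, y_pred):
    """Counts of (actual, predicted) pairs over the labels 'yes'/'no'."""
    m = {(t, p): 0 for t in ("yes", "no") for p in ("yes", "no")}
    for t, p in zip(y_true, y_pred):
        m[(t, p)] += 1
    return m

folds = k_fold_indices(10, 5)  # N = 5: five test folds of two records each
cm = confusion_matrix(["yes", "no", "yes", "no"], ["yes", "yes", "yes", "no"])
# cm[("no", "yes")] is the count of false approvals, usually the costly mistake.
```

For a scarce dataset like the 1000-record German credit data, a larger N (or bootstrapping) reuses the data more aggressively at the price of more training runs.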
Interpretation
What do the results mean?
Do we need to support a causal interpretation of the final decisions?
Do we need to go back to experts in the application domain?
Do we need visual effects, or a ranking of the final results?
Iteration
After obtaining one set of results, do we need to return to the beginning to revise our objectives and obtain new data?
How many iterations are needed? Is the process one-shot or continuous?
Deployment Issues
Do we need to integrate with a real online banking system?
Do we need to provide API for the software?
Do we need to use new data to supplement training data set? If so, how often?
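For the last question, one common pattern is a sliding window over labelled production data: each retraining cycle folds in newly labelled examples and lets the oldest fall out. The class below is a hypothetical sketch; the window size and schedule are assumptions, not from the slides.

```python
# Hypothetical sketch: keep the most recent labelled examples so the
# training set tracks current applicants. Window size is an assumption.
from collections import deque

class SlidingTrainingSet:
    """Retain at most max_size labelled examples for periodic retraining."""
    def __init__(self, initial, max_size=1000):
        self.data = deque(initial, maxlen=max_size)

    def add_batch(self, new_examples):
        self.data.extend(new_examples)  # oldest examples fall out automatically

ts = SlidingTrainingSet([("x1", "yes"), ("x2", "no")], max_size=3)
ts.add_batch([("x3", "yes"), ("x4", "no")])
# The window now holds the three most recent examples; ("x1", "yes") has dropped out.
```

How often `add_batch` runs is the "if so, how often?" question above: too rarely and the model drifts from current applicants, too often and deployment cost rises.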