Methodology
Qiang Yang, MTM521 Material
A High-level Process View for Data Mining
1. Develop an understanding of the application, set goals, and lay down all questions a user might pose as queries
2. Create a dataset for study (from a data warehouse, Web site, or surveys)
3. Data cleaning and preprocessing
4. Data reduction and projection
5. Choose the data mining task: black-box or white-box? Classification or clustering?
6. Choose the data mining algorithms
7. Use the algorithms to perform the task
8. Interpret, evaluate, cross-validate, and iterate through steps 1-7 if necessary
9. Deploy: integrate into operational systems, gather feedback, revise goals, and redo steps 1-9
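The iterate-until-satisfied loop in the steps above can be sketched in code. Everything below is a toy illustration: the step functions are invented placeholders, not part of any real data mining library.

```python
# Toy sketch of the iterative process above; every step function here is
# an invented placeholder, not a real data mining API.

def create_dataset(source):               # step 2: gather raw records
    return list(source)

def clean(data):                          # step 3: drop missing values
    return [x for x in data if x is not None]

def project(data):                        # step 4: reduction/projection (no-op here)
    return data

def learn(data):                          # steps 5-7: toy "model" = mean threshold
    threshold = sum(data) / len(data)
    return lambda x: x >= threshold

def evaluate(model, data):                # step 8: fraction predicted positive
    return sum(model(x) for x in data) / len(data)

def run_process(source, target=0.5, max_iters=3):
    """Iterate steps 2-8 until the evaluation meets the goal, then hand off (step 9)."""
    for _ in range(max_iters):
        data = project(clean(create_dataset(source)))
        model = learn(data)
        score = evaluate(model, data)
        if score >= target:
            break                         # otherwise iterate through the steps again
    return model, score

model, score = run_process([3, None, 5, 7, 1])
```

The point of the sketch is the control flow: evaluation (step 8) feeds back into earlier steps until the goal is met.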
Case Study: German Bank Credit
Application: bank credit assessment
Decision: whether or not to approve a loan
Usage: automatic online screening, or as a human assistant
Objective: accurate prediction of values; giving the reasons behind a decision is important
Potential Queries
Who is likely to be approved for a loan?
What are the most important characteristics of an applicant to look at?
What are the most indicative features for yes/no answers?
What subset of customers should be marketed to, and what is the associated profit?
Added: what advice can be given to an applicant to improve their chances in the future?
Create Data Set for Study
Access the bank's data warehouse or conduct a customer survey.
Must the cost of obtaining the data be factored in?
How likely is it that quality data can be obtained in a limited amount of time?
Questions to be Asked
Attribute 1 (qualitative): Status of existing checking account
A11: ... < 0 DM
A12: 0 <= ... < 200 DM
A13: ... >= 200 DM / salary assignments for at least 1 year
A14: no checking account

Attribute 2 (numerical): Duration in months

Attribute 3 (qualitative): Credit history
A30: no credits taken / all credits paid back duly
A31: all credits at this bank paid back duly
A32: existing credits paid back duly till now
A33: delay in paying off in the past
A34: critical account / other credits existing (not at this bank)

Attribute 4 (qualitative): Purpose
A40: car (new)
A41: car (used)
A42: furniture/equipment
A43: radio/television
A44: domestic appliances
A45: repairs
A46: education
A47: (vacation - does not exist?)
A48: retraining
A49: business
A410: others

Attribute 5 (numerical): Credit amount

Attribute 6 (qualitative): Savings account/bonds
A61: ... < 100 DM
A62: 100 <= ... < 500 DM
A63: 500 <= ... < 1000 DM
A64: ... >= 1000 DM
A65: unknown / no savings account
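As a worked example of preparing these attributes for an algorithm, qualitative codes such as A11-A65 are commonly one-hot encoded, while numerical attributes pass through unchanged. The sketch below is illustrative only; the category lists are copied from the attribute definitions above.

```python
# Minimal sketch: one-hot encode the qualitative codes defined above so
# distance- and model-based methods can consume them.

CHECKING = ["A11", "A12", "A13", "A14"]         # Attribute 1: checking account status
SAVINGS = ["A61", "A62", "A63", "A64", "A65"]   # Attribute 6: savings account/bonds

def one_hot(code, categories):
    """Map a qualitative code to a 0/1 indicator vector."""
    return [1 if code == c else 0 for c in categories]

# Example applicant: no checking account (A14), savings < 100 DM (A61),
# 24-month duration, 3000 DM credit amount (numerical values pass through).
row = one_hot("A14", CHECKING) + one_hot("A61", SAVINGS) + [24, 3000]
# row == [0, 0, 0, 1, 1, 0, 0, 0, 0, 24, 3000]
```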
Data Cleaning and Preprocessing:
What should be done with missing values?
How do we fill in missing values, and identify and correct incorrect values?
Do we know the cost of classification mistakes?
Do we know the cost of obtaining each feature?
How do we reduce noise? What are the sources of noise for each attribute?
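Two standard answers to the missing-value question are a mean fill for numerical attributes and a mode fill for qualitative ones. The sketch below is a minimal illustration; the sample values are invented.

```python
# Minimal sketch: impute missing (None) values with the mean for
# numerical attributes and the most frequent code for qualitative ones.
from statistics import mean, mode

def impute_numeric(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_qualitative(values):
    """Replace None with the most frequent observed code."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]

durations = impute_numeric([24, None, 12, 36])               # mean of 24, 12, 36 is 24
histories = impute_qualitative(["A32", None, "A32", "A34"])  # most frequent code is "A32"
```

Which fill is appropriate depends on the cost questions above: if mistakes on an attribute are expensive, a richer imputation (or dropping the record) may be warranted.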
Rudimentary Analysis
What is the data distribution?
How can you view the data from different angles?
What does the rudimentary data analysis tell you? Are you satisfied with the analysis?
Are there more queries that you cannot answer through this analysis?
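A first pass at these questions is frequency counts for qualitative attributes and summary statistics for numerical ones. The sample values below are made up purely for illustration.

```python
# Minimal sketch of rudimentary analysis: a frequency table for one
# qualitative attribute and summary statistics for one numerical attribute.
from collections import Counter
from statistics import mean, median

purposes = ["A40", "A43", "A40", "A46", "A43", "A40"]  # Attribute 4 (made-up sample)
amounts = [1000, 2500, 750, 4000, 1200, 900]           # Attribute 5 (made-up sample)

distribution = Counter(purposes)  # how common is each purpose?
summary = {
    "min": min(amounts),
    "median": median(amounts),
    "mean": mean(amounts),
    "max": max(amounts),
}
```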
Data reduction
How many data features do we want in the end?
Is it a data reduction problem or data transformation problem?
Is it supervised data reduction or unsupervised data reduction problem?
Is it linear data reduction or nonlinear data reduction problem?
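One concrete answer to the linear, unsupervised case above is a low-variance filter: drop feature columns that barely vary, since they carry little information. This is a minimal sketch; the threshold and data are invented for illustration.

```python
# Minimal sketch of unsupervised, linear data reduction: drop feature
# columns whose variance falls below a (made-up) threshold.
from statistics import pvariance

def low_variance_filter(rows, threshold=1e-6):
    """Keep only the feature columns whose variance exceeds threshold."""
    cols = list(zip(*rows))
    keep = [i for i, col in enumerate(cols) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows], keep

rows = [
    [1.0, 24, 0.5],
    [1.0, 12, 0.7],
    [1.0, 36, 0.6],
]
reduced, kept = low_variance_filter(rows)
# Column 0 is constant (variance 0) and is dropped; columns 1 and 2 survive.
```

A supervised variant would instead rank features by their relationship to the yes/no label; a nonlinear one (e.g. kernel methods) would transform the features first.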
Choose data mining task
Do we apply rule-based methods for better understanding?
Do we apply k-nearest-neighbor methods for dense data sets?
Do we apply SVM methods for accuracy, at the cost of a black-box model?
Is the final (yes/no) result important, or the action (what a customer can do to reduce the likelihood of being rejected)?
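Because the case study says giving reasons behind a decision is important, the white-box option can be as simple as a hand-readable rule set that returns a decision together with its reason. The rules and thresholds below are invented for illustration, not derived from the actual data.

```python
# Minimal sketch of a white-box classifier: each decision comes with a
# human-readable reason. Rules and thresholds are made up for illustration.

def rule_classifier(applicant):
    """applicant: dict with a 'checking' code (Attribute 1) and 'amount' in DM."""
    if applicant["checking"] == "A14":      # no checking account on record
        return "reject", "no checking account on record"
    if applicant["amount"] > 10000:
        return "reject", "requested amount exceeds 10000 DM"
    return "approve", "passed all rules"

decision, reason = rule_classifier({"checking": "A12", "amount": 3000})
# -> ("approve", "passed all rules")
```

A black-box model (e.g. an SVM) might score higher on accuracy, but could not return the reason string that an applicant or regulator may demand.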
Use Algorithm to Perform Task
Which hardware platform should be used? Which software platform?
Are speed and scale more important than visual effects?
Are data porting issues important?
Is the API important, or only the final answer?
How much does each package cost?
Evaluation
Do we have separate training and testing data?
Is data scarce? What kind of cross-validation do we use?
N-fold, with N = ? Bootstrapping or not?
Is ranking important (lift, ROC), or is the confusion matrix important?
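The N-fold question can be made concrete with a plain fold splitter and a confusion matrix; both below are minimal sketches, not a real evaluation library, and the label sequences are invented.

```python
# Minimal sketch: split indices into N folds for cross-validation, and
# tabulate a confusion matrix over yes/no loan decisions.

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds (lists of test indices)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def confusion_matrix(y_true, y_pred):
    """Counts of (actual, predicted) pairs over the labels 'yes'/'no'."""
    m = {(t, p): 0 for t in ("yes", "no") for p in ("yes", "no")}
    for t, p in zip(y_true, y_pred):
        m[(t, p)] += 1
    return m

folds = k_fold_indices(10, 5)  # N = 5: five test folds of two records each
cm = confusion_matrix(["yes", "no", "yes", "no"], ["yes", "yes", "yes", "no"])
# cm[("no", "yes")] is the count of false approvals, usually the costly mistake.
```

For a scarce dataset like the 1000-record German credit data, a larger N (or bootstrapping) reuses the data more aggressively at the price of more training runs.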
Interpretation
What do the results mean?
Do we need to support a causal interpretation of the final decisions?
Do we need to go back to experts in the application domain?
Do we need visual effects, or a ranking of the final results?
Iteration
After obtaining one set of results, do we need to return to the beginning to revise our objectives and obtain new data?
How many iterations are needed? Is the process one-shot or continuous?
Deployment Issues
Do we need to integrate with a real online banking system?
Do we need to provide API for the software?
Do we need to use new data to supplement training data set? If so, how often?
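For the last question, one common pattern is a sliding window over labelled production data: each retraining cycle folds in newly labelled examples and lets the oldest fall out. The class below is a hypothetical sketch; the window size and schedule are assumptions, not from the slides.

```python
# Hypothetical sketch: keep the most recent labelled examples so the
# training set tracks current applicants. Window size is an assumption.
from collections import deque

class SlidingTrainingSet:
    """Retain at most max_size labelled examples for periodic retraining."""
    def __init__(self, initial, max_size=1000):
        self.data = deque(initial, maxlen=max_size)

    def add_batch(self, new_examples):
        self.data.extend(new_examples)  # oldest examples fall out automatically

ts = SlidingTrainingSet([("x1", "yes"), ("x2", "no")], max_size=3)
ts.add_batch([("x3", "yes"), ("x4", "no")])
# The window now holds the three most recent examples; ("x1", "yes") has dropped out.
```

How often `add_batch` runs is the "if so, how often?" question above: too rarely and the model drifts from current applicants, too often and deployment cost rises.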