An Introduction to Mining (2 3)avellido/teaching/11-12/IntroDM... · 2011-10-18 · Clustering, NNs...

DM22011/2012. Alfredo Vellido/

An Introduction to Mining (2‐3)

RECAP: What’s DATA MINING?: A procedural viewpoint

RECAP: What’s DATA MINING?: A historicist viewpoint

[Diagram labels: STATISTICS, PATTERN RECOGNITION, ARTIFICIAL INTELLIGENCE, EXPERT SYSTEMS, MACHINE LEARNING, DB MANAGEMENT, KDD, DM]

RECAP: CRISP‐DM: Methodology loop

CRISP: Phases: Problem understanding

PROBLEM UNDERSTANDING → DATA UNDERSTANDING → DATA PREPARATION → MODELLING → EVALUATION → IMPLEMENTATION

• DETERMINE PROBLEM GOAL:
– BACKGROUND
– PROBLEM GOALS
– SUCCESS CRITERIA

• ASSESS SITUATION:
– INVENTORY OF RESOURCES
– REQUIREMENTS, ASSUMPTIONS, LIMITATIONS
– RISKS AND CONTINGENCIES
– TERMINOLOGY
– COSTS & BENEFITS

• DETERMINE DM GOALS:
– DM GOALS
– DM SUCCESS CRITERIA

• PRODUCE PROJECT PLAN:
– PROJECT PLAN
– INITIAL SELECTION OF TOOLS

CRISP: Phases: Data understanding



• OBTAIN INITIAL DATA:
– INITIAL DATA REPORT

• DATA DESCRIPTION:
– DATA DESCRIPTIVE REPORT

• DATA EXPLORATION:
– DATA EXPLORATION REPORT

• DATA QUALITY VERIFICATION:
– DATA QUALITY REPORT

METROFANG: a real story about data understanding (1)

METROFANG: a real story about data understanding (2)

[Figure: two time series plots. “caudal entrada” (inlet flow), annotated: Missing data, Stationality, Outliers, Time Series. “Par motor Secador A” (motor torque, dryer A), annotated: Weekend?, FORUM???]
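Two of the issues annotated on these plots, missing data and outliers, can be screened for before any modelling. What follows is a minimal sketch in Python, not the procedure used in METROFANG: the readings are invented, `None` is assumed to mark a missing sample, and the 3.5 cut-off on the median-based modified z-score is a common rule of thumb rather than anything stated in the slides.

```python
from statistics import median

def screen_series(values, threshold=3.5):
    """Return (missing_idx, outlier_idx) for a list of readings.

    None marks a missing sample. Outliers are flagged with the modified
    z-score 0.6745 * |x - median| / MAD, which is not distorted by the
    outliers themselves. The 3.5 threshold is an assumption.
    """
    missing_idx = [i for i, v in enumerate(values) if v is None]
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    outlier_idx = []
    if mad > 0:
        for i, v in enumerate(values):
            if v is not None and 0.6745 * abs(v - med) / mad > threshold:
                outlier_idx.append(i)
    return missing_idx, outlier_idx

# Hypothetical readings, invented for illustration only.
readings = [250.0, 252.0, None, 249.0, 251.0, 0.0, 250.5, None, 248.0, 350.0]
missing, outliers = screen_series(readings)
```

Whether a flagged value is a sensor fault or a real event (a weekend stop, for instance) is a question for the domain expert, which is the point of the data understanding phase.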

Storing data (’07)

Poll

What did you use for data storage for significant data mining projects in the past year: [142 voters, 284 votes]

Text files (e.g. tab or comma delim) (75) 52.8%
Data mining system format (SAS, SPSS, arff) (57) 40.1%
Excel (28) 19.7%
Oracle (25) 17.6%
SQL Server (15) 10.6%
mySQL (12) 8.5%
other format (10) 7.0%
other commercial DBMS (7) 4.9%
other free DBMS (4) 2.8%
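The percentages in this poll add up to well over 100% because each voter could pick several options: each share is taken over the 142 voters, not over the votes. A quick check in Python, using only the counts listed in the poll:

```python
voters = 142  # number of respondents reported in the poll header

# Vote counts as listed on the slide.
counts = {
    "Text files": 75,
    "Data mining system format": 57,
    "Excel": 28,
    "Oracle": 25,
    "SQL Server": 15,
    "mySQL": 12,
    "other format": 10,
    "other commercial DBMS": 7,
    "other free DBMS": 4,
}

# Share of voters choosing each option, in percent with one decimal.
shares = {name: round(100 * n / voters, 1) for name, n in counts.items()}
```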

CRISP: Phases: Data preparation


• DATA SELECTION:
– ARGUMENTS FOR SELECTION

• DATA CLEANING:
– DATA CLEANING REPORT

• RECONSTRUCT DATA:
– DERIVED VARIABLES
– GENERATED OBSERVATIONS

• INTEGRATE DATA:
– INTEGRATED DATA

• DATA FORMATTING:
– DATA WITH NEW FORMAT

Is data preparation that important?

What % of time in your data mining project(s) is spent on data cleaning and preparation [187 votes total]

over 80% (46) 25%
61 to 80% (73) 39%
41 to 60% (46) 25%
21 to 40% (7) 4%
20% or less (15) 8%

Data manipulation tools … (’07)

How large is it? … (’06 → ’09)

Largest database or dataset you data-mined was [181 votes total]

less than 1 MB (5) 3%
1.1 to 10 MB (11) 6%
11 to 100 MB (27) 15%
101 MB to 1 GB (22) 12%
1.1 to 10 GB (45) 25%
11 to 100 GB (22) 12%
101 GB to 1 Terabyte (28) 15%
over 1 Terabyte (21) 12%

CRISP: Phases: Modelling



• SELECT MODELING TECHNIQUE:
– SELECTED TECHNIQUE

• CREATE TEST DESIGN:
– TEST DESIGN

• BUILD MODEL:
– PARAMETER SELECTION
– MODEL
– MODEL DESCRIPTION

• VALIDATE MODEL:
– MODEL VALIDATION
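The modelling tasks above (test design, model building, validation) can be illustrated with a very small sketch. Everything in it is an assumption made for illustration: the one-variable dataset is invented, the 70/30 holdout split is just one possible test design, and a 1-nearest-neighbour classifier stands in for whatever technique was actually selected.

```python
import random

def holdout_split(rows, test_fraction=0.3, seed=0):
    """Test design: shuffle once, then hold out a fraction for validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def predict_1nn(train, x):
    """Model: label of the closest training point (1-nearest neighbour)."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train, test):
    """Validation: fraction of held-out rows predicted correctly."""
    hits = sum(predict_1nn(train, x) == label for x, label in test)
    return hits / len(test)

# Hypothetical one-variable dataset: (age, customer label), invented here.
data = [(a, "loyal" if a > 45 else "other") for a in range(20, 70)]
train, test = holdout_split(data)
acc = accuracy(train, test)
```

The key design choice is that the test rows are set aside before the model sees any data, so `acc` estimates performance on unseen cases rather than on the training set.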

CRISP: A typology of DM problems

PROBLEM: DATA SUMMARY and DESCRIPTION
DESCRIPTION: Compact and aggregated data description. Exploratory Analysis
EXAMPLES: Almost any problem includes some elements of data description
TECHNIQUES: ERPs, stats., OLAP, EIS, control dashboards

PROBLEM: SEGMENTATION
DESCRIPTION: Finding data groups (unsupervised). segm. / clust. / classif.
EXAMPLES: Market Segmentation, Shopping Basket analysis
TECHNIQUES: Clustering, NNs (SOM, GTM), visualization

PROBLEM: CONCEPTUAL DESCRIPTION
DESCRIPTION: Accessible and useful description of concepts / classes / groups. Knowledge comes first, then precision. Linked to classif. / segmentation
EXAMPLES: Description of customer groups according to loyalty. Rule segment profiling: if SEX=male and age>45 then CUST=loyal
TECHNIQUES: Rule Induction, Conceptual Clustering

PROBLEM: CLASSIFICATION
DESCRIPTION: Assumed that different items can be assigned to a given closed category (supervised)
EXAMPLES: Bankruptcy prediction, Credit Scoring
TECHNIQUES: Discriminant Analysis, Rule Induction, Decision Trees, NNs, C-B Reasoning, GAs

PROBLEM: PREDICTION (REGRESSION, FORECASTING)
DESCRIPTION: Continuous dependent variable. Given values of the predictive variables, predict the dependent variable (supervised)
EXAMPLES: Markets, company benefit pred., Market share forec.
TECHNIQUES: Regression Analysis, Regression Trees, NNs, Box-Jenkins, GAs

PROBLEM: DEPENDENCY ANALYSIS
DESCRIPTION: Looking for dependencies between variables (superv. or unsuperv.). Often with segmentation
EXAMPLES: Basket Analysis. Ex.: 30% of those who bought peanuts also bought beer …
TECHNIQUES: Correlation Analysis, Association Rules, Bayesian Networks, Inductive Logic Prog.
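The basket analysis statement "30% of those who bought peanuts also bought beer" is the confidence of the association rule peanuts → beer. A minimal sketch of how support and confidence are computed, assuming a hypothetical list of baskets built so that the rule reaches exactly that figure (the baskets are invented, not data from the slides):

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical baskets, built so that 3 of the 10 peanut buyers bought beer.
baskets = (
    [{"peanuts", "beer"}] * 3
    + [{"peanuts"}] * 7
    + [{"beer"}] * 5
    + [{"bread"}] * 5
)
conf = confidence(baskets, {"peanuts"}, {"beer"})
```

Support tells how often the items appear together at all (3 baskets in 20 here), while confidence conditions on the antecedent; association rule miners typically filter on both.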

CRISP: Selection of modeling techniques

• UNIVERSE OF TECHNIQUES (defined by tools)
• TECHNIQUES SUITED TO A PROBLEM
• POLITICAL REQUIREMENTS (business, executive)
• LIMITATIONS (data types, knowledge; money, time, human resources)
• SELECTED TOOL(S)

Commonly used models/techniques (‘05) …

Data mining/analytic techniques you use frequently: [784 votes total]

Decision Trees/Rules (107) 14%
Clustering (101) 13%
Regression (90) 11%
Statistics (80) 10%
Visualization (63) 8%
Neural Nets (61) 8%
Association rules (54) 7%
Nearest Neighbor (34) 4%
SVM (Support vector machine) (31) 4%
Bayesian (30) 4%
Sequence/Time series analysis (26) 3%
Boosting (25) 3%
Hybrid methods (23) 3%
Bagging (20) 3%
Genetic algorithms (19) 2%
Other (20) 3%

Commonly used models/techniques (‘07) …

CRISP: Phases: Evaluation


• EVALUATE RESULTS:
– EVALUATION OF DM RESULTS
– APPROVED MODELS

• REVISE PROCESSES:
– REVISION OF THE PROCESS

• DETERMINE NEXT STEPS:
– LIST OF POSSIBLE ACTIONS
– DECISIONS

CRISP: Phases: Deployment


• PLAN IMPLEMENTATION:
– IMPLEMENTATION PLAN

• PLAN MONITORING & MAINTENANCE:
– MONITORING & MAINTENANCE PLAN

• PRODUCE FINAL REPORT:
– FINAL REPORT
– FINAL PRESENTATION

• REVIEW PROJECT:
– DOCUMENTATION OF EXPERIENCE

How do you deploy it? (’06 → ’09)

How do you usually deploy data mining results? (Choose all that apply): [95 voters]

Publish research papers (37) 38.9%
Use findings to change business rules (42) 44.2%
===
Deploy in production and ... (46) 48.4%
Use data mining tool for scoring (47) 49.5%
Convert model to SQL (20) 21.1%
Convert model to another language (16) 16.8%
Convert model to C or Java (16) 16.8%
Convert model to PMML (4) 4.2%
===
Deploy in batch mode (48) 50.5%
Deploy in real-time mode (21) 22.1%
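As an illustration of what "Convert model to SQL" and "Deploy in batch mode" can mean in practice, the profiling rule from the typology slide (if SEX=male and age>45 then CUST=loyal) can be rewritten as a SQL expression and applied to a whole table at once. This is only a sketch under stated assumptions: the `customers` table, its columns, its rows and the `'unknown'` fallback label are all hypothetical, and SQLite is used simply because it ships with Python.

```python
import sqlite3

# Hypothetical customer table; names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "male", 52), (2, "female", 61), (3, "male", 30)],
)

# The profiling rule rewritten as a SQL CASE expression. The ELSE label is
# an assumption: the rule only says what happens when it fires.
SCORING_SQL = """
    SELECT id,
           CASE WHEN sex = 'male' AND age > 45 THEN 'loyal'
                ELSE 'unknown' END AS cust
    FROM customers
    ORDER BY id
"""

# Batch mode: score the whole table in a single pass.
scores = conn.execute(SCORING_SQL).fetchall()
conn.close()
```

A rule set or decision tree translates to SQL in this way fairly directly; models such as neural networks usually do not, which may be one reason scoring inside the data mining tool ranks higher in the poll.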

Software popularity (‘07)

Free vs. commercial: debate

Software popularity (→‘09)

Ouch! (→‘10)

A note on CRISP-DM 2.0

CRISP-2.0: Updating the Methodology

Why?

Many changes have occurred in the business application of data mining since CRISP‐DM 1.0 was published. Emerging issues and requirements include:

The availability of new types of data (text, Web, and attitudinal data, for example), along with new techniques for pre‐processing, analyzing, and combining them with related case data

Integration and deployment of results with operational systems such as call centers and Web sites

Far more demanding requirements for scalability and for deployment into real‐time environments

The need to package analytical tasks for non‐analytical end users and integrate these tasks in business workflows

The need to seamlessly integrate the deployment of results and closed‐loop feedback with existing business processes

The need to mine large‐scale databases in situ, rather than exporting an analytical dataset

Organizations’ increasing reliance on teams, making it important to educate greater numbers of people on the processes and best practices associated with data mining and predictive analytics

In July 2006 the consortium announced that it was going to start the process of working towards a second version of CRISP‐DM. On 26 September 2006, the CRISP‐DM SIG met to discuss potential enhancements for CRISP‐DM 2.0 and the subsequent roadmap. However, these efforts appear to be stalled. The SIG has not met, updated the CRISP website, or communicated anything to members since early 2007.

Resources

• Websites:
– www.kdnuggets.com
– http://the-data-mine.com
– www.sigkdd.org

• Free software:
– www.keel.es
– http://www.cs.waikato.ac.nz/ml/weka/
– http://rapid-i.com

Some bibliography available at books.google.com:

Data mining: practical machine learning tools and techniques

I.H. Witten, E. Frank (2005)

Data mining: concepts and techniques

J. Han, M. Kamber (2006)

Principles of data mining

D. J. Hand, H. Mannila, P. Smyth (2001)
