An Introduction to Mining (2 3)avellido/teaching/11-12/IntroDM... · 2011-10-18 · Clustering, NNs...
DM22011/2012. Alfredo Vellido/
An Introduction to Mining (2‐3)
RECAP: What’s DATA MINING? A procedural viewpoint
RECAP: What’s DATA MINING? A historicist viewpoint
[Diagram: fields related to DM and KDD: STATISTICS, PATTERN RECOGNITION, ARTIFICIAL INTELLIGENCE, EXPERT SYSTEMS, MACHINE LEARNING, DB MANAGEMENT]
RECAP: CRISP‐DM: Methodology loop
CRISP: Phases: Problem understanding
PROBLEM UNDERSTANDING | DATA UNDERSTANDING | DATA PREPARATION | MODELLING | EVALUATION | IMPLEMENTATION
DETERMINE PROBLEM GOAL: BACKGROUND; PROBLEM GOALS; SUCCESS CRITERIA
ASSESS SITUATION: INVENTORY OF RESOURCES; REQUIREMENTS, ASSUMPTIONS, LIMITATIONS; RISKS & CONTINGENCIES; TERMINOLOGY; COSTS & BENEFITS
DETERMINE DM GOALS: DM GOALS; DM SUCCESS CRITERIA
PRODUCE PROJECT PLAN: PROJECT PLAN; INITIAL SELECTION OF TOOLS
CRISP: Phases: Data understanding
OBTAIN INITIAL DATA: INITIAL DATA REPORT
DESCRIBE DATA: DATA DESCRIPTION REPORT
EXPLORE DATA: DATA EXPLORATION REPORT
VERIFY DATA QUALITY: DATA QUALITY REPORT
METROFANG: a real story about data understanding (1)
METROFANG: a real story about data understanding (2)
[Figure: time series of inflow rate ("caudal entrada"), samples 1 to 17671, annotated: Missing data, Seasonality, Outliers]
[Figure: time series of motor torque, dryer A ("Par motor Secador A"), samples 1 to 17671, annotated: Weekend? FORUM???]
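The issues annotated on the METROFANG plots (missing data, outliers) are the kind of thing a data quality report would quantify. The following is a minimal sketch, not the procedure used in that project: it assumes the readings arrive as a list of floats with `None` marking a missing sample, and it flags outliers with a median/MAD rule. The function name `quality_report`, the threshold of 3.5 and the sample values are all illustrative assumptions.

```python
from statistics import median


def quality_report(series, threshold=3.5):
    """Count missing readings and flag outliers with a median/MAD rule."""
    present = [(i, x) for i, x in enumerate(series) if x is not None]
    values = [x for _, x in present]
    med = median(values)
    mad = median(abs(x - med) for x in values)  # median absolute deviation
    outliers = []
    if mad > 0:
        for i, x in present:
            # 0.6745 rescales the MAD so the score is comparable to a z-score
            score = 0.6745 * (x - med) / mad
            if abs(score) > threshold:
                outliers.append(i)
    return {
        "n": len(series),
        "missing": len(series) - len(values),
        "outlier_indices": outliers,
    }


# Hypothetical readings: one gap (None) and one spike (350.0)
readings = [200.0, 205.0, None, 198.0, 202.0, 350.0, 201.0, 199.0]
report = quality_report(readings)
print(report)
```

A median-based rule is used here because a single large spike distorts the mean and standard deviation that a plain z-score relies on.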
Storing data (’07)
Poll
What did you use for data storage for significant data mining projects in the past year: [142 voters, 284 votes]
Text files (e.g. tab or comma delim) (75) 52.8%
Data mining system format (SAS, SPSS, arff) (57) 40.1%
Excel (28) 19.7%
Oracle (25) 17.6%
SQL Server (15) 10.6%
mySQL (12) 8.5%
other format (10) 7.0%
other commercial DBMS (7) 4.9%
other free DBMS (4) 2.8%
CRISP: Phases: Data preparation
DATA SELECTION: ARGUMENTS FOR SELECTION
DATA CLEANING: DATA CLEANING REPORT
RECONSTRUCT DATA: DERIVED VARIABLES; GENERATED OBSERVATIONS
INTEGRATE DATA: INTEGRATED DATA
DATA FORMATTING: DATA WITH NEW FORMAT
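The preparation tasks listed above (cleaning, deriving variables, integrating sources) can be sketched on toy records. This is only an illustration under stated assumptions: the two input tables, every field name (`id`, `age`, `customer_id`, `amount`) and the `is_senior` variable are hypothetical, and dropping incomplete rows is just one of several possible cleaning policies.

```python
def prepare(customers, purchases):
    """Clean, integrate and derive variables from two toy tables."""
    # Data cleaning: drop customer records with a missing age
    clean = [c for c in customers if c.get("age") is not None]

    # Data integration: total spend per customer from the purchases table
    spend = {}
    for p in purchases:
        spend[p["customer_id"]] = spend.get(p["customer_id"], 0.0) + p["amount"]

    prepared = []
    for c in clean:
        prepared.append({
            "id": c["id"],
            "age": c["age"],
            "total_spend": spend.get(c["id"], 0.0),  # integrated variable
            "is_senior": c["age"] > 45,              # derived variable
        })
    return prepared


customers = [
    {"id": 1, "age": 52},
    {"id": 2, "age": None},  # removed by the cleaning step
    {"id": 3, "age": 30},
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.5},
    {"customer_id": 3, "amount": 12.0},
]
table = prepare(customers, purchases)
print(table)
```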
Is data preparation that important?
What % of time in your data mining project(s) is spent on data cleaning and preparation [187 votes total]
over 80% (46) 25%
61 to 80% (73) 39%
41 to 60% (46) 25%
21 to 40% (7) 4%
20% or less (15) 8%
Data manipulation tools … (’07)
How large is it? … (’06 → ‘09)
Largest database or dataset you data-mined was [181 votes total]
less than 1 MB (5) 3%
1.1 to 10 MB (11) 6%
11 to 100 MB (27) 15%
101 MB to 1 GB (22) 12%
1.1 to 10 GB (45) 25%
11 to 100 GB (22) 12%
101 GB to 1 Terabyte (28) 15%
over 1 Terabyte (21) 12%
CRISP: Phases: Modelling
SELECT MODELING TECHNIQUE: SELECTED TECHNIQUE
CREATE TEST DESIGN: TEST DESIGN
BUILD MODEL: PARAMETER SELECTION; MODEL; MODEL DESCRIPTION
VALIDATE MODEL: MODEL VALIDATION
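One common way to make the test design step concrete is k-fold cross-validation, where each observation is held out for testing exactly once. The sketch below is an assumption about what a test design could look like, not something the slides prescribe; the function name `k_fold_indices` and the values of `n_samples`, `k` and `seed` are illustrative.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with step k gives k disjoint folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so parameter selection and model validation can be compared on the same partitions.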
CRISP: A typology of DM problems
(Each entry: PROBLEM, DESCRIPTION, EXAMPLES, TECHNIQUES)

DATA SUMMARY and DESCRIPTION
Description: Compact and aggregated data description. Exploratory Analysis.
Examples: Almost any problem includes some elements of data description.
Techniques: ERPs, stats., OLAP, EIS, control dashboards.

SEGMENTATION
Description: Finding data groups (unsupervised). segm / clust / classif.
Examples: Market Segmentation, Shopping Basket analysis.
Techniques: Clustering, NNs (SOM, GTM), visualization.

CONCEPTUAL DESCRIPTION
Description: Accessible and useful description of concepts / classes / groups. Knowledge comes first, then precision. Linked to classif / segmentation.
Examples: Description of customer groups according to loyalty. Rule segment profiling: if SEX=male and age>45 then CUST=loyal.
Techniques: Rule Induction, Conceptual Clustering.

CLASSIFICATION
Description: Assumed that different items can be assigned to a given closed category (supervised).
Examples: Bankruptcy prediction, Credit Scoring.
Techniques: Discriminant Analysis, Rule Induction, Decision Trees, NNs, C-B Reasoning, GAs.

PREDICTION (REGRESSION, FORECASTING)
Description: Continuous dependent variable. Given values of the predictive variables, predict its value (supervised).
Examples: Markets, company benefit pred., Market share forec.
Techniques: Regression Analysis, Regression Trees, NNs, Box-Jenkins, GAs.

DEPENDENCY ANALYSIS
Description: Looking for dependencies between variables (superv. or unsuperv.). Often with segmentation.
Examples: Basket Analysis. Ex.: 30% of those who bought peanuts also bought beer …
Techniques: Correlation Analysis, Association Rules, Bayesian Networks, Inductive Logic Prog.
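The basket-analysis statement in the dependency analysis entry ("30% of those who bought peanuts also bought beer") is the confidence of an association rule. A minimal sketch of how support and confidence are computed follows; the baskets are made-up toy data and `rule_stats` is a hypothetical helper name, so the resulting figures are not the 30% quoted on the slide.

```python
def rule_stats(baskets, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    # Support: share of all baskets that contain both items
    support = len(with_both) / len(baskets)
    # Confidence: share of antecedent baskets that also contain the consequent
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Toy transactions (illustrative only)
baskets = [
    {"peanuts", "beer"},
    {"peanuts", "chips"},
    {"beer", "chips"},
    {"peanuts", "beer", "chips"},
    {"milk"},
]
support, confidence = rule_stats(baskets, "peanuts", "beer")
print(f"support={support:.2f} confidence={confidence:.2f}")
```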
CRISP: Selection of modeling techniques
UNIVERSE OF TECHNIQUES (defined by tools)
TECHNIQUES SUITED TO A PROBLEM
POLITICAL REQUIREMENTS (business, executive)
LIMITATIONS: data types, knowledge; money, time, human resources
SELECTED TOOL(S)
Commonly used models/techniques (‘05)…
Data mining/analytic techniques you use frequently: [784 votes total]
Decision Trees/Rules (107) 14%
Clustering (101) 13%
Regression (90) 11%
Statistics (80) 10%
Visualization (63) 8%
Neural Nets (61) 8%
Association rules (54) 7%
Nearest Neighbor (34) 4%
SVM (Support vector machine) (31) 4%
Bayesian (30) 4%
Sequence/Time series analysis (26) 3%
Boosting (25) 3%
Hybrid methods (23) 3%
Bagging (20) 3%
Genetic algorithms (19) 2%
Other (20) 3%
Commonly used models/techniques (‘07)…
CRISP: Phases: Evaluation
EVALUATE RESULTS: EVALUATION OF DM RESULTS; APPROVED MODELS
REVIEW PROCESS: REVIEW OF THE PROCESS
DETERMINE NEXT STEPS: LIST OF POSSIBLE ACTIONS; DECISIONS
CRISP: Phases: Deployment
PLAN IMPLEMENTATION: IMPLEMENTATION PLAN
PLAN MONITORING & MAINTENANCE: MONITORING & MAINTENANCE PLAN
PRODUCE FINAL REPORT: FINAL REPORT; FINAL PRESENTATION
REVIEW PROJECT: DOCUMENTATION OF EXPERIENCE
How do you deploy it? (’06 → ’09)
How do you usually deploy data mining results? (Choose all that apply): [95 voters]
Publish research papers (37) 38.9%
Use findings to change business rules (42) 44.2%
===
Deploy in production and ... (46) 48.4%
Use data mining tool for scoring (47) 49.5%
Convert model to SQL (20) 21.1%
Convert model to another language (16) 16.8%
Convert model to C or Java (16) 16.8%
Convert model to PMML (4) 4.2%
===
Deploy in batch mode (48) 50.5%
Deploy in real-time mode (21) 22.1%
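To give an idea of what the "Convert model to SQL" option can mean in practice, the sketch below renders a single if-then rule, such as the segment-profiling rule "if SEX=male and age>45 then CUST=loyal", as a SQL CASE expression. It is an assumption-laden illustration: `rule_to_sql` is a hypothetical helper, and the table name `customers`, the column names and the `unknown` default label are invented for the example.

```python
def rule_to_sql(conditions, label, default, table, target="cust"):
    """Render one if-then rule as a SQL CASE expression."""
    where = " AND ".join(conditions)
    return (
        f"SELECT *, CASE WHEN {where} THEN '{label}' "
        f"ELSE '{default}' END AS {target} FROM {table}"
    )


query = rule_to_sql(
    conditions=["sex = 'male'", "age > 45"],  # antecedent of the rule
    label="loyal",                            # consequent of the rule
    default="unknown",                        # hypothetical fallback label
    table="customers",                        # hypothetical table name
)
print(query)
```

The string is built by plain interpolation, which is only acceptable here because the conditions come from a trusted model export and not from user input.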
Software popularity (‘07)
Free vs. commercial: debate
Software popularity (→‘09)
Ouch! (→‘10)
A note on CRISP-DM 2.0
CRISP-2.0: Updating the Methodology
Why?
Many changes have occurred in the business application of data mining since CRISP‐DM 1.0 was published. Emerging issues and requirements include:
The availability of new types of data (text, Web, and attitudinal data, for example) along with new techniques for pre‐processing, analyzing, and combining them with related case data
Integration and deployment of results with operational systems such as call centers and Web sites
Far more demanding requirements for scalability and for deployment into real‐time environments
The need to package analytical tasks for non‐analytical end users and integrate these tasks in business workflows
The need to seamlessly integrate the deployment of results and closed‐loop feedback with existing business processes
The need to mine large‐scale databases in situ, rather than exporting an analytical dataset
Organizations’ increasing reliance on teams, making it important to educate greater numbers of people on the processes and best practices associated with data mining and predictive analytics
In July 2006 the consortium announced that it was going to start the process of working towards a second version of CRISP‐DM. On 26 September 2006, the CRISP‐DM SIG met to discuss potential enhancements for CRISP‐DM 2.0 and the subsequent roadmap. However, these efforts appear to be stalled. The SIG has not met, updated the CRISP website, or communicated anything to members since early 2007.
Resources
• Webs:
– www.kdnuggets.com
– http://the-data-mine.com
– www.sigkdd.org
• Free software:
– www.keel.es
– http://www.cs.waikato.ac.nz/ml/weka/
– http://rapid-i.com
Some bibliography available at books.google.com:
Data mining: practical machine learning tools and techniques
I.H. Witten, E. Frank (2005)
Data mining: concepts and techniques
J. Han, M. Kamber (2006)
Principles of data mining
D. J. Hand, H. Mannila, P. Smyth (2001)